HAVIC Pilot Transcription
Linguistic Data Consortium

Authors: Jennifer Tracey, Stephanie Strassel, Amanda Morris, Xuansong Li,
Brian Antonishek, Jonathan G. Fiscus

1. Introduction

To advance multimodal event detection and related technologies, the Linguistic Data Consortium (LDC), in collaboration with the National Institute of Standards and Technology (NIST), has developed a large, heterogeneous, annotated multimodal corpus for the HAVIC (Heterogeneous Audio Visual Internet Collection) program. The corpus, an ongoing effort, consists of user-generated videos whose content occurs in the audio, the video, and text embedded in the video. It covers the multi-dimensional variation inherent in user-generated video content, including variable camera motion, subject topicality, low and high video resolution and compression quality, competing background noise, spontaneous and concurrent speech, far-field speech, multiple languages, and so on. The data is used to train, test, and evaluate multimedia systems.

The HAVIC data has been used for the Multimedia Event Detection (MED) task in TRECVID (TREC Video Retrieval Evaluation) for several years, beginning with MED-10 (the MED task for 2010). The TREC (Text REtrieval Conference) conference series is sponsored by NIST with additional support from other U.S. government agencies. The MED task aims to develop and evaluate core multimedia detection systems that can quickly and accurately search a multimedia collection for user-defined events involving a person interacting with another person or object. The data developed for the MED task comprises videos of various events (called event videos) as well as videos completely unrelated to those events (called background videos). Each collected event video is manually annotated with a set of judgments describing its event properties and other salient features. Each background video is labeled with topic and genre categories. The data was previously available only to HAVIC/MED performers and is now accessible to the general public via the LDC catalog.

This corpus supports a transcription pilot experiment for the HAVIC project, whose goal was to produce verbatim transcripts in the quick rich transcription (QRTR) style for English speech audio extracted from YouTube videos. It contains the pilot transcripts for MED-11 video files, along with the associated videos. The data was originally distributed to MED performers as LDC2012E08.

2. Data Selection

Videos for transcription were selected by NIST from the MED DEV corpus according to the following proportions (given a target total of 100 hours):

-- 20% of the videos contain positive instances of the 5 training events, for a total of 20 hours across the 5 events (approximately 4 hours / 80 clips per event).
-- 80% of the videos are background videos, for a total of 80 hours of videos containing no events related to the 5 training events.
-- Selection was not based on the language annotations (English/non-English).
-- LDC processes up to 3 continuous minutes per clip. For clips longer than 3 minutes, the transcribed segment is chosen at random (see the sketch following the data profile below).
-- During the transcription process, clips can be rejected or abandoned for reasons including poor audio quality, non-English language, etc.

3. Data Profile

Language | File Total | Duration (hours)
---------|------------|-----------------
English  | 2395       | 72.6
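As a minimal illustration of the clip-length rule in section 2, the sketch below picks a random 3-minute transcription window for clips longer than 3 minutes. This is not LDC's actual selection code; the function name and the use of Python's random module are this example's own.

    import random

    MAX_WINDOW = 180.0  # LDC processes up to 3 continuous minutes per clip

    def pick_transcription_window(clip_duration):
        """Return (start, end) offsets in seconds of the segment to transcribe."""
        if clip_duration <= MAX_WINDOW:
            # Clips of 3 minutes or less are processed in full.
            return 0.0, clip_duration
        # For longer clips, choose a random 3-minute window (per section 2).
        start = random.uniform(0.0, clip_duration - MAX_WINDOW)
        return start, start + MAX_WINDOW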
4. Annotation

4.1 Transcription Approach

LDC annotators are not required to view the video data as an aid to transcribing the audio data. Transcription is done using LDC's transcription tool, XTrans, which shows the audio waveform and the transcript text in separate panes of one window on the console screen; the video file is presented in a separate window (e.g. using a typical browser tool). The video is used to disambiguate speaker IDs and to help transcribers deal with difficult regions in the audio. General practice is to view the video once before beginning transcription and to refer back to it as needed while transcribing.

We target a quick rich transcription standard in which all content words are transcribed accurately. Given the difficulty of this data set, transcribers make frequent use of the (( )) convention, which denotes unintelligible speech by an identified speaker. Utterances by each speaker are marked distinctly in the transcript text display, and unique speaker IDs are assigned to the best of the transcriber's ability. When transcribers can identify the presence of human speech but neither the speaker ID nor the content of the speech can be determined (for example, background speech in a crowd), the speaker ID "unintelligible" is assigned to the segment and no content is transcribed.

Additionally, 3 independent noise tracks are annotated (these appear as speaker IDs in the TDF transcript):

- singing: segments of singing are marked with begin and end times. No other features are annotated for this track.
- music: segments of instrument-based music are marked with begin and end times. No other features are annotated for this track.
- noise: segments of non-vocally generated noise, excluding music, are marked with begin and end times. No other features are annotated for this track.

Segments having different "track" labels can have overlapping time-stamps (e.g. a segment containing an utterance by "speaker1" can overlap with a segment containing "music" or "noise").

Note: Transcription of a clip can be rejected or abandoned for reasons including poor audio quality, non-English language, etc.

4.2 Quality Control

As a pilot effort, the transcription task was designed as a single-pass pipeline: each file was transcribed by a single annotator, with no corpus-wide second pass. A sample of files from each transcriber (5-10%) was spot-checked throughout the annotation effort. Spot-checks looked for errors such as missing transcription, improper use of mark-up conventions, and poor segmentation, as well as any transcribed content that was clearly incorrect. Given the difficulty of the data and the frequent use of the (( )) notation for transcriber uncertainty, review of the transcribed content focused on missing or added words and on words that were clearly audible but mistranscribed. Automatic checks looked for malformed mark-up notations and for misspellings of the manually-entered speaker ID "unintelligible".

5. Corpus Structure

The corpus content is organized as follows:

data/
  transcript/  -- contains 2395 *.tdf text files
  video/       -- contains 2395 *.mp4 media files
docs/
  HAVIC-Transcription-guidelinesV1.7.pdf
  TDF_format.txt       -- description of the tdf file format
  data_file_md5s.txt   -- path names and MD5 checksums of data files
  transcript_info.tab  -- summary info for transcript files
  video_info.tab       -- summary info for video files

The files in data/transcript and data/video have matching file names (e.g. transcript/HVC001299.tdf matches video/HVC001299.mp4).
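The matching-name convention above, together with docs/data_file_md5s.txt, lends itself to a quick integrity check. The following is a minimal sketch, assuming the corpus root as the working directory; the checksum-file layout (MD5 signature and relative path separated by two spaces) is described in section 6 below.

    import hashlib
    from pathlib import Path

    # Every transcript should have a matching video, and vice versa.
    tdf_stems = {p.stem for p in Path("data/transcript").glob("*.tdf")}
    mp4_stems = {p.stem for p in Path("data/video").glob("*.mp4")}
    print("transcripts without video:", sorted(tdf_stems - mp4_stems))
    print("videos without transcript:", sorted(mp4_stems - tdf_stems))

    # Verify MD5 checksums; each line of data_file_md5s.txt holds a
    # signature and a relative path separated by two spaces (section 6).
    for line in Path("docs/data_file_md5s.txt").read_text().splitlines():
        if not line:
            continue
        expected, relpath = line.split("  ", 1)
        actual = hashlib.md5(Path(relpath).read_bytes()).hexdigest()
        if actual != expected:
            print("checksum mismatch:", relpath)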
See section 6 below for descriptions of the data file formats and of the contents of the tables in docs/.

6. Data Formats

Relative to previous, project-internal releases of the transcript data, the present release uses names for the transcript files that match the original names of the video files.

All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields; this is the default file format for XTrans, LDC's transcription tool. For details of the format, see docs/TDF_format.txt.

All video files are in .mp4 format (h264), with varying bit-rates and levels of audio fidelity and video resolution.

In the docs/ directory:

- data_file_md5s.txt was created using the standard UNIX/Linux "md5sum" utility; each line contains the MD5 signature and the path name (relative to the directory that contains "data/"), separated by two spaces.

- transcript_info.tab is a typical tab-delimited flat table file; the first line contains column labels, as follows:

  Col#  Label
  1     FILENAME
  2     FIRSTTS       -- offset in seconds to the start of the first time-stamped segment
  3     LASTTS        -- offset in seconds to the end of the last time-stamped segment
  4     NSEGS         -- number of segments marked in the file
  5     NTOKENS       -- number of word tokens in transcribed segments
  6     NTRACKS       -- number of distinct speaker or "track" labels used
  7     TRACK_LABELS  -- ";"-separated list of distinct track labels (e.g. "music ; noise ; speaker1")

  When NTOKENS is zero (0), no speech was transcribed in the given file (i.e. all segments in the file contain only music, noise, singing, unintelligible, etc.). NTRACKS ranges between 1 and 29 and corresponds to the number of ";"-separated strings in the TRACK_LABELS column; the latter represents all the distinct labels used to categorize the segments in the given file, which may include first names of individuals, anonymous speaker labels ("speaker1", etc.), and types of non-speech content (music, noise, singing, etc.).

- video_info.tab is a tab-delimited flat table file; the first line contains column labels, as follows:

  Col#  Label
  1     FILENAME
  2     DURATION           -- e.g. "ln: 114.0"
  3     AUDIO_INFORMATION  -- e.g. "aud: aac, 44100 Hz, stereo, fltp, 123 kb/s (default)"
  4     VIDEO_INFORMATION  -- e.g. "vid: h264 (High), yuv420p, 504x380, 528 kb/s, 30 fps, 30 tbn, 60 tbc (default)"

  Note that columns 2-4 contain space-separated tokens: the DURATION column is fixed-width (with variable spacing for human readability), and the two INFORMATION columns are variable-width, with potentially variable numbers of space-separated tokens. Each column begins with a mnemonic ("ln:", "aud:", "vid:") that helps differentiate the content of the line as a whole.

7. Known Issues

In comparing the video_info and transcript_info tables, users may notice that the video file duration is sometimes slightly less than the last time-stamp (LASTTS) value given for the corresponding transcript file. This appears to be due mostly to rounding error, or sometimes to inaccuracy in the process used to extract the video file duration. It affects 361 of the 2395 file pairs; in 241 of these cases the discrepancy is less than 0.1 seconds, and the largest duration mismatch is 0.52 seconds (HVC458534.mp4 is reported to be this much shorter than the last time-stamp in HVC458534.tdf). A sketch reproducing this comparison appears after section 8.

8. Copyright Info

Portions (c) 2011, 2012, 2013, 2014, 2015 Trustees of the University of Pennsylvania.
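The duration comparison described in section 7 can be reproduced from the two tables in docs/. The following is a minimal sketch based on the table layouts given in section 6; it assumes the corpus root as the working directory and that the FILENAME columns carry the .tdf/.mp4 file names, and the read_table helper is the example's own.

    from pathlib import Path

    def read_table(path):
        """Read a tab-delimited .tab file into a list of row dicts keyed
        by the column labels on its first line."""
        header, *rows = Path(path).read_text().splitlines()
        labels = header.split("\t")
        return [dict(zip(labels, row.split("\t"))) for row in rows]

    # Index both tables by file stem (e.g. "HVC458534") so .tdf and .mp4
    # entries can be paired regardless of extension.
    tinfo = {Path(r["FILENAME"]).stem: r for r in read_table("docs/transcript_info.tab")}
    vinfo = {Path(r["FILENAME"]).stem: r for r in read_table("docs/video_info.tab")}

    for stem, v in vinfo.items():
        # DURATION values look like "ln: 114.0" (section 6); strip the mnemonic.
        duration = float(v["DURATION"].split(":", 1)[1])
        lastts = float(tinfo[stem]["LASTTS"])
        if duration < lastts:
            print(f"{stem}: video is {lastts - duration:.2f} s shorter than LASTTS")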
9. Contacts

strassel@ldc.upenn.edu, Stephanie Strassel (PI)
xuansong@ldc.upenn.edu, Xuansong Li (HAVIC Project Manager)
garjen@ldc.upenn.edu, Jennifer Tracey (Transcription Manager)

=========================
README created December 22, 2015