HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation
Linguistic Data Consortium
Authors: Amanda Morris, Stephanie Strassel, Xuansong Li, Brian Antonishek, Jonathan G. Fiscus

1. Introduction

To advance multimodal event detection and related technologies, the Linguistic Data Consortium (LDC), in collaboration with NIST (the National Institute of Standards and Technology), has developed a large, heterogeneous annotated multimodal corpus for the HAVIC (Heterogeneous Audio Visual Internet Collection) program. As an ongoing effort, the corpus consists of user-generated videos whose content occurs in the audio, the video, and text embedded in the video, covering the multidimensional variation inherent in user-generated video content, including variable camera motion, subject topicality, low and high quality of video resolution and compression, competing background noise, spontaneous or concurrent speech, far-field speech, multiple languages, and so on. The data is used to train, test and evaluate multimedia systems.

The HAVIC data has been used for the Multimedia Event Detection (MED) task in TRECVID (TREC Video Retrieval Evaluation) for several years (e.g., MED-10, the MED task for 2010). The TREC (Text REtrieval Conference) conference series is sponsored by NIST with additional support from other U.S. government agencies. The MED task aims at the development and evaluation of core multimedia detection systems that can quickly and accurately search a multimedia collection for user-defined events involving a person interacting with another person or object.

The data developed for the MED task comprises videos of various events (called event videos) as well as videos completely unrelated to events (called background videos). Each collected event video is manually annotated with a set of judgments describing its event properties and other salient features. Each background video is labeled with topic and genre categories. The data was previously available only to HAVIC/MED performers and is now accessible to the general public via the LDC catalog.

This corpus is a collection of event (E051-E060) videos for the HAVIC project. It was originally released (LDC2016E48) to support the 2016 Multimedia Event Detection task. Annotation and metadata for the video data are also included in this release.

2. Data Profile

video total | size  | total video duration
------------|-------|----------------------
1498        | 18 GB | 53.7 hours

3. Data Collection and Harvesting

HAVIC data includes event as well as background videos. The collection and harvesting process targets these two types.

3.1 Collection and Scouting Process

LDC hired and trained a large team of human annotators known as data scouts to search for suitable video content. Scouts are given two types of predefined topics to guide their searches: event topics and background topics. Event topics target event video collection; background topics target background video collection.

LDC has developed a web-based user interface and backend framework for HAVIC collection known as AScout. The AScout user interface is a Firefox add-on consisting of an annotation form. Data scouts use the browser in the usual way to search, navigate video websites and watch videos. When they find a suitable video, they fill out the AScout form, and the results are logged to a database.
Upon signing in to AScout, data scouts are given a specific scouting assignment, which typically consists of a target number of instances for a specific event plus some number of background clips (i.e., videos which do not contain one of the MED events). For instance, in a given scouting session a scout may be directed to find 10 event videos and 25 background videos, and a counter in the AScout interface shows progress toward that goal.

3.2 Harvesting and Formatting Process

To reduce the likelihood of duplicate video submissions, the AScout annotation framework includes a URL-based duplicate checking routine. A URL normalization function standardizes variant URLs for the same video on a host site, increasing the chances that a duplicate URL will be detected. The md5 hash of the normalized URL is stored with the annotations in the database. When a data scout enters a video URL into the AScout form, AScout normalizes it and checks the database to see whether its md5 hash already exists there. If so, AScout warns the user and disables submission of the form.

Submitted videos are harvested by a downloader script that runs at regular intervals, ranging from once every five minutes to once daily. Each video host site requires a specific set of database queries and download execution syntax, which are read from XML configuration files. After this information is read, the downloader script queries the database and generates a list of URLs. During the download step, the script first checks the clip repository to see if the file already exists. If not, it executes the clip download command - typically wget or curl, but frequently an open-source application unique to a single site - and then performs a clip check and conversion routine once the download has completed.

Media files in the corpus are required to be in MPEG-4 format, with h.264 video encoding and AAC audio encoding. The clip checker/converter verifies that the downloaded file is a valid video file. If it is already a valid video file in the correct format, the database is automatically updated with the video file name. If the file is not a valid video file - for example, when a video has been removed and the downloader retrieves a 404 HTML page instead - the annotated clip is flagged and the annotation is listed as a failed download. After the checks and conversion, additional clip metadata is generated, including video duration, codec, a unique randomized ID number, and an md5 checksum.
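For illustration, the sketch below summarizes the two automated steps just described: URL-based duplicate detection at submission time and clip validation after download. It is a minimal sketch and not part of the release; the function names, the normalization rules and the use of ffprobe are assumptions rather than the actual AScout implementation.

    # Illustrative sketch (Python) of the duplicate-check and clip-verification
    # logic described in section 3.2. Function names, normalization rules and
    # the use of ffprobe are assumptions, not the actual AScout code.
    import hashlib
    import json
    import subprocess
    from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

    def normalize_url(url):
        """Reduce variant URLs for the same video to a canonical form
        (hypothetical rules: lowercase scheme/host, drop fragments and all
        query parameters except an assumed video-id parameter 'v')."""
        parts = urlsplit(url.strip())
        kept = {k: v for k, v in parse_qs(parts.query).items() if k == "v"}
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path, urlencode(kept, doseq=True), ""))

    def is_duplicate(url, known_md5s):
        """Return True if the md5 of the normalized URL is already present in
        the database (represented here by a simple set of hex digests)."""
        digest = hashlib.md5(normalize_url(url).encode("utf-8")).hexdigest()
        return digest in known_md5s

    def check_clip(path):
        """Verify that a downloaded file is a valid video and report its codec
        and duration, using ffprobe as one plausible checking tool. Returns
        None for invalid files (e.g., a 404 HTML page saved in place of the
        video)."""
        proc = subprocess.run(
            ["ffprobe", "-v", "error", "-show_streams", "-show_format",
             "-of", "json", path],
            capture_output=True, text=True)
        if proc.returncode != 0:
            return None
        info = json.loads(proc.stdout)
        video_streams = [s for s in info.get("streams", [])
                         if s.get("codec_type") == "video"]
        if not video_streams:
            return None
        return {"codec": video_streams[0].get("codec_name"),
                "duration": float(info["format"]["duration"])}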
4. Data Annotation

4.1 Annotation Approach

Annotation is also supported by the AScout tool, where annotators annotate videos for a variety of features. Annotation consists of several stages.

Stage One: Event Selection and Definition

A pool of 164 event candidates was first created and then narrowed down to 75 events. The events were selected according to a rigorous process designed both to protect privacy and to sample the domain space appropriately, providing a meaningful measure of research progress along several dimensions. For each of the 75 official events, LDC then created a textual "event kit", a description of the event properties along with a few exemplar videos. The event kit is used during manual annotation of collected HAVIC videos.

Each event kit consists of:
--an event name: a mnemonic title for the event
--an event definition: a textual definition of the event
--an event explication: a textual exposition of the terms and concepts used in the event definition
--an evidential description: a textual listing of attributes that are indicative of an event instance
--a set of exemplars: illustrative video examples, each containing an instance of the event

Stage Two: Annotation

AScout requires annotators to make a number of decisions about each video they submit. These include:
--logging information about the page URL for the clip, along with the likely download URL
--assigning the video to a genre
--writing a brief synopsis of the video content
--flagging any problematic content, such as sensitive personal information; flagged clips require no further annotation and are excluded from the HAVIC corpus
--assigning a topic category, using a pull-down list in AScout, which consists of general topic categories like FOOD & DRINK or SPORTS as well as event-based topics like LANDING A FISH or MAKING A CAKE. If the video being submitted meets the definition of one of the events, annotators choose that event-based topic from the pull-down list; otherwise, they select the general topic that is most relevant to the video
--assigning a license type for the video; background video clips require no further annotation after this step
--event clip annotation: clips containing a target event require several additional annotations. These include:
  --determining whether the video shows a true instance of the event (a positive instance), or whether the video does not quite meet the requirements for a positive instance (a near miss). To make this determination, the annotator consults the information in the textual event kit
  --summarizing the type of evidence, including the presence of visual evidence, audio evidence, and text evidence
  --optionally listing individual items considered to be evidence for the event, such as the particular objects, people, or activities shown in the video, or particular words spoken or shown in text
  --flagging the presence of foreign-language speech or text, and indicating whether steps of the event are narrated in the video

4.2 Quality Control

To assure both quantity and quality, several QC measures have been implemented:

Pre-training: Candidate scouts are required to take a pre-screening test that assesses their ability to quickly locate novel video content on the web under time pressure, and to identify problematic content (which should not be included in the corpus) in a set of existing videos. Scouts who pass the preliminary screening undergo a roughly two-week training process to familiarize them with the goals of the HAVIC collection, event definitions and other project requirements.

Ongoing training: Ongoing training for data scouts includes biweekly group meetings, review of individual annotators' work by senior project staff, and other quality control measures. For instance, a portion of all videos in the corpus is reviewed in a second pass by senior annotators. Event instances are the highest priority for second passing, but a random sample of background clips is also reviewed. The primary goal of second passing is to verify that the video's annotations are accurate and that the video is usable. Additional quality control is performed continuously via manual inspection of the annotation table.
LDC further maintains a mailing list and wiki for data scouts to discuss questions, share scouting tips and record decisions.

Quality Control versus Corpus Variation

HAVIC data is harvested and annotated over a long period of time, and one of the project's goals is to achieve content variation. As a result, some variation in corpus content and annotation is expected, in particular:

--The predefined list of general topics is designed around broad categories, and some videos relate to more than one topic, so there will often be videos that could fit in multiple topic categories. Where a video fits multiple general topics, data scouts are instructed to select the topic they find most appropriate. A video showing a child walking a dog might be given the topic CHILDREN by one annotator, ANIMALS by another annotator, and OUTDOOR ACTIVITIES by a third annotator. This is an expected feature of the corpus.

--Another source of annotation variation relates to the event-specific topics and the fact that the corpus is not exhaustively annotated for every event. When a data scout observes an instance of a MED event they are instructed to label it as such, but scouts are not expected to consider every clip they view against the entire set of events. In fact, because the events were developed over time, not all events were known during the early phases of the corpus collection and annotation process. As a result, the corpus is known to contain some number of videos that are assigned to general topics but which depict an event. Also, some videos may contain instances of more than one event without being explicitly labeled as such.

--When it comes to assessing individual videos against a single topic, annotators typically show high agreement. That is to say, most of the time most annotators agree about whether a given video does or does not contain a positive instance of an event like MAKING A SANDWICH. However, we expect there to be a number of judgment calls where well-trained, reasonable annotators simply disagree with one another. For instance, if a video shows a person spreading butter on a piece of bread, folding it in half and eating it, annotators may not agree on whether that video is a positive instance of MAKING A SANDWICH. This type of annotator variation is expected, and accepted, in the corpus, though it has not been quantified or measured to date.

4.3 IPR Review

While there is a potentially unlimited supply of user-generated video content on the Internet, not all of it is suitable for inclusion in HAVIC, and an important part of the data scout's job is discerning which videos should be excluded. One part of this consideration is the status of a given video with respect to intellectual property rights (IPR). Scouts are given a list of approved video host sites whose terms of use are compatible with the intended use of the HAVIC corpus. Further, scouts are instructed to select individual videos with an appropriate license, for instance a type of Creative Commons license that permits redistribution.

5. Data Formats

5.1 Video Data

All video files are in .mp4 format (h.264), with varying bit-rates and levels of audio fidelity and video resolution.

5.2 Metadata and Annotation Data

Metadata and annotation for the videos are stored in a .tsv file with the following columns. Items with * are required for all clips. Items with + are optional or only occur for clips with certain properties. (A minimal example of reading this file follows the column list below.)

* column 1: ID - unique randomized numerical ID for this video/metadata entry.
* column 2: ORIGINAL_CLIP_ID - original chronological unique clip identifier number assigned during data collection.
* column 3: ORIGINAL_MEDIA_FILE - original chronological filename for the media file corresponding to this entry.
* column 4: MEDIA_FILE - randomized, non-chronological filename for the media file corresponding to this entry.
* column 5: MD5SUM - md5 checksum for the media file corresponding to this entry.
* column 6: CODEC - codec of the media file corresponding to this entry.
* column 7: DURATION - duration in seconds for the media file corresponding to this entry.
* column 8: ANNOTATOR_ID - unique annotator ID for the person who annotated this entry.
* column 9: DATE_FOUND - when the video was logged by the scout.
* column 10: GENRE - genre assigned by the data scout.
* column 11: SYNOPSIS - brief text synopsis of the video provided by the data scout.
* column 12: ASSIGNED_TOPIC - topic area assigned to the data scout to guide searching efforts.
* column 13: ANNOTATED_TOPIC - topic label selected by the data scout for this specific clip.
+ column 14: EVENT - the event portrayed in this video, as selected by the annotator from the list of MED events.
+ column 15: INSTANCE_TYPE - indicating if this video shows a true instance of the event.
  o positive: a true instance of the event
  o near_miss: somewhat related to the event, but fails to meet the definition in some crucial way and so does not constitute a true instance
  o not_sure: annotator could not decide between a positive and a near miss instance; may be reviewed during second-pass annotation
+ column 16: INSTANCE_COMMENT - optional comment providing additional information about why the clip was labeled as a near miss.
+ column 17: INSTANCE_VARIETY - subjective annotator judgment about whether the video is atypical compared to other event instances.
+ column 18: INSTANCE_COMPLEXITY - subjective annotator judgment about whether the video is more complex compared to other event instances.
+ column 19: AUDIO_EVIDENCE - indicating if the video contains audio evidence for the event.
+ column 20: NARRATIVE_AUDIO - indicating if the clip includes someone explaining or describing (some or all of) the event step-by-step as they perform it.
+ column 21: NON_ENG_SPEECH - indicating if the clip contains any speech that is not in English.
+ column 22: TEXT_EVIDENCE - indicating if the clip contains text evidence for the event.
+ column 23: NARRATIVE_TEXT - indicating if the clip includes text explaining or describing (some or all of) the event step-by-step as it is performed.
+ column 24: NON_ENG_TEXT - indicating if the clip contains text that is not in English.
+ column 25: SCENE - optional data scout description of the video's scene/setting.
+ column 26: OBJECTS - optional data scout description of objects/people appearing in the video.
+ column 27: ACTIVITIES - optional data scout description of activities appearing in the video.
* column 28: REVIEW_REQUESTED - indicating if the data scout requested supervisor review of this clip.
+ column 29: VISUAL_EVIDENCE - indicating if the video contains visual evidence for the event.
+ column 30: PEOPLE - optional data scout description of the people appearing in the video.
+ column 31: OTHER_VISUAL - optional data scout description of any other visual evidence in the video.
+ column 32: SPEECH - optional data scout description of the speech evidence in the video.
+ column 33: NOISE - optional data scout description of the (non-speech) noise evidence in the video.
+ column 34: OTHER_AUDIO - optional data scout description of any other audio evidence in the video.
+ column 35: EDITED_TEXT - optional data scout description of text evidence edited into the video.
+ column 36: EMBEDDED_TEXT - optional data scout description of text evidence naturally occurring in the video.
+ column 37: OTHER_TEXT - optional data scout description of any other textual evidence in the video.
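As a usage illustration, the following minimal sketch (not part of the release) shows one way to read the metadata/annotation table and pull out the positive event instances. Column positions follow the numbering above; whether the file includes a header row is not stated here, so the sketch checks for one, and the helper names are illustrative assumptions.

    # Minimal sketch (Python) of reading the metadata/annotation .tsv file
    # described in section 5.2. Column positions follow the README numbering
    # (1-based above, 0-based here). The file path is the one listed in the
    # corpus structure; header-row handling is an assumption.
    import csv

    TSV_PATH = "docs/metadata_annotation/metadata_and_annotation.tsv"

    # 0-based indices for README columns 1 (ID), 7 (DURATION), 14 (EVENT),
    # 15 (INSTANCE_TYPE)
    COL_ID, COL_DURATION, COL_EVENT, COL_INSTANCE_TYPE = 0, 6, 13, 14

    def read_rows(path):
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f, delimiter="\t"))
        # Assumption: treat the first row as a header if its ID field is not numeric.
        if rows and not rows[0][COL_ID].isdigit():
            rows = rows[1:]
        return rows

    def positive_instances(rows, event=None):
        """Yield (clip ID, event, duration in seconds) for clips labeled as
        positive instances, optionally restricted to a single event name."""
        for row in rows:
            if len(row) > COL_INSTANCE_TYPE and row[COL_INSTANCE_TYPE] == "positive":
                if event is None or row[COL_EVENT] == event:
                    yield row[COL_ID], row[COL_EVENT], float(row[COL_DURATION])

    if __name__ == "__main__":
        rows = read_rows(TSV_PATH)
        hours = sum(float(r[COL_DURATION]) for r in rows
                    if len(r) > COL_DURATION and r[COL_DURATION]) / 3600.0
        print("clips: %d, total duration: %.1f hours" % (len(rows), hours))
        for clip_id, event, duration in positive_instances(rows):
            print(clip_id, event, duration)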
6. Corpus Structure

This release package is structured as:

./README.txt
./video
./video/CHECKSUMS
./video/*.mp4
./docs
./docs/eventtexts/*.txt (files describing 10 events)
./docs/metadata_annotation/metadata_and_annotation.tsv
./docs/guidelines/AScoutGuidelines_V3.3_general_pub.pdf
./docs/papers/lrec2012-creating-havic.pdf

7. Copyright Info

Portions (c) 2011, 2012, 2013, 2014, 2015, 2016, 2017 Trustees of the University of Pennsylvania.

8. Acknowledgements

Many thanks to the HAVIC team at LDC, who support this task on a daily basis:

Haejoong Lee: supporting annotation infrastructure
Chris Caruso: supporting collection and database
Kevin Walker: supporting collection infrastructure
Denise DiPersio: supporting IPR/Licensing process
Dave Graff: supporting sanity check process
Daniel Jaquette: supporting data release process
Ilya Ahtaridis: supporting data release process

9. Contacts

strassel@ldc.upenn.edu, Stephanie Strassel (PI)
xuansong@ldc.upenn.edu, Xuansong Li (HAVIC Project Manager)

=========================
README created June 14, 2016