HARD 2004 Topics and Annotations
LDC2005E29
December 9, 2005
Linguistic Data Consortium

1. Introduction

The HARD 2004 Topics and Annotations Corpus was produced by the Linguistic Data Consortium (LDC) and contains topics and annotations (clarification forms, responses and relevance assessments) for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC) whose objective was to achieve high accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques like passage retrieval and targeted interaction with the searcher. The current corpus was previously distributed to HARD participants as LDC2004E42 and LDC2005E17. The source data that corresponds to this release is distributed as LDC2005E28, HARD 2004 Text. This corpus was created with support from the DARPA TIDES Program and LDC.

Three major annotation tasks are represented in this release: Topic Creation, Clarification Form Responses, and Relevance Assessment. Topics include a short title, a query plus context, and a number of limiting parameters known as "metadata", which include targeted geographical region, target data domain or genre, and level of searcher expertise. Clarification Forms are brief HTML questionnaires that system developers submitted to LDC searchers to glean additional information about information needs directly from the topic creators. Relevance assessment consisted of adjudication of pooled system responses and included document-level judgments for all topics and passage-level relevance judgments for a subset of topics.

The release is divided into training and evaluation resources. The training set comprises twenty-one topics and 100 document-level relevance judgments per topic. The evaluation set contains fifty topics, clarification forms and responses, document-level relevance assessments for all topics and passage-level relevance judgments for half of the topics. HARD participants received the reference data over the course of the evaluation cycle in stages:

  (0) training data (topics, metadata and annotations)
  (1) evaluation topic descriptions without metadata
  (2) clarification form responses
  (3) evaluation topic descriptions with metadata
  (4) evaluation topic relevance assessments

For more information about the HARD 2004 project, please visit http://www.ldc.upenn.edu/Projects/HARD.

2. Topics and Metadata

HARD topics are created by LDC annotators based on the annotators' actual interests and information needs. A topic is a theme-based research query which is not strictly event-based but is also not overly broad. Generally, HARD topics are research queries such as "What new uses will we find for corn in the future?" or "How is globalization influencing the Indian media?". Topic information follows the TREC standard and includes a short title, a sentence-long query and a paragraph-long narrative, each of which describes the topic in increasing detail.

HARD topics also add metadata, parameters that further limit the query space. Each metadata category is assigned a value during topic creation. The goal of the metadata is to develop a sort of personal profile that will differentiate users' results. There are six metadata categories:

GENRE refers to the desired data domain of the results; annotators select "news-report", "opinion-editorial", "other", or "any".

GEOGRAPHY refers to the geographical region of the desired results; options are "US", "non-US", and "any".
GRANULARITY is the amount of text -- an entire document or a specific passage -- that the topic creator wants returned as a result; options are "document" and "passage".

FAMILIARITY is the level of expertise the topic creator possesses in the field of the query; options are "little" or "much".

SUBJECT is one of twelve general categories, such as Health & Medicine or Society, into which the topic fits.

Finally, RELATED TEXT is an optional part of topic creation, where annotators paste text examples of the kinds of results they are looking for.

Annotators used a web-based topic creation form to guide their work. See docs/topic_creation_2004.html.

3. Clarification Forms

HARD sites had the option of submitting clarification forms (CFs) to LDC assessors in order to garner additional feedback from topic creators. Clarification forms typically consisted of a short HTML document asking for information like keyword relevance ranking or passage relevance assessment. The following restrictions applied:

1. The CF must display correctly on Netscape V4.78 running on Solaris 2.5.1.
2. The CF cannot be larger than can be displayed on a 16-inch monitor (an earlier draft indicated incorrectly that a 17-inch monitor was the minimum).
3. The available screen real estate is 1152 x 900 pixels.
4. The CF must be an HTML Web page: no Javascript, no Java, no Flash, nothing but HTML.
5. The page may not refer to external images; it must be self-contained.
6. The following types of data entry will be permitted (others are possible, but check in advance):
   * text boxes
   * radio and check buttons
   * drop-down menu selections

The assessor will spend no more than three (3) minutes filling out the form for a particular topic, meaning up to 150 minutes per site.

After receiving the forms, LDC annotators logged into a web-based system that displayed the forms for each user's set of topics. Forms for each topic were displayed in random order (rather than alphabetically by site name, which could lead to bias). User judgments were logged to a database.

4. Relevance Annotations

4.1 Training Data

To provide training data for HARD, the HARD corpus was indexed using local tools, and a relevance-ranked list of 100 documents per topic was returned to the annotator. LDC annotators assessed these documents using an annotation tool developed specifically for this task. Documents received one of three labels (a short illustrative sketch of this scheme appears below):

1) RELEVANT (also HARD-rel, value=1): The document is both relevant to the topic statement and meets all "metadata" restrictions.
2) ON-TOPIC (also SOFT-rel, value=0.5): The document is relevant to the topic statement but fails to meet all "metadata" restrictions (Genre, Familiarity, Geography).
3) OFF-TOPIC (value=0): The document is not at all relevant to the topic statement.

4.2 Evaluation Data

For assessing document relevance for the 50 evaluation topics, NIST distributed pooled site results to LDC (85 documents per site, per topic). LDC then used local annotation tools to assess document relevance using the three labels described above.

Twenty-five HARD 2004 topics were also reviewed for passage-level relevance, as specified in the metadata GRANULARITY value. The 25 topics are:

HARD-407  HARD-408  HARD-410  HARD-412  HARD-413
HARD-415  HARD-416  HARD-420  HARD-421  HARD-422
HARD-423  HARD-424  HARD-425  HARD-426  HARD-427
HARD-428  HARD-429  HARD-435  HARD-439  HARD-442
HARD-443  HARD-444  HARD-445  HARD-446  HARD-449

The documents for these topics were further annotated for passage-level relevance where the document label was RELEVANT or ON-TOPIC.
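The following short sketch restates the document-level decision rule above in executable form. It is illustrative only and is not part of the released corpus or its tools; the names LABEL_VALUES, judge_document and DocJudgment are invented for this example.

    # Illustrative sketch only: these names are invented and do not correspond
    # to anything in the released annotation tables or tools.
    from dataclasses import dataclass

    # Numeric values attached to the three document-level labels (section 4.1).
    LABEL_VALUES = {"RELEVANT": 1.0, "ON-TOPIC": 0.5, "OFF-TOPIC": 0.0}

    def judge_document(relevant_to_topic: bool, meets_all_metadata: bool) -> str:
        """Relevant to the topic statement and meeting all metadata restrictions
        -> RELEVANT; relevant but failing one or more metadata restrictions
        (Genre, Familiarity, Geography) -> ON-TOPIC; otherwise -> OFF-TOPIC."""
        if not relevant_to_topic:
            return "OFF-TOPIC"
        return "RELEVANT" if meets_all_metadata else "ON-TOPIC"

    @dataclass
    class DocJudgment:
        topic_id: str   # e.g. "HARD-407"
        doc_id: str     # identifier of the judged source document
        label: str      # one of the keys of LABEL_VALUES

        @property
        def value(self) -> float:
            return LABEL_VALUES[self.label]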
LDC's HARD annotation tool launches a second application for passage-level retrieval when assessors judge a document to be RELEVANT. For ON-TOPIC documents, a wrapper is used to launch the passage retrieval tool to extract passages after all other annotation is complete. The reason for the difference in approach for RELEVANT versus ON-TOPIC passages is that the annotation tool did not originally support ON-TOPIC passage extraction.

5. Workflow and Quality Control

HARD annotation workflow was controlled by AWS, an automated workflow system developed by LDC that assigns topics, files and tasks to annotators according to their managers' specifications. The system allows for multiple workflows depending on task staging and project requirements.

A unique feature of the HARD 2004 annotation process is that each topic was annotated from start to finish by the same annotator who originally devised the topic, which approximated an end-user scenario. Sites were able to interact more or less directly with the topic creators, as a search engine would with a user.

Topics were reviewed by managers and senior annotators to check spelling, consistency, and thoroughness. Clarification forms were reviewed by managers and topic creators to ensure that all forms had been answered completely. Quality control measures for the evaluation relevance assessment task involved managers, technical support staff, and annotators, who performed the following checks on the data:

o Technical staff
  - Confirmed that LDC's passage results match NIST's passage output
  - Confirmed that LDC judged the correct documents for each topic
  - Removed the "docs.excluded.from.results" documents

o Managers
  - Spot-checked labels against topic descriptions
  - Confirmed that the granularity of annotated topics matches the granularity sent to sites
  - Modified assessments based on annotator quality control

o Annotators
  - Reviewed lists of all RELEVANT and ON-TOPIC stories for their topics to ensure that their judgments were consistent

6. Annotated Data Profile

The table below summarizes the volume and type of annotations provided by LDC for the HARD 2004 evaluation:

Data Type                        Training    Evaluation
--------------------------------------------------------
Topics                                 21            50
Clarification form responses            0         2,800
Document relevance judgments        2,100        36,938
Passage relevance judgments             0         2,767

7. Source Data Profile

The corpus comprises eight English newswire and web text sources from January-December 2003. The sources are:

AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English

The volume of data for each source appears in the table below:

Source     Stories    Total Tokens    Average Tokens/Story
-----------------------------------------------------------
AFE        226,515      71,829,978             317
APE        237,067      93,294,584             393
CNE          3,674         797,194             217
LAT         18,287      12,576,721             687
NYT         28,190      16,673,028             591
SLN          3,321       4,710,500           1,418
UME          2,607         782,064             299
XIE        117,854      24,016,670             203
-----------------------------------------------------------
Total      637,515     224,680,739

8. Directory Structure

/docs - contains annotation guidelines and other corpus documentation
/training
    /topics      - contains training topic descriptions
    /annotations - contains relevance assessments
/evaluation
    /topics      - contains evaluation topic descriptions
    /annotations - contains relevance assessments
    /clarification_forms
        /forms     - contains .html CFs submitted by HARD sites
        /responses - contains annotator responses to CFs

9. File Format Description

9.1 Topics

Topic descriptions are contained in a plain text file with XML tags. A topic description without metadata contains the following fields:

  Number:       HARD-nnn
  Title:        Short, few-words description of the topic
  Description:  Sentence-length description of the topic
  Narrative:    Paragraph-length description of the topic. No mention of restrictions captured in the metadata should occur in this section. This is intended primarily to help future relevance assessors. No specific format is required.

If the topic file also includes metadata, the specification adds the following fields (an illustrative code sketch of this topic structure appears at the end of this document):

  Metadata statement:  Spells out how the author intends their metadata items to be interpreted in the context of the topic. This provides a check that everyone understands the metadata in the same way and how it affects relevance.
  Granularity:         passage | document
  Familiarity:         little | much
  Genre:               news-report | opinion-editorial | other | any
  Geography:           US | non-US | any
  Related text (free text entry):
    - On-topic but not relevant text
    - Relevant text

9.2 Clarification Forms

Clarification forms are in HTML format. No strict guidelines regarding original format were circulated to the community. The only restrictions were that CFs be displayable by Netscape 4.78, not contain JavaScript, and include a cgi-script that would log the results of each form on LDC servers. See http://www.ldc.upenn.edu/Projects/HARD/cfs.html for more details.

9.3 Annotations

Relevance table formats are described in README files within each annotation directory.

10. Contact Information

Further information about this data release can be obtained by contacting the Linguistic Data Consortium HARD 2004 managers:

- Meghan Glenn, Lead Annotator (mlglenn@ldc.upenn.edu)
- Stephanie Strassel, Associate Director, Annotation Research & Program Coordination (strassel@ldc.upenn.edu)

For further information about the HARD project at LDC, visit http://www.ldc.upenn.edu/Projects/HARD

For more information about current efforts in the HARD track, and for detailed guidelines for the research community, the Center for Intelligent Information Retrieval at the University of Massachusetts maintains an up-to-date website: http://ciir.cs.umass.edu/research/hard

11. Update Log

Readme created by Meghan Glenn, October 28, 2005
Updated by Stephanie Strassel, December 9, 2005
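The sketch below illustrates one possible in-memory representation of the topic structure described in sections 2 and 9.1. It is hypothetical: the class and field names are invented for this example and do not match the XML tag names used in the released topic files.

    # Hypothetical sketch of a HARD 2004 topic record; class and field names are
    # invented and do not match the tags in the released topic files.
    from dataclasses import dataclass
    from typing import Optional

    # Allowed metadata values, as listed in section 2 of this README.
    GENRE_VALUES       = {"news-report", "opinion-editorial", "other", "any"}
    GEOGRAPHY_VALUES   = {"US", "non-US", "any"}
    GRANULARITY_VALUES = {"document", "passage"}
    FAMILIARITY_VALUES = {"little", "much"}

    @dataclass
    class HardTopic:
        number: str                          # e.g. "HARD-407"
        title: str                           # short, few-words description
        description: str                     # sentence-length description
        narrative: str                       # paragraph-length description
        genre: str = "any"
        geography: str = "any"
        granularity: str = "document"
        familiarity: str = "little"
        subject: str = ""                    # one of twelve general categories
        related_text: Optional[str] = None   # optional pasted text examples

        def __post_init__(self):
            # Check values against the sets defined above.
            assert self.genre in GENRE_VALUES
            assert self.geography in GEOGRAPHY_VALUES
            assert self.granularity in GRANULARITY_VALUES
            assert self.familiarity in FAMILIARITY_VALUES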