HARD 2004 Topics and Annotations
LDC2005E29
December 9, 2005
Linguistic Data Consortium
1. Introduction
The HARD 2004 Topics and Annotations Corpus was produced by Linguistic Data
Consortium (LDC) and contains topics and annotations (clarification forms,
responses and relevance assessments) for the 2004 TREC HARD (High Accuracy
Retrieval from Documents) Evaluation. HARD 2004 was a track within the
NIST Text REtrieval Conference (TREC), with the objective of achieving high
accuracy retrieval from documents by leveraging additional information
about the searcher and/or the search context, through techniques like
passage retrieval and the use of targeted interaction with the searcher.
The current corpus was previously distributed to HARD participants as
LDC2004E42 and LDC2005E17. The source data that corresponds to this release
is distributed as LDC2005E28, HARD 2004 Text. This corpus was created with
support from the DARPA TIDES Program and LDC.
Three major annotation tasks are represented in this release: Topic
Creation, Clarification Form Responses, and Relevance Assessment. Topics
include a short title, a query plus context, and a number of limiting
parameters known as "metadata", which include the targeted geographical
region, the target data domain or genre, and the level of searcher
expertise. Clarification forms are brief HTML questionnaires that system
developers submitted to LDC searchers to glean additional information about
information needs directly from the topic creators. Relevance assessment
consisted of adjudicating pooled system responses and included
document-level judgments for all topics and passage-level relevance
judgments for a subset of topics.
The release is divided into training and evaluation resources. The
training set comprises twenty-one topics, with 100 document-level relevance
judgments per topic. The evaluation set contains fifty topics,
clarification forms and responses, document-level relevance assessments for
all topics, and passage-level relevance judgments for half of the
topics. HARD participants received the reference data in stages over the
course of the evaluation cycle: (0) training data (topics, metadata and
annotations), (1) evaluation topic descriptions without metadata, (2)
clarification form responses, (3) evaluation topic descriptions with
metadata, and (4) evaluation topic relevance assessments.
For more information about the HARD 2004 project, please visit
http://www.ldc.upenn.edu/Projects/HARD.
2. Topics and Metadata
HARD topics are created by LDC annotators based on their own actual
interests and information needs. A topic is a theme-based research query
that is not strictly event-based but is also not overly broad. Generally,
HARD topics are research queries such as "What new uses will we find for
corn in the future?" or "How is globalization influencing the Indian
media?". Topic information follows the TREC standard and includes a short
title, a sentence-long query, and a paragraph-long narrative, each of which
describes the topic in increasing detail.
HARD topics also add metadata, parameters that further limit the query
space. Each metadata category is assigned a value during topic
creation. The goal of the metadata is to develop a sort of personal profile
that will differentiate users' results. There are six metadata categories.
GENRE refers to the desired data domain of the results; annotators select
"news-report", "opinion-editorial", "other", or "any". GEOGRAPHY refers to
the geographical region of the desired results; options are "US", "non-US",
and "any". GRANULARITY is the amount of text -- an entire document or a
specific passage -- that the topic creator wants returned as results.
FAMILIARITY is the level of expertise the topic creator possesses in the
field of the query; options are "little" or "much".
SUBJECT is one of twelve general categories, such as Health&Medicine or
Society, into which the topic fits. Finally, RELATED TEXT is an optional
part of topic creation, where annotators paste text examples of the kinds
of results they are looking for.
Annotators used a web-based topic creation form to guide their work; see
docs/topic_creation_2004.html.
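As an informal illustration (not part of the corpus or its tools), the
closed-class metadata categories and their allowed values listed above can
be represented and checked as in the Python sketch below; the dictionary
and function names are hypothetical.

# Illustrative sketch only: names are hypothetical, but the categories
# and allowed values are those described in this section.
ALLOWED_METADATA_VALUES = {
    "GENRE":       {"news-report", "opinion-editorial", "other", "any"},
    "GEOGRAPHY":   {"US", "non-US", "any"},
    "GRANULARITY": {"document", "passage"},
    "FAMILIARITY": {"little", "much"},
}
# SUBJECT (one of twelve general categories) and the optional RELATED TEXT
# field are free-form and are not checked here.

def check_metadata(metadata):
    """Return (category, value) pairs outside the allowed values above."""
    problems = []
    for category, allowed in ALLOWED_METADATA_VALUES.items():
        value = metadata.get(category)
        if value is not None and value not in allowed:
            problems.append((category, value))
    return problems

# Example: a fully valid metadata assignment produces no problems.
print(check_metadata({"GENRE": "news-report", "GEOGRAPHY": "non-US",
                      "GRANULARITY": "passage", "FAMILIARITY": "little"}))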
3. Clarification Forms
HARD sites had the option of submitting clarification forms to LDC
assessors in order to garner additional feedback from topic creators.
Clarification forms typically consisted of a short HTML document asking for
information like keyword relevance ranking or passage relevance
assessment. The following restrictions applied:
1. The CF must display correctly on Netscape 4.78 running on Solaris 2.5.1
2. The CF cannot be larger than can be displayed on a 16-inch monitor
   (an earlier draft incorrectly indicated that a 17-inch monitor was
   the minimum)
3. The available screen real estate is 1152 x 900 pixels
4. The CF must be an HTML Web page: no JavaScript, no Java, no Flash,
   nothing but HTML
5. The page may not refer to external images; it must be self-contained
6. The following types of data entry will be permitted (others are possible, but check in advance):
* text boxes
* radio and check buttons
* drop-down menu selections
The assessor will spend no more than three (3) minutes filling out the form
for a particular topic, or up to 150 minutes per site across the fifty
evaluation topics.
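For illustration only, the Python sketch below writes a clarification form
that stays within the restrictions above: a single self-contained HTML page
with no scripts and only the permitted input types. The question text,
field names, output filename, and CGI endpoint are hypothetical
placeholders.

# Hypothetical sketch: writes a minimal, self-contained HTML CF using
# only the permitted input types (text box, radio buttons, drop-down).
# The form fields and the CGI endpoint below are placeholders.
FORM_TEMPLATE = """<html>
<head><title>Clarification form for topic {topic}</title></head>
<body>
<form method="post" action="/cgi-bin/log_cf_response">
  <p>Which keyword best matches your information need?</p>
  <p><input type="radio" name="keyword" value="kw1"> keyword 1
     <input type="radio" name="keyword" value="kw2"> keyword 2</p>
  <p>Is the following passage relevant?
     <select name="passage1">
       <option>yes</option><option>no</option><option>unsure</option>
     </select></p>
  <p>Other comments: <input type="text" name="comments" size="60"></p>
  <p><input type="submit" value="Done"></p>
</form>
</body>
</html>
"""

def write_clarification_form(topic_id, path):
    """Write a one-page HTML clarification form for the given topic."""
    with open(path, "w") as out:
        out.write(FORM_TEMPLATE.format(topic=topic_id))

write_clarification_form("HARD-407", "HARD-407.cf.html")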
After receiving the forms, LDC annotators logged into a web-based system
that displayed the forms for that user's set of topics. The forms for each
topic were displayed in random order (rather than alphabetically by site
name, which could have introduced bias). User judgments were logged to a
database.
4. Relevance Annotations
4.1 Training Data
To provide training data for HARD, the HARD corpus was indexed using local
tools, and a relevance-ranked list of 100 documents per topic was returned
to the topic's annotator. LDC annotators assessed these documents using an
annotation tool developed specifically for this task.
Documents received one of three labels:
1) RELEVANT (also, HARD-rel, value=1): The document is both relevant to
the topic statement and meets all "metadata" restrictions
2) ON-TOPIC (also, SOFT-rel, value=0.5): The document is relevant to the
topic statement but fails to meet one or more of the "metadata"
restrictions (Genre, Familiarity, Geography)
3) OFF-TOPIC (value=0): The document is not at all relevant to the topic
statement
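The labeling scheme amounts to a two-step decision: is the document relevant
to the topic statement, and if so, does it satisfy the metadata
restrictions? The Python sketch below is illustrative only (the function
and variable names are hypothetical), but the labels and numeric values are
those defined above.

# Illustrative sketch of the three-way document labeling scheme above.
LABEL_VALUES = {"RELEVANT": 1.0, "ON-TOPIC": 0.5, "OFF-TOPIC": 0.0}

def document_label(relevant_to_topic, meets_all_metadata):
    """Assign a document-level label from the two checks described above."""
    if not relevant_to_topic:
        return "OFF-TOPIC"      # value 0
    if meets_all_metadata:
        return "RELEVANT"       # HARD-rel, value 1
    return "ON-TOPIC"           # SOFT-rel, value 0.5

# A document that is on topic but misses a metadata restriction:
label = document_label(True, False)
print(label, LABEL_VALUES[label])   # -> ON-TOPIC 0.5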
4.2 Evaluation Data
For assessing document relevance for the 50 evaluation topics, NIST
distributed pooled site results to LDC (85 documents per site, per topic).
LDC then used local annotation tools to assess document relevance using the
three labels described above.
Twenty-five HARD 2004 topics were also reviewed for passage-level relevance,
as specified by their GRANULARITY metadata value. The 25 topics are:
HARD-407 HARD-408
HARD-410 HARD-412
HARD-413 HARD-415
HARD-416 HARD-420
HARD-421 HARD-422
HARD-423 HARD-424
HARD-425 HARD-426
HARD-427 HARD-428
HARD-429 HARD-435
HARD-439 HARD-442
HARD-443 HARD-444
HARD-445 HARD-446
HARD-449
The documents for these topics were further annotated for passage-level
relevance wherever the document-level label was RELEVANT or ON-TOPIC. LDC's
HARD annotation tool launched a second application for passage-level
retrieval whenever an assessor judged a document to be RELEVANT. For
ON-TOPIC documents, a wrapper was used to launch the passage retrieval tool
and extract passages after all other annotation was complete. The two
approaches differ because the annotation tool did not originally support
ON-TOPIC passage extraction.
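As an illustrative sketch of the selection rule described above (not one of
LDC's annotation tools; the function name is hypothetical), a judged
document qualifies for passage-level annotation only if its topic is one of
the 25 passage-level topics and its document-level label is RELEVANT or
ON-TOPIC:

# Hypothetical sketch of selecting documents for passage-level annotation.
PASSAGE_TOPICS = {
    "HARD-407", "HARD-408", "HARD-410", "HARD-412", "HARD-413",
    "HARD-415", "HARD-416", "HARD-420", "HARD-421", "HARD-422",
    "HARD-423", "HARD-424", "HARD-425", "HARD-426", "HARD-427",
    "HARD-428", "HARD-429", "HARD-435", "HARD-439", "HARD-442",
    "HARD-443", "HARD-444", "HARD-445", "HARD-446", "HARD-449",
}

def needs_passage_annotation(topic_id, doc_label):
    """True if a judged document should also receive passage judgments."""
    return topic_id in PASSAGE_TOPICS and doc_label in ("RELEVANT", "ON-TOPIC")

print(needs_passage_annotation("HARD-407", "ON-TOPIC"))   # True
print(needs_passage_annotation("HARD-409", "RELEVANT"))   # False: not a passage-level topic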
5. Workflow and Quality Control
The HARD annotation workflow was controlled by AWS, an automated workflow
system developed by LDC that assigns topics, files, and tasks to annotators
according to their managers' specifications. The system allows
for multiple workflows depending on task staging and project
requirements. A unique feature of the HARD 2004 annotation process is that
each topic was annotated from start to finish by the same annotator who
originally devised the topic, which approximated an end-user scenario.
Sites were able to interact more or less directly with the topic creators,
as a search engine would with a user.
Topics were reviewed by managers and senior annotators to check spelling,
consistency, and thoroughness. Clarification forms were reviewed by
managers and topic creators to ensure that all forms had been answered
completely.
Quality control measures for the evaluation relevance assessment task
involved managers, technical support staff, and annotators, who performed
the following checks on the data:
o Technical staff
- Confirmed that LDC's passage results match NIST's
passage output
- Confirmed that LDC judged the correct documents for
each topic
- Removed the "docs.excluded.from.results" documents
o Managers
- Spot-checked labels against topic descriptions
- Confirmed that granularity of annotated topics
matches granularity sent to sites
- Modified assessments based on annotator quality
control
o Annotators
- Reviewed lists of all RELEVANT and ON-TOPIC stories
for their topics to ensure that their judgments were
consistent.
6. Annotated Data Profile
The table below summarizes the volume and type of annotations provided by
LDC for the HARD 2004 evaluation:
Data Type                        Training    Evaluation
--------------------------------------------------------
Topics                                 21            50
Clarification form responses            0         2,800
Document relevance judgments        2,100        36,938
Passage relevance judgments             0         2,767
7. Source Data Profile
The source corpus (LDC2005E28, HARD 2004 Text) comprises eight English
newswire and web text sources from January-December 2003. The sources are:
AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English
Volume of data for each source appears in the table below:
Source     Stories     Total Tokens    Average Tokens/Story
------------------------------------------------------------
AFE:       226,515       71,829,978        317
APE:       237,067       93,294,584        393
CNE:         3,674          797,194        217
LAT:        18,287       12,576,721        687
NYT:        28,190       16,673,028        591
SLN:         3,321        4,710,500      1,418
UME:         2,607          782,064        299
XIE:       117,854       24,016,670        203
------------------------------------------------------------
Total:     637,515      224,680,739
8. Directory Structure
/docs - contains annotation guidelines and other corpus documentation.
/training
/topics - contains training topic descriptions
/annotations - contains relevance assessments
/evaluation
/topics - contains evaluation topic descriptions
/annotations - contains relevance assessments
/clarification_forms
/forms - contains .html CFs submitted by HARD sites
/responses - contains annotator responses to CFs
9. File Format Description
9.1. Topics
Topic descriptions are contained in a plain text file with XML tags. A
topic description without metadata carries the following fields:

  Number        HARD-nnn
  Title         Short, few-word description of the topic
  Description   Sentence-length description of the topic
  Narrative     Paragraph-length description of the topic. No mention of
                restrictions captured in the metadata should occur in this
                section. This is intended primarily to help future
                relevance assessors. No specific format is required.

If the topic file also includes metadata, the following fields are added:

  Metadata explanation
                Spells out how the author intends their metadata items to
                be interpreted in the context of the topic. This provides
                a check that everyone understands the metadata in the same
                way and how it affects relevance.
  Granularity   passage | document
  Familiarity   little | much
  Genre         news-report | opinion-editorial | other | any
  Geography     US | non-US | any
  On-topic text On-topic but not relevant text
  Relevant text Relevant text
  Related text  Free text entry
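For readers who want to process topic files programmatically, the Python
sketch below extracts tagged fields from a topic description. The tag
names used here are placeholders only; consult the topic files themselves
and the guidelines in /docs for the exact tags used in this release.

import re

# Hypothetical sketch: the tag names below are placeholders only; check
# the topic files under /training/topics and /evaluation/topics (and the
# guidelines in /docs) for the exact tags used in this release.
PLACEHOLDER_TAGS = {
    "number": "number", "title": "title", "desc": "description",
    "narr": "narrative", "granularity": "granularity",
    "familiarity": "familiarity", "genre": "genre",
    "geography": "geography",
}

def read_topic(path):
    """Return a dict of whatever placeholder-named fields are found."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    topic = {}
    for tag, field in PLACEHOLDER_TAGS.items():
        match = re.search(r"<{0}>(.*?)</{0}>".format(tag), text,
                          re.DOTALL | re.IGNORECASE)
        if match:
            topic[field] = match.group(1).strip()
    return topic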
9.2. Clarification Forms
Clarification forms are in HTML format. No strict guidelines regarding
original format were circulated to the community. The only restrictions
were that CFs be displayable by Netscape 4.78, contain no JavaScript, and
include a CGI script that would log the responses to each form on LDC
servers.
See http://www.ldc.upenn.edu/Projects/HARD/cfs.html for more details.
9.3. Annotations
Relevance table formats are described in README files within each
annotation directory.
10. Contact Information
Further information about this data release can be obtained by
contacting the Linguistic Data Consortium HARD 2004 managers:
- Meghan Glenn, Lead Annotator (mlglenn@ldc.upenn.edu)
- Stephanie Strassel, Associate Director, Annotation Research &
Program Coordination (strassel@ldc.upenn.edu)
For further information about the HARD project at LDC, visit
http://www.ldc.upenn.edu/Projects/HARD
For more information about current efforts in the HARD track, and for
detailed guidelines for the research community, the Center for
Intelligent Information Retrieval at the University of Massachusetts
maintains an up-to-date website.
http://ciir.cs.umass.edu/research/hard
11. Update Log
Readme created by Meghan Glenn, October 28, 2005
Updated by Stephanie Strassel, December 9, 2005