HARD 2004 Topics and Annotations
LDC2005E29
December 9, 2005
Linguistic Data Consortium

1. Introduction

The HARD 2004 Topics and Annotations Corpus was produced by the Linguistic Data Consortium (LDC) and contains topics and annotations (clarification forms, responses and relevance assessments) for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC) whose objective was to achieve high accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques like passage retrieval and targeted interaction with the searcher. The current corpus was previously distributed to HARD participants as LDC2004E42 and LDC2005E17. The source data that corresponds to this release is distributed as LDC2005E28, HARD 2004 Text. This corpus was created with support from the DARPA TIDES Program and LDC.

Three major annotation tasks are represented in this release: Topic Creation, Clarification Form Responses, and Relevance Assessment. Topics include a short title, a query plus context, and a number of limiting parameters known as "metadata", which include targeted geographical region, target data domain or genre, and level of searcher expertise. Clarification Forms are brief HTML questionnaires that system developers submitted to LDC searchers to glean additional information about information needs directly from the topic creators. Relevance assessment consisted of adjudication of pooled system responses and included document-level judgments for all topics and passage-level relevance judgments for a subset of topics.

The release is divided into training and evaluation resources. The training set comprises twenty-one topics and 100 document-level relevance judgments per topic. The evaluation set contains fifty topics, clarification forms and responses, document-level relevance assessments for all topics and passage-level relevance judgments for half of the topics. HARD participants received the reference data over the course of the evaluation cycle in stages:

  (0) training data (topics, metadata and annotations)
  (1) evaluation topic descriptions without metadata
  (2) clarification form responses
  (3) evaluation topic descriptions with metadata
  (4) evaluation topic relevance assessments

For more information about the HARD 2004 project, please visit http://www.ldc.upenn.edu/Projects/HARD.

2. Topics and Metadata

HARD topics are created by LDC annotators based on the annotators' actual interests and information needs. A topic is a theme-based research query which is not strictly event-based but is also not overly broad. Generally, HARD topics are research queries such as "What new uses will we find for corn in the future?" or "How is globalization influencing the Indian media?". Topic information follows the TREC standard and includes a short title, a sentence-long query and a paragraph-long narrative, each of which describes the topic in increasing detail.

HARD topics also add metadata, parameters that further limit the query space. Each metadata category is assigned a value during topic creation. The goal of the metadata is to develop a sort of personal profile that will differentiate users' results. There are six metadata categories:

GENRE refers to the desired data domain of the results; annotators select "news-report", "opinion-editorial", "other", or "any".

GEOGRAPHY refers to the geographical region of the desired results; options are "US", "non-US", and "any".
GRANULARITY is the amount of text -- an entire document or a specific passage -- that the topic creator wants returned as a result; options are "document" and "passage".

FAMILIARITY is the level of expertise the topic creator possesses in the field of the query; options are "little" or "much".

SUBJECT is one of twelve general categories, such as Health & Medicine or Society, into which the topic fits.

Finally, RELATED TEXT is an optional part of topic creation, where annotators paste text examples of the kinds of results they are looking for.

Annotators used a web-based topic creation form to guide their work. See docs/topic_creation_2004.html.

3. Clarification Forms

HARD sites had the option of submitting clarification forms (CFs) to LDC assessors in order to garner additional feedback from topic creators. Clarification forms typically consisted of a short HTML document asking for information like keyword relevance ranking or passage relevance assessment. The following restrictions applied:

1. The CF must display correctly on Netscape V4.78 running on Solaris 2.5.1.
2. The CF cannot be larger than can be displayed on a 16-inch monitor (an earlier draft indicated incorrectly that a 17-inch monitor was the minimum).
3. The available screen real estate is 1152 x 900 pixels.
4. The CF must be an HTML Web page: no Javascript, no Java, no Flash, nothing but HTML.
5. The page may not refer to external images; it must be self-contained.
6. The following types of data entry will be permitted (others are possible, but check in advance):
   * text boxes
   * radio and check buttons
   * drop-down menu selections

The assessor will spend no more than three (3) minutes filling out the form for a particular topic, meaning up to 150 minutes per site.

After receiving the forms, LDC annotators logged into a web-based system that displayed the forms for each user's set of topics. Forms for each topic were displayed in random order (rather than alphabetically by site name, which could lead to bias). User judgments were logged to a database.

4. Relevance Annotations

4.1 Training Data

To provide training data for HARD, the HARD corpus was indexed using local tools, and a relevance-ranked list of 100 documents per topic was returned to the annotator. LDC annotators assessed these documents using an annotation tool developed specifically for this task. Documents received one of three labels (a short illustrative sketch of this scheme appears below):

1) RELEVANT (also HARD-rel, value=1): The document is both relevant to the topic statement and meets all "metadata" restrictions.
2) ON-TOPIC (also SOFT-rel, value=0.5): The document is relevant to the topic statement but fails to meet all "metadata" restrictions (Genre, Familiarity, Geography).
3) OFF-TOPIC (value=0): The document is not at all relevant to the topic statement.

4.2 Evaluation Data

For assessing document relevance for the 50 evaluation topics, NIST distributed pooled site results to LDC (85 documents per site, per topic). LDC then used local annotation tools to assess document relevance using the three labels described above.

Twenty-five HARD 2004 topics were also reviewed for passage-level relevance, as specified in the metadata GRANULARITY value. The 25 topics are:

HARD-407  HARD-408  HARD-410  HARD-412  HARD-413
HARD-415  HARD-416  HARD-420  HARD-421  HARD-422
HARD-423  HARD-424  HARD-425  HARD-426  HARD-427
HARD-428  HARD-429  HARD-435  HARD-439  HARD-442
HARD-443  HARD-444  HARD-445  HARD-446  HARD-449

The documents for these topics were further annotated for passage-level relevance where the document label was RELEVANT or ON-TOPIC.
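The following short sketch restates the document-level decision rule above in executable form. It is illustrative only and is not part of the released corpus or its tools; the names LABEL_VALUES, judge_document and DocJudgment are invented for this example.

    # Illustrative sketch only: these names are invented and do not correspond
    # to anything in the released annotation tables or tools.
    from dataclasses import dataclass

    # Numeric values attached to the three document-level labels (section 4.1).
    LABEL_VALUES = {"RELEVANT": 1.0, "ON-TOPIC": 0.5, "OFF-TOPIC": 0.0}

    def judge_document(relevant_to_topic: bool, meets_all_metadata: bool) -> str:
        """Relevant to the topic statement and meeting all metadata restrictions
        -> RELEVANT; relevant but failing one or more metadata restrictions
        (Genre, Familiarity, Geography) -> ON-TOPIC; otherwise -> OFF-TOPIC."""
        if not relevant_to_topic:
            return "OFF-TOPIC"
        return "RELEVANT" if meets_all_metadata else "ON-TOPIC"

    @dataclass
    class DocJudgment:
        topic_id: str   # e.g. "HARD-407"
        doc_id: str     # identifier of the judged source document
        label: str      # one of the keys of LABEL_VALUES

        @property
        def value(self) -> float:
            return LABEL_VALUES[self.label]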
LDC's HARD annotation tool launches a second application for passage-level retrieval when assessors judge a document to be RELEVANT. For ON-TOPIC documents, a wrapper is used to launch the passage retrieval tool to extract passages after all other annotation is complete. The reason for the difference in approach for RELEVANT versus ON-TOPIC passages is that the annotation tool did not originally support ON-TOPIC passage extraction.

5. Workflow and Quality Control

HARD annotation workflow was controlled by AWS, an automated workflow system developed by LDC that assigns topics, files and tasks to annotators according to their managers' specifications. The system allows for multiple workflows depending on task staging and project requirements.

A unique feature of the HARD 2004 annotation process is that each topic was annotated from start to finish by the same annotator who originally devised the topic, which approximated an end-user scenario. Sites were able to interact more or less directly with the topic creators, as a search engine would with a user.

Topics were reviewed by managers and senior annotators to check spelling, consistency, and thoroughness. Clarification forms were reviewed by managers and topic creators to ensure that all forms had been answered completely. Quality control measures for the evaluation relevance assessment task involved managers, technical support staff, and annotators, who performed the following checks on the data:

o Technical staff
  - Confirmed that LDC's passage results match NIST's passage output
  - Confirmed that LDC judged the correct documents for each topic
  - Removed the "docs.excluded.from.results" documents

o Managers
  - Spot-checked labels against topic descriptions
  - Confirmed that the granularity of annotated topics matches the granularity sent to sites
  - Modified assessments based on annotator quality control

o Annotators
  - Reviewed lists of all RELEVANT and ON-TOPIC stories for their topics to ensure that their judgments were consistent

6. Annotated Data Profile

The table below summarizes the volume and type of annotations provided by LDC for the HARD 2004 evaluation:

Data Type                        Training    Evaluation
--------------------------------------------------------
Topics                                 21            50
Clarification form responses            0         2,800
Document relevance judgments        2,100        36,938
Passage relevance judgments             0         2,767

7. Source Data Profile

The corpus comprises eight English newswire and web text sources from January-December 2003. The sources are:

AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English

The volume of data for each source appears in the table below:

Source     Stories    Total Tokens    Average Tokens/Story
-----------------------------------------------------------
AFE        226,515      71,829,978             317
APE        237,067      93,294,584             393
CNE          3,674         797,194             217
LAT         18,287      12,576,721             687
NYT         28,190      16,673,028             591
SLN          3,321       4,710,500           1,418
UME          2,607         782,064             299
XIE        117,854      24,016,670             203
-----------------------------------------------------------
Total      637,515     224,680,739

8. Directory Structure

/docs - contains annotation guidelines and other corpus documentation
/training
    /topics      - contains training topic descriptions
    /annotations - contains relevance assessments
/evaluation
    /topics      - contains evaluation topic descriptions
    /annotations - contains relevance assessments
    /clarification_forms
        /forms     - contains .html CFs submitted by HARD sites
        /responses - contains annotator responses to CFs

9. File Format Description

9.1 Topics

Topic descriptions are contained in a plain text file with XML tags. A topic description without metadata contains the following fields:

  Number:       HARD-nnn
  Title:        Short, few-words description of the topic
  Description:  Sentence-length description of the topic
  Narrative:    Paragraph-length description of the topic. No mention of restrictions captured in the metadata should occur in this section. This is intended primarily to help future relevance assessors. No specific format is required.

If the topic file also includes metadata, the specification adds the following fields (an illustrative code sketch of this topic structure appears at the end of this document):

  Metadata statement:  Spells out how the author intends their metadata items to be interpreted in the context of the topic. This provides a check that everyone understands the metadata in the same way and how it affects relevance.
  Granularity:         passage | document
  Familiarity:         little | much
  Genre:               news-report | opinion-editorial | other | any
  Geography:           US | non-US | any
  Related text (free text entry):
    - On-topic but not relevant text
    - Relevant text

9.2 Clarification Forms

Clarification forms are in HTML format. No strict guidelines regarding original format were circulated to the community. The only restrictions were that CFs be displayable by Netscape 4.78, not contain JavaScript, and include a cgi-script that would log the results of each form on LDC servers. See http://www.ldc.upenn.edu/Projects/HARD/cfs.html for more details.

9.3 Annotations

Relevance table formats are described in README files within each annotation directory.

10. Contact Information

Further information about this data release can be obtained by contacting the Linguistic Data Consortium HARD 2004 managers:

- Meghan Glenn, Lead Annotator (mlglenn@ldc.upenn.edu)
- Stephanie Strassel, Associate Director, Annotation Research & Program Coordination (strassel@ldc.upenn.edu)

For further information about the HARD project at LDC, visit http://www.ldc.upenn.edu/Projects/HARD

For more information about current efforts in the HARD track, and for detailed guidelines for the research community, the Center for Intelligent Information Retrieval at the University of Massachusetts maintains an up-to-date website: http://ciir.cs.umass.edu/research/hard

11. Update Log

Readme created by Meghan Glenn, October 28, 2005
Updated by Stephanie Strassel, December 9, 2005
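The sketch below illustrates one possible in-memory representation of the topic structure described in sections 2 and 9.1. It is hypothetical: the class and field names are invented for this example and do not match the XML tag names used in the released topic files.

    # Hypothetical sketch of a HARD 2004 topic record; class and field names are
    # invented and do not match the tags in the released topic files.
    from dataclasses import dataclass
    from typing import Optional

    # Allowed metadata values, as listed in section 2 of this README.
    GENRE_VALUES       = {"news-report", "opinion-editorial", "other", "any"}
    GEOGRAPHY_VALUES   = {"US", "non-US", "any"}
    GRANULARITY_VALUES = {"document", "passage"}
    FAMILIARITY_VALUES = {"little", "much"}

    @dataclass
    class HardTopic:
        number: str                          # e.g. "HARD-407"
        title: str                           # short, few-words description
        description: str                     # sentence-length description
        narrative: str                       # paragraph-length description
        genre: str = "any"
        geography: str = "any"
        granularity: str = "document"
        familiarity: str = "little"
        subject: str = ""                    # one of twelve general categories
        related_text: Optional[str] = None   # optional pasted text examples

        def __post_init__(self):
            # Check values against the sets defined above.
            assert self.genre in GENRE_VALUES
            assert self.geography in GEOGRAPHY_VALUES
            assert self.granularity in GRANULARITY_VALUES
            assert self.familiarity in FAMILIARITY_VALUES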