HARD 2004 Text

		      LDC Catalog Number: LDC2005E28
			Linguistic Data Consortium

			     December 6, 2005



Contents:
    1. Introduction
    2. Data Content
    3. Data Format
    4. Notice about the New York Times (NYT) data
    5. Copyright Information

1. Introduction
---------------
     This corpus contains source data for the 2004 TREC HARD (High Accuracy
Retrieval from Documents) Evaluation.  HARD 2004 was a track within the
NIST Text REtrieval Conference (TREC).  Its objective was to achieve
high-accuracy retrieval from documents by leveraging additional
information about the searcher and/or the search context, through
techniques such as passage retrieval and targeted interaction with
the searcher.  This corpus was previously distributed to HARD
participants as LDC2004E30.  The topics and annotations that correspond to
this release are distributed as LDC2005T29, HARD 2004 Topics and
Annotations.  This corpus was created with support from the DARPA TIDES
Program and LDC.


2. Data Content
---------------
    The corpus comprises eight English newswire and web text sources from 
    January-December 2003.  The sources are:

    AFE: Agence France Presse - English
    APE: Associated Press Newswire
    CNE: Central News Agency Taiwan - English
    LAT: Los Angeles Times/Washington Post
    NYT: New York Times
    SLN: Salon.com
    UME: Ummah Press - English
    XIE: Xinhua News Agency - English

    Volume of data for each source appears in the table below:

    Source  Stories   Total Tokens   Average Tokens/Story
    -----------------------------------------------------
    AFE:    226,515     71,829,978          317
    APE:    237,067     93,294,584          393
    CNE:      3,674        797,194          217
    LAT:     18,287     12,576,721          687
    NYT:     28,190     16,673,028          591
    SLN:      3,321      4,710,500        1,418
    UME:      2,607        782,064          299
    XIE:    117,854     24,016,670          203

    Total:  637,515    224,680,739
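
    The totals and per-source averages in the table can be rechecked with
    a short script.  This is only a sketch: the VOLUME dictionary and the
    tokens_per_story helper are illustrative names, not part of the
    corpus.  Printed averages may differ by one from the table, since the
    table's rounding convention is not stated.

```python
# (stories, total tokens) per source, copied from the table above.
VOLUME = {
    "AFE": (226_515, 71_829_978),
    "APE": (237_067, 93_294_584),
    "CNE": (3_674, 797_194),
    "LAT": (18_287, 12_576_721),
    "NYT": (28_190, 16_673_028),
    "SLN": (3_321, 4_710_500),
    "UME": (2_607, 782_064),
    "XIE": (117_854, 24_016_670),
}


def tokens_per_story(stories, tokens):
    """Return the mean number of tokens per story as a float."""
    return tokens / stories


total_stories = sum(stories for stories, _ in VOLUME.values())
total_tokens = sum(tokens for _, tokens in VOLUME.values())

if __name__ == "__main__":
    for src, (stories, tokens) in VOLUME.items():
        print(f"{src}: {tokens_per_story(stories, tokens):8.1f}")
    print(f"Total stories: {total_stories:,}")
    print(f"Total tokens:  {total_tokens:,}")
```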
 

3. Data Format
--------------
    Files are organized by source on a daily basis. Each file contains
    multiple documents identified by unique document IDs, in the form
    "SRCyyyymmdd.nnnn", where 'SRC' is the three-letter source code listed
    in Section 2, 'yyyymmdd' is the date, and 'nnnn' is a sequential number
    starting from "0001" for each source/day.
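
    As an illustration, a document ID of this form could be split into
    its parts as follows.  This is a minimal sketch, assuming that 'SRC'
    is one of the three-letter source codes from Section 2 and that
    'yyyymmdd' is a calendar date.  The function name parse_doc_id and
    the example ID are hypothetical and are not taken from the corpus.

```python
import re
from datetime import date

# Three-letter source codes listed in Section 2.
SOURCES = {"AFE", "APE", "CNE", "LAT", "NYT", "SLN", "UME", "XIE"}

# SRC, yyyy, mm, dd, then a dot and the four-digit sequence number.
DOC_ID = re.compile(r"([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})")


def parse_doc_id(doc_id):
    """Split an ID like 'SRCyyyymmdd.nnnn' into (source, date, sequence)."""
    match = DOC_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not a SRCyyyymmdd.nnnn document ID: {doc_id!r}")
    src, year, month, day, seq = match.groups()
    if src not in SOURCES:
        raise ValueError(f"unknown source code: {src!r}")
    # date() raises ValueError for impossible dates such as month 13.
    return src, date(int(year), int(month), int(day)), int(seq)


if __name__ == "__main__":
    # Hypothetical ID, used only to show the shape of the result.
    print(parse_doc_id("NYT20030124.0001"))
```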

    In addition, each document has some or all of the following components:

       - Keyword (optional), surrounded by <KEYWORD> tags
       - Date/time (optional), surrounded by <DATE_TIME> tags
       - Headline, surrounded by <HEADLINE> tags
       - Main part, surrounded by <TEXT> tags. <P> tags are used within
         this part to identify paragraph boundaries.
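
    The components listed above could be pulled out of a single document
    with regular expressions, as sketched below.  Several things here are
    assumptions rather than facts about the corpus: that each tag has a
    matching closing tag (e.g. </HEADLINE>), that the input string holds
    exactly one document, and the sample text itself, which is invented.
    Because this file does not say whether <P> is paired or used as a
    separator, the sketch splits on either form.

```python
import re


def extract_components(doc_text):
    """Return the optional and required components of one document."""

    def field(tag):
        # Non-greedy match so only the first <TAG>...</TAG> pair is taken.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", doc_text, re.DOTALL)
        return m.group(1).strip() if m else None

    body = field("TEXT")
    paragraphs = []
    if body is not None:
        # Split on <P> or </P>, then collapse line wrapping in each piece.
        for piece in re.split(r"</?P>", body):
            if piece.strip():
                paragraphs.append(" ".join(piece.split()))

    return {
        "keyword": field("KEYWORD"),      # optional
        "date_time": field("DATE_TIME"),  # optional
        "headline": field("HEADLINE"),
        "paragraphs": paragraphs,
    }


# Invented sample, not taken from the corpus.
SAMPLE = """<HEADLINE>
Example headline
</HEADLINE>
<TEXT>
<P>
First paragraph,
wrapped over two lines.
</P>
<P>
Second paragraph.
</P>
</TEXT>"""

if __name__ == "__main__":
    print(extract_components(SAMPLE))
```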

4. Notice about the New York Times (NYT) data
---------------------------------------------
    From January 24, 2003 until the end of that year, a technical problem
    affected the modem used to collect New York Times(R) data, and this went
    undetected until the data were being prepared for use in the HARD 2004
    corpus.  As a result of this problem, the NYT portion of the corpus
    contains fewer stories than expected.  Even among the stories that are
    present, there is a small but noticeable amount of data corruption,
    affecting an indeterminate number of stories.  In general, the "noise" in
    the data tends to be limited to just a few word tokens per story, and often
    shows up in the form of character substitutions (e.g. a non-space
    character where a space would be expected, digits or punctuation in the
    middle of a word, and so on).  We have also seen some cases where brief
    portions of the text are missing.

    The LDC, in cooperation with project sponsors, decided that the benefit of
    wider variation in source material provided by NYT would outweigh the
    disadvantages posed by these problems in the data.

    To minimize the impact of the problems on annotation, LDC has applied
    extra quality control measures to reduce the level of distraction caused
    by the noisy data.  These include:

       - Removing articles with pervasive noise.
       - Removing obviously unreadable segments, e.g. very long sequences
         of random characters without a space.
       - Repairing the line wrapping where necessary.
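
    Users who want to screen the NYT text for residual noise might start
    from a heuristic like the one below.  It is not the procedure LDC
    applied; it is only a sketch that flags the two patterns described in
    this section: very long sequences without a space, and digits in the
    middle of a word.  The 40-character threshold and all names are
    assumptions, and legitimate tokens such as "B2B" or long URLs will
    also be flagged.

```python
import re

# Assumed threshold: tokens longer than this are treated as unreadable.
MAX_TOKEN_LEN = 40

# A digit with a letter directly on each side, e.g. "gov3rnment".
DIGIT_IN_WORD = re.compile(r"[A-Za-z][0-9][A-Za-z]")


def suspicious_tokens(text, max_len=MAX_TOKEN_LEN):
    """Return whitespace-delimited tokens that look like noise."""
    flagged = []
    for token in text.split():
        if len(token) > max_len or DIGIT_IN_WORD.search(token):
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    # Invented example text, not taken from the corpus.
    print(suspicious_tokens("The gov3rnment said on Friday"))
```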

5. Copyright Information
------------------------
    Portions (c) 2003 Agence France Presse, Associated Press Newswire,
    Central News Agency Taiwan, Los Angeles Times/Washington Post, New York
    Times, Salon.com, Ummah Press, Xinhua News Agency