HARD 2004 Text
LDC Catalog Number: LDC2005E28
Linguistic Data Consortium
December 6, 2005

Contents:

  1. Introduction
  2. Data Content
  3. Data Format
  4. Notice about the New York Times(R) data
  5. Copyright Information


1. Introduction
---------------

This corpus contains source data for the 2004 TREC HARD (High Accuracy
Retrieval from Documents) Evaluation. HARD 2004 was a track within the
NIST Text REtrieval Conference (TREC), with the objective of achieving
high accuracy retrieval from documents by leveraging additional
information about the searcher and/or the search context, through
techniques like passage retrieval and the use of targeted interaction
with the searcher.

The current corpus was previously distributed to HARD Participants as
LDC2004E30. The topics and annotations that correspond to this release
are distributed as LDC2005T29, HARD 2004 Topics and Annotations.

This corpus was created with support from the DARPA TIDES Program and
LDC.


2. Data Content
---------------

The corpus comprises eight English newswire and web text sources from
January-December 2003. The sources are:

  AFE: Agence France Presse - English
  APE: Associated Press Newswire
  CNE: Central News Agency Taiwan - English
  LAT: Los Angeles Times/Washington Post
  NYT: New York Times
  SLN: Salon.com
  UME: Ummah Press - English
  XIE: Xinhua News Agency - English

Volume of data for each source appears in the table below:

  Source      Stories    Total Tokens    Average Tokens/Story
  -----------------------------------------------------------
  AFE:        226,515      71,829,978                     317
  APE:        237,067      93,294,584                     393
  CNE:          3,674         797,194                     217
  LAT:         18,287      12,576,721                     687
  NYT:         28,190      16,673,028                     591
  SLN:          3,321       4,710,500                   1,418
  UME:          2,607         782,064                     299
  XIE:        117,854      24,016,670                     203

  Total:      637,515     224,680,739


3. Data format
--------------

Files are organized by source on a daily basis.
Each file contains multiple documents identified by unique document
IDs, in the form "SRCyyyymmdd.nnnn", where 'nnnn' is a sequential
number starting from "0001" for each source/day. In addition, each
document has some or all of the following components:

  - Keyword (optional), surrounded by tags
  - Date/time (optional), surrounded by tags
  - Headline, surrounded by tags
  - Main part, surrounded by tags.

Tags are used within this part to identify paragraph boundaries.


4. Notice about the New York Times (NYT) data
---------------------------------------------

From January 24, 2003 until the end of that year, a technical problem
affected the modem used to collect New York Times(R) data, and this
went undetected until the data were being prepared for use in the HARD
2004 corpus. As a result of this problem, the NYT portion of the corpus
contains fewer stories than expected. Even among the stories that are
present, there is a small but noticeable amount of data corruption,
affecting an indeterminate number of stories.

In general, the "noise" in the data tends to be limited to just a few
word tokens per story, and often shows up in the form of character
substitutions (e.g. a non-space character where a space would be
expected, digits or punctuation in the middle of a word, and so on).
We have also seen some cases where brief portions of the text are
missing.

The LDC, in cooperation with project sponsors, decided that the benefit
of wider variation in source material provided by NYT would outweigh
the disadvantages posed by these problems in the data. To minimize the
impact of the problems on annotation, LDC has applied extra quality
control measures to reduce the level of distraction caused by the noisy
data. These include:

  - Removing articles with pervasive noise
  - Removing obviously unreadable segments, e.g. very long sequences
    of random characters without a space.
  - Repairing the line wrapping where necessary.


5. Copyright Information
------------------------

Portions (c) 2003 Agence France Presse, Associated Press Newswire,
Central News Agency Taiwan, Los Angeles Times/Washington Post, New York
Times, Salon.com, Ummah Press, Xinhua News Agency
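

Example: parsing document IDs
-----------------------------

As an illustration of the document ID form "SRCyyyymmdd.nnnn" described
in Section 3, the following is a minimal Python sketch, not part of the
LDC distribution. It assumes that "SRC" is one of the eight three-letter
source codes listed in Section 2 and that "yyyymmdd" is a calendar date.
The names parse_doc_id and DocId, and the sample ID used in the usage
note, are hypothetical and were made up for this illustration.

```python
import re
from datetime import date
from typing import NamedTuple

# Source codes as listed in Section 2 of this README.
SOURCES = {"AFE", "APE", "CNE", "LAT", "NYT", "SLN", "UME", "XIE"}

# Assumed reading of "SRCyyyymmdd.nnnn": 3 letters, 8 digits, dot, 4 digits.
DOC_ID = re.compile(r"([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})")


class DocId(NamedTuple):
    source: str  # three-letter source code
    day: date    # publication day taken from the ID
    seq: int     # sequential number within that source/day


def parse_doc_id(doc_id: str) -> DocId:
    """Split a document ID into source, date and sequence number.

    Raises ValueError if the ID does not fit the assumed pattern.
    """
    m = DOC_ID.fullmatch(doc_id)
    if m is None:
        raise ValueError(f"not of the form SRCyyyymmdd.nnnn: {doc_id!r}")
    src, year, month, day, seq = m.groups()
    if src not in SOURCES:
        raise ValueError(f"unknown source code: {src!r}")
    seq_num = int(seq)
    if seq_num < 1:
        # The README states that numbering starts from "0001".
        raise ValueError(f"sequence number must start at 0001: {doc_id!r}")
    # date() itself raises ValueError for impossible dates such as month 13.
    return DocId(src, date(int(year), int(month), int(day)), seq_num)
```

For a made-up ID such as "NYT20030124.0001", parse_doc_id returns the
source "NYT", the date 2003-01-24 and the sequence number 1. IDs with an
unlisted source code or an impossible date raise ValueError.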