HARD 2004 Topics and Annotations

Item Name:	HARD 2004 Topics and Annotations
Author(s):	Stephanie Strassel, Meghan Glenn
LDC Catalog No.:	LDC2005T29
ISBN:	1-58563-373-9
ISLRN:	721-717-066-331-5
DOI:	https://doi.org/10.35111/8sx8-1q92
Release Date:	December 20, 2005
Member Year(s):	2005
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	automatic content extraction, information detection, information extraction, information retrieval, topic detection and tracking
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2005T29 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Strassel, Stephanie, and Meghan Glenn. HARD 2004 Topics and Annotations LDC2005T29. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:	isAnnotationOf LDC2005T28 HARD 2004 Text

Introduction

The HARD 2004 Text Corpus was developed by the Linguistic Data Consortium (LDC) and contains approximately 225 million tokens of English text.

This corpus contains source data for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC). Its objective was to achieve high-accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques such as passage retrieval and targeted interaction with the searcher.

The current corpus was previously distributed to HARD participants as LDC2004E30. The topics and annotations that correspond to this release are distributed as HARD 2004 Topics and Annotations (LDC2005T29). This corpus was created with support from the DARPA TIDES Program and LDC.

Data

The corpus comprises eight English newswire and web text sources covering January through December 2003. The sources and their data volumes appear in the table below:

Source	Code	Stories	Total Tokens	Avg. Tokens/Story
Agence France Presse - English	AFE	226,515	71,829,978	317
Associated Press Newswire	APE	237,067	93,294,584	393
Central News Agency Taiwan - English	CNE	3,674	797,194	217
Los Angeles Times/Washington Post	LAT	18,287	12,576,721	687
New York Times	NYT	28,190	16,673,028	591
Salon.com	SLN	3,321	4,710,500	1,418
Ummah Press - English	UME	2,607	782,064	299
Xinhua News Agency - English	XIE	117,854	24,016,670	203
Totals		637,515	224,680,739
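
As a quick consistency check, the per-source figures in the table can be summed and compared against the Totals row. The following is a minimal sketch; the variable names are illustrative and not part of the corpus documentation.

```python
# Per-source (stories, total tokens), copied from the table above.
ROWS = {
    "AFE": (226_515, 71_829_978),
    "APE": (237_067, 93_294_584),
    "CNE": (3_674, 797_194),
    "LAT": (18_287, 12_576_721),
    "NYT": (28_190, 16_673_028),
    "SLN": (3_321, 4_710_500),
    "UME": (2_607, 782_064),
    "XIE": (117_854, 24_016_670),
}

total_stories = sum(stories for stories, _ in ROWS.values())
total_tokens = sum(tokens for _, tokens in ROWS.values())

# These should match the Totals row: 637,515 stories and 224,680,739 tokens.
print(f"{total_stories:,} stories, {total_tokens:,} tokens")

# Corpus-wide mean story length, which the table does not list.
print(f"{total_tokens / total_stories:.1f} tokens per story overall")
```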

Files are organized by source on a daily basis. Each file contains multiple documents identified by unique document IDs, in the form "SRCyyyymmdd.nnnn", where 'nnnn' is a sequential number starting from "0001" for each source/day. In addition, each document has some or all of the following components:

Keyword (optional), surrounded by tags
Date/time (optional), surrounded by tags
Headline, surrounded by tags
Main part, surrounded by tags. Tags are used within this part to identify paragraph boundaries.
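
The document ID pattern described above can be split into its parts with a short parser. This is a minimal sketch, assuming that "SRC" is always one of the three-letter source codes from the table and that IDs contain no other characters. The function name parse_doc_id and the example ID are hypothetical and are not taken from the corpus.

```python
import datetime
import re
from typing import NamedTuple

# Three-letter source codes from the table above (assumed to be the only valid prefixes).
SOURCE_CODES = {"AFE", "APE", "CNE", "LAT", "NYT", "SLN", "UME", "XIE"}

# SRCyyyymmdd.nnnn: source code, date, then a four-digit sequence number.
_DOC_ID = re.compile(r"([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})")


class DocId(NamedTuple):
    source: str
    date: datetime.date
    sequence: int


def parse_doc_id(doc_id: str) -> DocId:
    """Split an ID of the form SRCyyyymmdd.nnnn into its components."""
    match = _DOC_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not in SRCyyyymmdd.nnnn form: {doc_id!r}")
    source, year, month, day, seq = match.groups()
    if source not in SOURCE_CODES:
        raise ValueError(f"unknown source code: {source!r}")
    # datetime.date raises ValueError for impossible dates such as month 13.
    return DocId(source, datetime.date(int(year), int(month), int(day)), int(seq))


# Hypothetical ID, for illustration only.
parsed = parse_doc_id("NYT20030115.0007")
print(parsed.source, parsed.date.isoformat(), parsed.sequence)
```

Because the date is parsed into a real calendar date, malformed IDs are rejected early rather than silently accepted.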

For more information please visit the HARD Project website.

Samples

For an example of the data in this corpus, please view this sample (TXT).

Updates

None at this time.
