The AQUAINT Corpus of English News Text

Introduction

This file contains documentation on the AQUAINT Corpus, Linguistic Data Consortium (LDC) catalog number LDC2002T31 and ISBN 1-58563-240-6.

This corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared by the LDC for the AQUAINT Project, and will be used in official benchmark evaluations conducted by the National Institute of Standards and Technology (NIST).

Organization and Format of the Data

The text data are separated into directories by source (apw, nyt, xie); within each source, data files are subdivided by year, and within each year, there is one file per date of collection. Each file is named to reflect the source and date, and contains a stream of SGML-tagged text data presenting the series of news stories reported on the given date as a concatenation of DOC elements (i.e. blocks of text bounded by <DOC> and </DOC> tags).

All data files are published in compressed form, using the GNU "gzip" utility; as such, all files have a ".gz" extension, and will have no file name extension when uncompressed in the usual way (i.e. just the base file name, consisting of "YYYYMMDD_SRC").
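
The naming and compression conventions described above can be handled along the following lines. This is only a minimal sketch: the function names (parse_name, read_lines) are hypothetical, the file-name pattern is inferred from the "YYYYMMDD_SRC" description, and the "latin-1" default encoding is an assumption chosen because it never fails to decode; this document does not state the corpus encoding.

```python
import gzip
import re
from datetime import date, datetime
from pathlib import Path
from typing import Iterator, Tuple

# Assumed pattern: eight digits (YYYYMMDD), an underscore, then a source label.
NAME_RE = re.compile(r"^(\d{8})_(.+)$")


def parse_name(path: str) -> Tuple[date, str]:
    """Split a 'YYYYMMDD_SRC' file name (with or without '.gz') into (date, label)."""
    name = Path(path).name
    if name.endswith(".gz"):
        name = name[:-3]
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date(), match.group(2)


def read_lines(path: str, encoding: str = "latin-1") -> Iterator[str]:
    """Yield the lines of a gzip-compressed data file without unpacking it on disk."""
    # The encoding default is an assumption, not something stated by the corpus.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Reading through gzip.open avoids keeping uncompressed copies of the data, which amount to slightly over 3 gigabytes for the whole corpus.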

A single DTD file is provided (aquaint.dtd) which serves to define the markup structure for any standard SGML parser. All data files in this corpus have been validated using this DTD.

While all the data files are covered by a single DTD, it is not the case that they all have a single pattern of markup. Rather, all files share a core markup structure, with minor variations in the peripheral regions of each DOC element, and the DTD has been written to accommodate the variations.

For those who might prefer to process the data without using an SGML parser, please note the following regarding markup patterns:

Each line in a file contains either a single SGML tag or text data; if it contains an SGML tag, the left-angle-bracket of the tag is the first character on the line.
The news content of each story is contained within the TEXT element (i.e. bounded by <TEXT> and </TEXT> tags), and includes P tags to mark paragraph boundaries within the text.
Variations in SGML markup across data files all involve elements that are external to the TEXT element in each DOC; the following listing summarizes the markup patterns as a function of the directory structure -- parentheses in the listing represent the hierarchical structure of the markup. An asterisk next to a tag name indicates that the tag is "optional" -- it may or may not be present in the DOC units for the given source.


apw
 (DOC (DOCNO) (DOCTYPE*) (DATE_TIME*) (HEADER) (BODY (SLUG) (HEADLINE*) (TEXT (SUBHEAD*) ) ) (TRAILER) )

nyt/1998:
 (DOC (DOCNO) (DOCTYPE) (DATE_TIME) (HEADER) (BODY (SLUG*) (HEADLINE*) (TEXT (ANNOTATION*) ) ) (TRAILER) )

xie/* (all years):
 (DOC (DOCNO) (DATE_TIME) (BODY (HEADLINE) (TEXT) ) )
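
For readers processing the data without an SGML parser, the markup notes and listing above suggest a simple line-oriented reader. The following is a sketch under stated assumptions, not a reference implementation: it assumes the DOCNO identifier sits on the same line as its tags, it treats any line starting with "<" as a tag line, and it chooses to drop the content of ANNOTATION and SUBHEAD elements inside TEXT. The names iter_docs and SKIPPED are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, List

TAG_RE = re.compile(r"<[^>]+>")
# Design choice of this sketch: content of these TEXT-internal elements is dropped.
SKIPPED = ("ANNOTATION", "SUBHEAD")


def _flush(para: List[str], paragraphs: List[str]) -> None:
    """Join the buffered lines into one paragraph and reset the buffer."""
    text = " ".join(para).strip()
    if text:
        paragraphs.append(text)
    para.clear()


def iter_docs(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield {'docno': str, 'paragraphs': [str, ...]} for each DOC element."""
    docno = ""
    paragraphs: List[str] = []
    para: List[str] = []
    in_doc = in_text = False
    skip_until = ""  # closing tag of an element whose content is being skipped

    for raw in lines:
        line = raw.strip()
        if line.startswith("<DOC>"):
            docno, paragraphs, para = "", [], []
            in_doc, in_text, skip_until = True, False, ""
        elif not in_doc:
            continue
        elif line.startswith("</DOC>"):
            yield {"docno": docno, "paragraphs": paragraphs}
            in_doc = False
        elif line.startswith("<DOCNO>"):
            # Assumption: identifier and both tags share one line.
            docno = TAG_RE.sub("", line).strip()
        elif line.startswith("<TEXT>"):
            in_text = True
        elif line.startswith("</TEXT>"):
            _flush(para, paragraphs)  # also covers TEXT without P tags
            in_text = False
        elif not in_text:
            continue  # elements outside TEXT are ignored here
        elif skip_until:
            if line.startswith(skip_until):
                skip_until = ""
        elif line.startswith("<"):
            name = line[1:].split(">")[0].strip("/ ")
            if name == "P":
                _flush(para, paragraphs)  # <P> and </P> both end a paragraph
            elif name in SKIPPED and not line.startswith("</"):
                if f"</{name}>" not in line:
                    skip_until = f"</{name}>"
        elif line:
            para.append(line)
```

Because only DOCNO and the content of TEXT are consulted, the same reader applies to all three sources regardless of the peripheral variations listed above. A DOC with no usable TEXT content simply yields an empty paragraph list.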

Data Selection, Overlap with Other Corpora

The sampling for this corpus covers the period from January 1996 to September 2000, inclusive, for the Xinhua text collection, and from June 1998 to September 2000, inclusive, for New York Times and Associated Press. All available stories from each date in the sampling period are included -- about one million DOC elements in all, comprising slightly over 3 gigabytes of data when uncompressed (counting all SGML markup and content peripheral to the TEXT elements of stories).

There is some overlap between the NYT and APW content in this corpus and the content of the TDT2 and TDT3 corpora that have been released previously by the LDC. In particular, a subset of the 1998 data from these two sources has been used in TDT. Stories that appear in both TDT and AQUAINT should have the same DOCNO identifiers in both corpora (even though the markup layout may be slightly different).

Note that previous (non-TDT) corpora of English newswire text from the LDC have included coverage of APW and NYT from 1994 through May 1998, so that the sampling period presented here represents a continuation of that coverage. This is the first LDC corpus to include English text from Xinhua.

Comments on Data Quality

As mentioned previously, all data files have been validated with a standard SGML parser (nsgmls, available from http://www.jclark.com/sp/) using the DTD provided. All one million DOCNO identifiers are globally distinct.

We have sought to make sure, as far as possible, that the TEXT element contains all and only the actual "reportage", the discursive language content, of each story. Note that the ANNOTATION element found in NYT data is used to bracket instructions to editors that are not part of the actual story content (typically, directives like "STORY CAN END HERE"); the SUBHEAD element in APW marks "headline-like" strings in stories that consist of reviews or summaries of disparate events.
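
Users who wish to confirm the distinctness of DOCNO identifiers on their own copy of the data could do so roughly as follows. This is a minimal sketch; the function name duplicate_docnos is hypothetical, and the regular expression assumes that each DOCNO identifier appears on one line together with its opening and closing tags.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumption: "<DOCNO> identifier </DOCNO>" occupies a single line.
DOCNO_RE = re.compile(r"^<DOCNO>\s*(.*?)\s*</DOCNO>")


def duplicate_docnos(lines: Iterable[str]) -> Dict[str, int]:
    """Return each DOCNO seen more than once, mapped to its number of occurrences."""
    counts: Counter = Counter()
    for line in lines:
        match = DOCNO_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return {docno: n for docno, n in counts.items() if n > 1}
```

An empty result over the concatenated lines of all data files would be consistent with the statement above.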

However, despite our best intentions, there is unavoidable variation in the formatting of text data transmitted over these newswire services, and in a small percentage of stories, the typical cues for delimiting the TEXT content are lacking. As a result, about 0.7% of the DOC units have little or no "TEXT" content, and/or have a substantial amount of data outside the TEXT element. (This mostly occurs in NYT data -- in fact, about 2% of the 314K NYT stories have this problem.)

Also, many of the "stories" transmitted over newswire are actually messages to editors, regarding upcoming content, test messages, and so on -- this is especially true of the APW and NYT streams. We have sought, as far as possible within a limited time, to identify patterns in the data that identify these non-news units in the collection. The "DOCTYPE" element found in these two sources will contain "MISCELLANEOUS TEXT" if a non-news pattern was found in the given DOC, and will contain "NEWS STORY" otherwise. Of course there is bound to be some percentage of non-news DOCs that are incorrectly tagged as "news", but there should be few if any DOCs that are incorrectly tagged as "miscellaneous".
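
The DOCTYPE values described above can be used to filter out non-news units, for example as sketched below. The function names (doctype_of, is_news) and the keep_untyped parameter are hypothetical, and the code assumes that the DOCTYPE value shares a line with its tags. Since DOCTYPE is absent from Xinhua data and optional in some APW data, the sketch keeps DOCs without a DOCTYPE by default; that default is a choice made here, not a recommendation from the corpus.

```python
import re
from typing import Iterable, Optional

# Assumption: "<DOCTYPE> value </DOCTYPE>" occupies a single line.
DOCTYPE_RE = re.compile(r"^<DOCTYPE>\s*(.*?)\s*</DOCTYPE>")


def doctype_of(doc_lines: Iterable[str]) -> Optional[str]:
    """Return the DOCTYPE value of one DOC element, or None if it has none."""
    for line in doc_lines:
        match = DOCTYPE_RE.match(line.strip())
        if match:
            return match.group(1)
    return None


def is_news(doc_lines: Iterable[str], keep_untyped: bool = True) -> bool:
    """True if the DOC is tagged "NEWS STORY"; untyped DOCs follow keep_untyped."""
    doctype = doctype_of(doc_lines)
    if doctype is None:
        return keep_untyped
    return doctype == "NEWS STORY"
```

Given the error pattern described above, filtering on "NEWS STORY" should discard few genuine stories, but some non-news DOCs will remain after filtering.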

Please see file.tbl for a complete listing of the files.
