README File for the GIGAWORD ENGLISH TEXT CORPUS
================================================

INTRODUCTION
------------

The Gigaword English Corpus is a comprehensive archive of newswire
text data that has been acquired over several years by the Linguistic
Data Consortium (LDC) at the University of Pennsylvania. Four
distinct international sources of English newswire are represented
here:

  - Agence France Presse English Service (afe)
  - Associated Press Worldstream English Service (apw)
  - The New York Times Newswire Service (nyt)
  - The Xinhua News Agency English Service (xie)

The three-character abbreviations shown above represent both the
directory names where the data files are found, and the 3-letter
prefix that appears at the beginning of every file name.

Much of the content in this collection has been published previously
by the LDC in a variety of other, older corpora, particularly the
North American News text corpora, the various TDT corpora, and the
AQUAINT text corpus. But a significant amount of material is being
released here for the first time: all of the Agence France Presse
content, the 1995 and 2001 Xinhua content, and the portions of NYT
and APW dating from February 2001 forward.


DATA FORMAT AND SGML MARKUP
---------------------------

Each data file name consists of the 3-letter prefix, followed by a
6-digit date (representing the year and month during which the file
contents were generated by the respective news source), followed by a
".gz" file extension, indicating that the file contents have been
compressed using the GNU "gzip" compression utility (RFC 1952). So,
each file contains all the usable data received by the LDC for the
given month from the given news source.

All text data are presented in SGML form, using a very simple,
minimal markup structure; all text consists of printable ASCII and
whitespace. The file "gigaword_e.dtd" in the "docs" directory
provides the formal "Document Type Declaration" for parsing the SGML
content. The corpus has been fully validated by a standard SGML
parser utility (nsgmls), using this DTD file. The markup structure,
common to all data files, can be summarized as follows:

  <DOC id="XXXYYYYMMDD.NNNN" type="story" >
  <HEADLINE>
  The Headline Element is Optional -- not all DOCs have one
  </HEADLINE>
  <DATELINE>
  The Dateline Element is Optional -- not all DOCs have one
  </DATELINE>
  <TEXT>
  <P>
  Paragraph tags are only used if the 'type' attribute of the DOC
  happens to be "story" -- more on the 'type' attribute below...
  </P>
  </TEXT>
  </DOC>

Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.

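As an informal illustration of these conventions (it is not part of
the corpus distribution), the short Python sketch below walks the four
source directories, derives the source and month from each file name,
and streams the compressed contents directly; the "DATA_ROOT" path is
a hypothetical placeholder for wherever a local copy of the corpus is
mounted.

    import gzip
    import re
    from pathlib import Path

    # Hypothetical mount point for the corpus (contains afe/, apw/, nyt/, xie/).
    DATA_ROOT = Path("/corpora/gigaword_english")

    # File names: 3-letter source prefix + YYYYMM + ".gz", e.g. "nyt199501.gz".
    NAME_RE = re.compile(r"^(afe|apw|nyt|xie)(\d{4})(\d{2})\.gz$")

    for path in sorted(DATA_ROOT.glob("*/*.gz")):
        match = NAME_RE.match(path.name)
        if match is None:
            continue                    # not one of the monthly data files
        source, year, month = match.groups()
        # The files are gzip-compressed ASCII with "\n" line termination, so
        # they can be streamed line by line without decompressing to disk.
        with gzip.open(path, "rt", encoding="ascii") as handle:
            line_count = sum(1 for _ in handle)
        print(source, year, month, line_count)
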
For every "opening" tag (DOC, HEADLINE, DATELINE, TEXT, P), there is a corresponding "closing" tag -- always. The attribute values in the DOC tag are always presented within double-quotes; the "id=" attribute of DOC consists of the 3-letter source abbreviation (in CAPS), an 8-digit date string representing the date of the story (YYYYMMDD), a period, and a 4-digit sequence number starting at "0001" for each date (e.g. "NYT199501.0001"); in this way, every DOC in the corpus is uniquely identifiable by the id string. Every SGML tag is presented alone on one line, separate from other tags, and from the text content (so a simple process like the UNIX "grep -v '<'" will eliminate all tags, and retain all the text content). The structure shown above represents some notable differences relative to the markup strategy employed in previous LDC text corpora; these are intended to facilitate bulk processing of the present corpus. The major differences are: - Earlier corpora usually organized the data as one file per day, or limited the average file size to one megabyte (MB). Typical compressed file sizes in the current corpus range from about 3 MB (1995 Xinhua data) to about 30 MB (1996-7 NYT data); this equates to a range of about 9 to 90 MB when the data are uncompressed. In general, these files are not intended for use with interactive text editors or word processing software (though many such programs are likely to work reasonably well with these files). Rather, it's expected that the files will be used as input to programs that are geared to dealing with data in such quantities, for filtering, conditioning, indexing, statistical summary, etc. (The LDC can provide open source software, mostly written in Perl, for extracting DOCs from such data files, using the "id" string or other search criteria for story selection; see http://www.ldc.upenn.edu/Using/ .) - Earlier corpora tended to use different markup outlines (different tag sets) depending on the source of the data, because different sources came to us with different structural properties, and we had chosen to preserve these as much as possible (even though many elements of the delivered structure may have been meaningless for research use). The present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs). The "dateline" is a brief string typically found at the beginning of the first paragraph in each news story, giving the location the report is coming from, and sometimes the news service and/or date; since this content is not part of the initial sentence, we separate it from the first paragraph (this was not done in previous corpora). - Earlier corpora tended to include "custom" SGML entity references, which were intended to preserve things like special punctuation or typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc). The present corpus uses only one SGML entity reference: ``&'' (or ``&'' -- both upper-case and lower-case forms are present), which represents the literal ampersand "&" character. All other specialized control characters have been filtered out, and unusual punctuation (such as the underscore character, used in NYT and APW to represent an "em-dash" character) has been left as-is, or converted to simple equivalents (e.g. hyphens). 
The structure shown above represents some notable differences relative
to the markup strategy employed in previous LDC text corpora; these
are intended to facilitate bulk processing of the present corpus. The
major differences are:

- Earlier corpora usually organized the data as one file per day, or
  limited the average file size to one megabyte (MB). Typical
  compressed file sizes in the current corpus range from about 3 MB
  (1995 Xinhua data) to about 30 MB (1996-7 NYT data); this equates to
  a range of about 9 to 90 MB when the data are uncompressed. In
  general, these files are not intended for use with interactive text
  editors or word processing software (though many such programs are
  likely to work reasonably well with these files). Rather, it is
  expected that the files will be used as input to programs that are
  geared to dealing with data in such quantities, for filtering,
  conditioning, indexing, statistical summary, etc. (The LDC can
  provide open-source software, mostly written in Perl, for extracting
  DOCs from such data files, using the "id" string or other search
  criteria for story selection; see http://www.ldc.upenn.edu/Using/ .)

- Earlier corpora tended to use different markup outlines (different
  tag sets) depending on the source of the data, because different
  sources came to us with different structural properties, and we had
  chosen to preserve these as much as possible (even though many
  elements of the delivered structure may have been meaningless for
  research use). The present corpus uses only the information
  structure that is common to all sources and serves a clear function:
  headline, dateline, and core news content (usually containing
  paragraphs). The "dateline" is a brief string typically found at
  the beginning of the first paragraph in each news story, giving the
  location the report is coming from, and sometimes the news service
  and/or date; since this content is not part of the initial sentence,
  we separate it from the first paragraph (this was not done in
  previous corpora).

- Earlier corpora tended to include "custom" SGML entity references,
  which were intended to preserve things like special punctuation or
  typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc.). The
  present corpus uses only one SGML entity reference: "&amp;" (or
  "&AMP;" -- both lower-case and upper-case forms are present), which
  represents the literal ampersand "&" character. All other
  specialized control characters have been filtered out, and unusual
  punctuation (such as the underscore character, used in NYT and APW
  to represent an "em-dash" character) has been left as-is or
  converted to simple equivalents (e.g. hyphens).

- In earlier corpora, newswire data were presented as streams of
  undifferentiated "DOC" units; depending on the source and corpus,
  varying amounts of quality checking and filtering were done to
  eliminate noisy or unsuitable content (e.g. test messages). For
  this release, all sources have received a uniform treatment in terms
  of quality control, and we have applied a rudimentary (and
  _approximate_) categorization of DOC units into four distinct
  "types". The classification is indicated by the type="string"
  attribute that is included in each opening DOC tag. The four types
  are:

  * story : This is by far the most frequent type, and it represents
    the most typical newswire item: a coherent report on a particular
    topic or event, consisting of paragraphs and full sentences. As
    indicated above, the paragraph tag "<P>" is found only in DOCs of
    this type; in the other types described below, the text content is
    rendered with no additional tags or special characters -- just
    lines of ASCII tokens separated by whitespace.

  * multi : This type of DOC contains a series of unrelated "blurbs",
    each of which briefly describes a particular topic or event; this
    type is typically applied to DOCs that contain "summaries of
    today's news", "news briefs in ... (some general area like finance
    or sports)", and so on. Each paragraph-like blurb by itself is
    coherent, but it does not bear any necessary relation of
    topicality or continuity relative to its neighbors.

  * advis : (short for "advisory") These are DOCs which the news
    service addresses to news editors -- they are not intended for
    publication to the "end users" (the populations who read the
    news); as a result, DOCs of this type tend to contain obscure
    abbreviations and phrases, which are familiar to news editors but
    may be meaningless to the general public. We also find a lot of
    formulaic, repetitive content in DOCs of this type (contact phone
    numbers, etc.).

  * other : This type represents DOCs that clearly do not fall into
    any of the above categories -- in general, items of this type are
    intended for broad circulation (they are not advisories), they may
    be topically coherent (unlike "multi" type DOCs), and they
    typically do not contain paragraphs or sentences (they aren't
    really "stories"); these are things like lists of sports scores,
    stock prices, temperatures around the world, and so on.

The general strategy for categorizing DOCs into these four classes
was, for each source, to discover the most common and frequent clues
in the text stream that correlated with the three "non-story" types,
and to apply the appropriate label for the type="..." attribute
whenever the DOC displayed one of these specific clues. When none of
the known clues was in evidence, the DOC was classified as a "story".
This means that the most frequent classification error will tend to be
the use of type="story" on DOCs that are actually some other type.
But the number of such errors should be fairly small compared to the
number of "non-story" DOCs that are correctly tagged as such.

Note that the markup was applied algorithmically, using logic based on
less-than-complete knowledge of the data. For the most part, the
HEADLINE, DATELINE and TEXT tags have their intended content; but due
to the inherent variability (and the inevitable source errors) in the
data, users may find occasional mishaps where the headline and/or
dateline were not successfully identified (and hence show up within
TEXT), or where an initial sentence or paragraph has been mistakenly
tagged as the headline or dateline.


DATA QUANTITIES
---------------

The "docs" directory contains a set of plain-text tables (datastats.*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the four "type" categories. The
overall totals for each source are summarized below. Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. nearly 12 gigabytes, total); the "Gzip-MB" column
shows totals for compressed file sizes as stored on the DVD-ROM; the
"K-wrds" numbers are simply the number of whitespace-separated tokens
(of all types) after all SGML tags are eliminated.
  Source  #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
  AFE         44      417     1216   170969   656269
  APW         91     1213     3647   539665  1477466
  NYT         96     2104     5906   914159  1298498
  XIE         83      320      940   131711   679007
  TOTAL      314     4054    11709  1756504  4111240

The following tables present "Text-MB", "K-wrds" and "#DOCs" broken
down by source and DOC type; "Text-MB" represents the total number of
characters (including whitespace) after SGML tags are eliminated.

                  Text-MB   K-wrds    #DOCs
  type="advis":
     AFE               33     3748    16788
     APW              115    17292    29628
     NYT              446    69453   126812
     XIE               12     1885     6473
     TOTAL            606    92378   179701

  type="multi":
     AFE               27     4032    12072
     APW              212    33934    50143
     NYT              110    17773    28455
     XIE               58     9125    41367
     TOTAL            407    64864   132037

  type="other":
     AFE               25     3575    36279
     APW              235    33751   214710
     NYT              109    16195    23867
     XIE               33     4981    44776
     TOTAL            402    58502   319632

  type="story":
     AFE              992   159614   591130
     APW             2791   454693  1182985
     NYT             4904   810744  1119364
     XIE              728   115717   586391
     TOTAL           9415  1540768  3479870
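Counts of this kind can be reproduced directly from the data files,
since the 'type' attribute appears on every opening DOC tag. The
following Python sketch is illustrative only; the file name is a
hypothetical example, and, as noted earlier, the classification itself
is approximate.

    import gzip
    import re
    from collections import Counter

    TYPE_RE = re.compile(r' type="(story|multi|advis|other)"')

    def count_doc_types(path):
        """Tally the DOCs in one data file by their type attribute."""
        counts = Counter()
        with gzip.open(path, "rt", encoding="ascii") as handle:
            for line in handle:
                if line.startswith("<DOC "):
                    match = TYPE_RE.search(line)
                    if match:
                        counts[match.group(1)] += 1
        return counts

    # Hypothetical file name following the naming convention described earlier.
    print(count_doc_types("xie199501.gz"))
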
GENERAL AND SOURCE-SPECIFIC PROPERTIES OF THE DATA
--------------------------------------------------

Most of the text data (all of AFE and NYT, most of APW) were received
at LDC via dedicated, 24-hour/day electronic feeds (leased phone lines
in the case of APW and NYT, a local satellite dish for AFE). These
24-hour transmission services were all susceptible to "line noise"
(occasional corruption of text content), as well as service outages
both at the data source and at our receiving computers. Usually, the
various disruptions of a newswire data stream would leave tell-tale
evidence in the form of byte values falling outside the range of
printable ASCII characters, or recognizable patterns of anomalous
ASCII strings.

All XIE data and a two-year portion of APW data were received as bulk
electronic text archives via internet retrieval. As such, they were
not susceptible to modem line noise or related disruptions, though
this does not guarantee that the source data are free of mishaps. We
can say for certain that all the data, including the internet bulk
archives, have undergone a consistent extent of quality control, to
eliminate non-ASCII content and other obvious forms of corruption.

Naturally, since the source data are all generated manually on a daily
basis, there will be a small percentage of human errors common to all
sources: missing whitespace, incorrect or variant spellings, badly
formed sentences, and so on, as are normally seen in newspapers. No
attempt has been made to address this property of the data.

Another common feature to be noted is that stories may be repeated in
the course of daily transmissions (or daily archiving). Sometimes a
later transmission of a story comes with minor alterations (fixed
spelling, one or more paragraphs added or removed); but just as often,
the collection ends up with two or more DOCs that are fully identical.
In general, though, this practice affects a relatively small minority
of the overall content. (NYT is perhaps the worst offender in this
regard, sometimes sending as many as six copies of some featured
story.) No attempt has been made to eliminate these duplications.

Finally, the 24-hour services typically show a practice of breaking
long stories into chunks, and sending the chunks as separate DOC
units, with each unit having the normal structural features of a full
story. (This is especially prevalent in NYT, which has the longest
average story length of all the sources.) Normally, when this sort of
splitting is done, cues are provided in the text of each chunk that
allow editors to reconstruct the full report; but these cues tend to
rely heavily on editorial skills -- it is taken for granted by each
news service that the stories will be reassembled manually as needed
-- so the process of combining the pieces into a full story is not
amenable to an algorithmic solution, and no attempt has been made to
do this.

The following sections explain data properties that are particular to
each source.

AFE:  There is a gap of 54 months in the AFE collection (about four
and a half years), spanning from May 1997 to December 2001; the LDC
had discontinued its subscription to the AFP English wire service
during this period, and at the point where we restored the
subscription near the end of 2001, there was no practical means of
recovering the portion that was missed. Apart from this, the AFE
content shows a high degree of internal consistency (relative to APW
and NYT), in terms of day-to-day content and typographic conventions.

APW:  This service provides up to six other languages besides English
on the same modem connection, with DOCs in all languages interleaved
at random; of course, we have extracted just the English content for
publication here. The service draws news from quasi-independent
offices around the world, so there tends to be more variability here
in terms of typographic conventions; there is also a noticeably higher
percentage of non-story content, especially in the "other" category:
tables of sports results, stocks, weather, etc.

During the period between August 1999 and August 2001, the modem
service failed to deliver English content, while data in other
languages continued to flow in. (LDC was spooling the data
automatically, and during this period, alarms would be raised only if
the data flow stopped completely -- so the absence of English went
unnoticed.) On learning of this gap in the data, we were able to
recover much of the missing content with help from AP's New York City
office and from Richard Sproat at AT&T Labs -- we gratefully
acknowledge their assistance. Both were able to supply bulk archives
that covered most of the period that we had missed. In particular,
the August - November 1999 and January - September 2000 data were
retrieved from USENET/ClariNet and web archives that AT&T had
collected for its own research use, while the October 2000 - August
2001 data were supplied by AP directly from their own web-service
archive. As a result of the varying sources, these sub-parts of the
APW data tend to differ from the rest of the collection (and from each
other) in terms of daily quantity, extent of typographic variance, and
possibly the breadth of subject matter being reported.

NYT:  There have been only a few scattered service interruptions for
NYT, and these typically involve gaps of a few days (the longest was
about two weeks). The NYT service provides not only the content that
is specific to the New York Times daily newspaper publication, but
also a wide and varied sampling of news and features from other urban
and regional newspapers around the U.S., including:

    Albany Times Union
    Arizona Republic
    Atlanta Constitution
    Bloomberg Business News
    Boston Globe
    Casper (Wyo.) Star-Tribune
    Chicago Sun-Times
    Columbia News Service
    Cox News Service
    Fort Worth Star-Telegram
    Hearst Newspapers
    Houston Chronicle
    International Herald Tribune
    Kansas City Star
    Los Angeles Daily News
    San Antonio Express-News
    San Francisco Chronicle
    Seattle Post-Intelligencer
    States News Service

Typically, the actual source of a given DOC was indicated in the raw
data via an abbreviation (e.g. AZR, BLOOM, COX, LADN, NYT, SPI, etc.)
at the end of the "slug" line that accompanies every story. (The
"slug" is a short string, usually less than 40 characters, that news
editors use to tag and sort stories and topics over the course of a
day.) Because this feature of NYT slug lines is quite consistent and
informative, the markup strategy was adapted to make sure that the
full slug line would always be included as part of the content of the
DATELINE tag in every DOC. (Slugs were either not present or not
retained in the other three newswire sources.) Some examples:

    TEMPE, Ariz. (BC-FIESTA-BLOCK-AZR)
    LOS ANGELES (BC-BKN-LAKERS-ONEAL-LADN)
    NEW YORK (BC-NY-NEWYEAR-ART-1STLD-WRITETHRU-675&ADD-NYT)
    (BC-OBIT-KENNEDY-NYT)

The first three examples are cases where the opening paragraph had a
dateline string; in the fourth, the opening paragraph had no dateline.
The slug is normally ALL-CAPS-AND-HYPHENS (this is how it is presented
by the newswire service -- there are some exceptions, of course, and
the occasional glitch); it is always preceded by a space and an open
parenthesis, and always followed by a close parenthesis. Meanwhile,
the dateline string taken from the first paragraph (when present) is
always presented first on the line, with no initial space; it can be
mixed-case, may have multiple word tokens, and may have punctuation.

Features of text formatting, style and subject matter may vary
somewhat according to the original source. Overall, NYT shows the
largest amount of "advisory" content, both in terms of how many DOCs
are addressed specifically to the receiving news editors, and in terms
of additional "advice" included within regular news stories, e.g.
"(STORY CAN END HERE. OPTIONAL MATERIAL FOLLOWS)".
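Given these conventions, the dateline string and the slug can usually
be separated with a single regular expression. The following Python
sketch is illustrative only (real data will include occasional
exceptions and glitches, as noted above); it splits the four example
DATELINE lines shown earlier.

    import re

    # Optional mixed-case dateline, then the parenthesized slug at the end
    # of the line; the slug is preceded by a space whenever a dateline is
    # present. Based on the conventions described above.
    DATELINE_RE = re.compile(r'^(?:(?P<dateline>.*\S)\s+)?\((?P<slug>[^)]+)\)\s*$')

    examples = [
        "TEMPE, Ariz. (BC-FIESTA-BLOCK-AZR)",
        "LOS ANGELES (BC-BKN-LAKERS-ONEAL-LADN)",
        "NEW YORK (BC-NY-NEWYEAR-ART-1STLD-WRITETHRU-675&ADD-NYT)",
        "(BC-OBIT-KENNEDY-NYT)",
    ]

    for line in examples:
        match = DATELINE_RE.match(line)
        if match:
            print(repr(match.group("dateline")), "->", match.group("slug"))
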
XIE:  The Xinhua English news archive provided fairly consistent
formatting and coverage spanning 1995 through 2001, making it
relatively easy to prepare for research use. Many stories have the
distinct flavor of an official government information source, in
contrast to the other news services represented here. The material is
otherwise unremarkable.


David Graff
Linguistic Data Consortium
January, 2003