README File for the ENGLISH GIGAWORD TEXT CORPUS
================================================
LDC2011T07
==========
Fifth Edition
=============

INTRODUCTION
------------

The English Gigaword Corpus is a comprehensive archive of newswire
text data that has been acquired over several years by the Linguistic
Data Consortium (LDC) at the University of Pennsylvania. This is the
fifth edition of the English Gigaword Corpus.

This edition includes all of the contents of the previous edition
(LDC2009T13) as well as new data from the same six sources presented
there, covering the 24-month period of January 2009 through December
2010. Note that during this period, one of the sources went through a
reorganization and name change: LA Times/Washington Post became
Washington Post/Bloomberg. See the section below titled "General and
Source-Specific Properties of the Data" for further details.

The seven distinct international sources of English newswire included
in this edition are the following:

  - Agence France-Presse, English Service              (afp_eng)
  - Associated Press Worldstream, English Service      (apw_eng)
  - Central News Agency of Taiwan, English Service     (cna_eng)
  - New York Times Newswire Service                    (nyt_eng)
  - Xinhua News Agency, English Service                (xin_eng)
  - Los Angeles Times/Washington Post Newswire Service (ltw_eng)
  - Washington Post/Bloomberg Newswire Service         (wpb_eng)

The seven-letter codes in the parentheses above consist of the
three-character source name abbreviation and the three-character
language code ("eng"), separated by an underscore ("_") character.
The three-letter language code conforms to LDC's internal convention,
based on the ISO 639-3 standard. The seven-letter codes are used both
in the directory names where the data files are found and in the
prefix that appears at the beginning of every data file name.

As with other Gigaword releases, some of the content in this corpus
has been published previously by the LDC in a variety of other, older
corpora, particularly the North American News text corpora, the
various TDT corpora, and the AQUAINT text corpus, as well as earlier
editions of Gigaword English.

DATA FORMAT AND SGML MARKUP
---------------------------

Each data file name consists of the 7-letter prefix plus another
underscore character, followed by a 6-digit date representing the
year and month during which the file contents were generated by the
respective news source, followed by a ".gz" file extension,
indicating that the file contents have been compressed using the GNU
"gzip" compression utility (RFC 1952). So, each file contains all the
usable data received by LDC for the given month from the given news
source.

All text data are presented in SGML form, using a very simple,
minimal markup structure; all text consists of printable ASCII and
whitespace. The file "gigaword.dtd" in the "dtd" directory provides
the formal "Document Type Declaration" for parsing the SGML content.
The corpus has been fully validated by a standard SGML parser utility
(nsgmls), using this DTD file.

The markup structure, common to all data files, can be summarized as
follows:

  <DOC id="AFP_ENG_20100101.0001" type="story" >
  <HEADLINE>
  The Headline element is optional -- not all DOCs have one
  </HEADLINE>
  <DATELINE>
  The Dateline element is optional -- not all DOCs have one
  </DATELINE>
  <TEXT>
  <P>
  Paragraph tags are only used if the 'type' attribute of the DOC
  happens to be "story" -- more on the 'type' attribute below...
  </P>
  </TEXT>
  </DOC>
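
(The following is a minimal, unofficial sketch -- not part of the
corpus distribution -- showing one way to iterate over the DOC
elements of a single gzipped data file in Python, relying only on the
markup conventions illustrated above; the file name in the usage
comment is hypothetical.)

    import gzip
    import re

    # Every tag appears alone on its own line, and the DOC tag carries
    # quoted "id" and "type" attributes (see the example above).
    DOC_OPEN = re.compile(r'<DOC id="([^"]+)" type="([^"]+)"')

    def iter_docs(path):
        """Yield (doc_id, doc_type, text) for each DOC in one data file.

        'text' is the tag-free content (what a command like
        "grep -v '<'" would retain), joined with newlines.
        """
        doc_id = doc_type = None
        lines = []
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                line = line.rstrip("\n")
                m = DOC_OPEN.match(line)
                if m:
                    doc_id, doc_type = m.group(1), m.group(2)
                    lines = []
                elif line.startswith("</DOC>"):
                    yield doc_id, doc_type, "\n".join(lines)
                elif not line.startswith("<"):
                    lines.append(line)

    # Example usage (hypothetical file name):
    #   for doc_id, doc_type, text in iter_docs("afp_eng_201001.gz"):
    #       if doc_type == "story":
    #           ...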

Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.
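
(As a small, hypothetical illustration of the file-naming convention
described at the start of this section, the sketch below decomposes a
data file name into its source, language, year and month components;
it is not part of the corpus tooling.)

    import re

    # File names follow the pattern described above:
    #   <src>_<lang>_<YYYYMM>.gz    e.g.  afp_eng_200901.gz
    FILENAME = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")

    def parse_data_filename(name):
        """Split a data file name into (source, language, year, month)."""
        m = FILENAME.match(name)
        if not m:
            raise ValueError("unexpected data file name: %r" % name)
        src, lang, year, month = m.groups()
        return src, lang, int(year), int(month)

    # Example:  parse_data_filename("afp_eng_200901.gz")
    #           -> ("afp", "eng", 2009, 1)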

For every "opening" tag (DOC, HEADLINE, DATELINE, TEXT, P), there is
a corresponding "closing" tag -- always. The attribute values in the
DOC tag are always presented within double-quotes; the "id=" attribute
of DOC consists of the 7-letter source/language abbreviation (in
CAPS), an underscore, an 8-digit date string representing the date of
the story (YYYYMMDD), a period, and a 4-digit sequence number starting
at "0001" for each date (e.g. "NYT_ENG_19950101.0001"); in this way,
every DOC in the corpus is uniquely identifiable by the id string.

There are cases where we assigned a sequence number to a document and
later found that the document was empty or very noisy. In such cases,
we removed the document from the collection but did not reassign
sequence numbers for the rest of that day's documents. In addition,
there are cases in which data were processed after the bulk of a
day's documents; such additional documents were given sequence
numbers starting at a higher point. As a result, there may be some
gaps in sequence numbers.

Every SGML tag is presented alone on one line, separate from other
tags and from the text content (so a simple process like the UNIX
"grep -v '<'" will eliminate all tags and retain all the text
content).

The structure shown above represents some notable differences
relative to the markup strategy employed in previous LDC text
corpora; these are intended to facilitate bulk processing of the
present corpus. The major differences are:

- Earlier corpora usually organized the data as one file per day, or
  limited the average file size to one megabyte (MB). Typical
  compressed file sizes in the current corpus range from about 3 MB
  (1995 Xinhua data) to about 30 MB (1996-7 NYT data); this equates
  to a range of about 9 to 90 MB when the data are uncompressed. In
  general, these files are not intended for use with interactive text
  editors or word processing software (though many such programs are
  likely to work reasonably well with these files). Rather, it is
  expected that the files will be used as input to programs that are
  geared to dealing with data in such quantities, for filtering,
  conditioning, indexing, statistical summary, etc. (The LDC can
  provide open source software, mostly written in Perl, for
  extracting DOCs from such data files, using the "id" string or
  other search criteria for story selection; see
  http://www.ldc.upenn.edu/Using/ .)

- Earlier corpora tended to use different markup outlines (different
  tag sets) depending on the source of the data, because different
  sources came to us with different structural properties, and we had
  chosen to preserve these as much as possible (even though many
  elements of the delivered structure may have been meaningless for
  research use). The present corpus uses only the information
  structure that is common to all sources and serves a clear
  function: headline, dateline, and core news content (usually
  containing paragraphs). The "dateline" is a brief string typically
  found at the beginning of the first paragraph in each news story,
  giving the location the report is coming from, and sometimes the
  news service and/or date; since this content is not part of the
  initial sentence, we separate it from the first paragraph (this was
  not done in previous corpora).

- Earlier corpora tended to include "custom" SGML entity references,
  which were intended to preserve things like special punctuation or
  typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc).

  The present corpus uses only three SGML entity references: "&amp;",
  which represents the literal ampersand "&" character; "&lt;", which
  represents the left/open angle bracket "<"; and "&gt;", which
  represents the right/close angle bracket ">". All other specialized
  control characters have been filtered out, and unusual punctuation
  (such as the underscore character, used in NYT_ENG and APW_ENG to
  represent an "em-dash" character) has been left as-is, or converted
  to simple equivalents (e.g. hyphens).

- In earlier corpora, newswire data were presented as streams of
  undifferentiated "DOC" units; depending on the source and corpus,
  varying amounts of quality checking and filtering were done to
  eliminate noisy or unsuitable content (e.g. test messages). The
  portions of this corpus that were included in the first edition of
  the English Gigaword corpus have received a uniform treatment in
  terms of quality control. The new material added in this edition
  was initially processed by LDC's daily newswire processing pipeline
  to create initial markup, and was then re-processed following the
  design used in the first edition of the Gigaword corpus. The same
  extent of quality control has been applied to the new material;
  however, there may be cases where some treatments of the data, such
  as the categorization of DOC units, have changed.

For all of the documents in this corpus, we have applied a
rudimentary (and _approximate_) categorization of DOC units into four
distinct "types". The classification is indicated by the
``type="string"'' attribute that is included in each opening ``DOC''
tag. The four types are:

* story : This is by far the most frequent type, and it represents
  the most typical newswire item: a coherent report on a particular
  topic or event, consisting of paragraphs and full sentences. As
  indicated above, the paragraph tag "<P>" is found only in DOCs of
  this type; in the other types described below, the text content is
  rendered with no additional tags or special characters -- just
  lines of ASCII tokens separated by whitespace.

* multi : This type of DOC contains a series of unrelated "blurbs",
  each of which briefly describes a particular topic or event; this
  is typically applied to DOCs that contain "summaries of today's
  news", "news briefs in ... (some general area like finance or
  sports)", and so on. Each paragraph-like blurb by itself is
  coherent, but it does not bear any necessary relation of topicality
  or continuity relative to its neighboring sections.

* advis : (short for "advisory") These are DOCs which the news
  service addresses to news editors -- they are not intended for
  publication to the "end users" (the people who read the news); as a
  result, DOCs of this type tend to contain obscure abbreviations and
  phrases, which are familiar to news editors but may be meaningless
  to the general public. We also find a lot of formulaic, repetitive
  content in DOCs of this type (contact phone numbers, etc).

* other : This represents DOCs that clearly do not fall into any of
  the above types -- in general, items of this type are intended for
  broad circulation (they are not advisories), they may be topically
  coherent (unlike "multi" type DOCs), and they typically do not
  contain paragraphs or sentences (they aren't really "stories");
  these are things like lists of sports scores, stock prices,
  temperatures around the world, and so on.

The general strategy for categorizing DOCs into these four classes
was, for each source, to discover the most common and frequent clues
in the text stream that correlated with the three "non-story" types,
and to apply the appropriate label for the ``type=...'' attribute
whenever the DOC displayed one of these specific clues. When none of
the known clues was in evidence, the DOC was classified as a "story".
This means that the most frequent classification error will tend to
be the use of ``type="story"'' on DOCs that are actually some other
type. But the number of such errors should be fairly small, compared
to the number of "non-story" DOCs that are correctly tagged as such.
Also, since some sources tended to change their delivery methods or
format over time, the distribution of non-story types can be seen to
vary significantly by epoch and source. The various "datastats"
tables may be helpful in tracking changes in the nature of the source
data (and LDC's ability to adapt to those changes).

Note that the markup was applied algorithmically, using logic that
was based on less-than-complete knowledge of the data. For the most
part, the HEADLINE, DATELINE and TEXT tags have their intended
content; but due to the inherent variability (and the inevitable
source errors) in the data, users may find occasional mishaps where
the headline and/or dateline were not successfully identified (hence
show up within TEXT), or where an initial sentence or paragraph has
been mistakenly tagged as the headline or dateline.

DATA QUANTITIES
---------------

The "docs" directory contains a set of plain-text tables (datastats_*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the four "type" categories. The
overall totals for each source are summarized below. Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed
(i.e. approximately 26 gigabytes, total); the "Gzip-MB" column shows
totals for compressed file sizes as stored on the DVD-ROM; the
"K-wrds" numbers are simply the number of whitespace-separated tokens
(of all types) after all SGML tags are eliminated.

  Source   #Files  Gzip-MB  Totl-MB    K-wrds     #DOCs
  ------------------------------------------------------
  afp_eng     146     1732     4937    738322   2479624
  apw_eng     193     2700     7889   1186955   3107777
  cna_eng     144       86      261     38491    145317
  ltw_eng     127      651     1694    268088    411032
  nyt_eng     197     3280     8938   1422670   1962178
  wpb_eng      12       42      111     17462     26143
  xin_eng     191      834     2518    360714   1744025
  TOTAL      1010     9325    26348   4032686   9876086

The following tables present "Text-MB", "K-wrds" and "#DOCs" broken
down by source and DOC type; "Text-MB" represents the total number of
characters (including whitespace) after SGML tags are eliminated.

            Text-MB     K-wrds      #DOCs
  advis
   afp_eng      152      21675      54414
   apw_eng      181      27382      39289
   cna_eng        0         24         85
   ltw_eng       88      14132      28987
   nyt_eng      599      95606     157500
   wpb_eng        7       1233       2570
   xin_eng       12       1920       7522
   TOTAL       1039     161972     290367

  multi
   afp_eng       86      13101      37089
   apw_eng      244      40000      58570
   cna_eng       23       3786      19415
   ltw_eng       19       3086       7020
   nyt_eng      124      20435      33183
   wpb_eng        0          0          0
   xin_eng      134      21473      91997
   TOTAL        630     101881     247274

  other
   afp_eng      125      18869     133981
   apw_eng      337      47208     273377
   cna_eng        2        213       1935
   ltw_eng        1        228       1063
   nyt_eng      116      17681      26601
   wpb_eng        1        308        681
   xin_eng      130      18448     161724
   TOTAL        712     102955     599362

  story
   afp_eng     4092     684683    2254140
   apw_eng     6462    1072376    2736541
   cna_eng      211      34476     123882
   ltw_eng     1492     250648     373962
   nyt_eng     7588    1288942    1744894
   wpb_eng       95      15918      22892
   xin_eng     1957     318864    1482782
   TOTAL      21897    3665907    8739093


GENERAL AND SOURCE-SPECIFIC PROPERTIES OF THE DATA
--------------------------------------------------

Much of the text data (all of AFP_ENG; most of APW_ENG, LTW_ENG and
NYT_ENG) were received at LDC via dedicated, 24-hour/day electronic
feeds (leased phone lines in the case of APW_ENG, LTW_ENG and
NYT_ENG, a local satellite dish for AFP_ENG). These 24-hour
transmission services were all susceptible to "line noise"
(occasional corruption of text content), as well as service outages
both at the data source and at our receiving computers. Usually, the
various disruptions of a newswire data stream would leave tell-tale
evidence in the form of byte values falling outside the range of
printable characters, or recognizable patterns of anomalous ASCII
strings.

All XIN_ENG data, all CNA_ENG data, and a 2-year portion of APW_ENG
were received as bulk electronic text archives via internet
retrieval. As such, they were not susceptible to modem line-noise or
related disruptions, though this does not guarantee that the source
data are free of mishaps. Also, the more recent portions of APW_ENG,
LTW_ENG and NYT_ENG have been delivered by various internet-based
subscription systems (explained in more detail in the source-specific
sections below); again, this has eliminated the various problems with
modem noise, but does not assure "perfect" data.

All the data have undergone a consistent extent of quality control,
to remove improper characters and other obvious forms of corruption.
Naturally, since the source data are all generated manually on a
daily basis, there will be a small percentage of human errors common
to all sources: missing whitespace, incorrect or variant spellings,
badly formed sentences, and so on, as are normally seen in
newspapers. No attempt has been made to address this property of the
data.
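
(For users who want to sanity-check or extend the "datastats" figures,
here is a minimal, unofficial sketch of how counts in the style of
"K-wrds" and "#DOCs" -- whitespace-separated tokens and DOC counts per
'type', with all SGML tag lines ignored -- can be tallied for one
monthly file; the path in the usage comment is hypothetical.)

    import gzip
    import re
    from collections import Counter

    TYPE_ATTR = re.compile(r'type="([^"]+)"')

    def tally_types(path):
        """Count DOCs and whitespace-separated tokens per DOC 'type'
        for one gzipped data file, skipping all SGML tag lines."""
        doc_counts, token_counts = Counter(), Counter()
        current_type = None
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith("<DOC "):
                    m = TYPE_ATTR.search(line)
                    current_type = m.group(1) if m else "unknown"
                    doc_counts[current_type] += 1
                elif line.startswith("<"):
                    continue            # other tag lines carry no text
                elif current_type is not None:
                    token_counts[current_type] += len(line.split())
        return doc_counts, token_counts

    # Example usage (hypothetical path):
    #   docs, tokens = tally_types("xin_eng_201001.gz")
    #   for t in sorted(docs):
    #       print(t, docs[t], tokens[t])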

As indicated above, a common feature of the modem-based archives is
that stories may be repeated in the course of daily transmissions (or
daily archiving). Sometimes a later transmission of a story comes
with minor alterations (fixed spelling, one or more paragraphs added
or removed); but just as often, the collection ends up with two or
more DOCs that are fully identical. In general, though, this practice
affects a relatively small minority of the overall content. (NYT_ENG
is perhaps the worst offender in this regard, sometimes sending as
many as six copies of some featured story.) We have not attempted to
eliminate these duplications; however, we plan to make information
about duplicate and similar articles available on our web site as
supplemental information for this corpus. (See the "ADDITIONAL
INFORMATION AND UPDATES" section below.)

Finally, some of the modem services typically show a practice of
breaking long stories into chunks, and sending the chunks as separate
DOC units, with each unit having the normal structural features of a
full story. (This is especially prevalent in NYT_ENG, which has the
longest average story length of all the sources.) Normally, when this
sort of splitting is done, cues are provided in the text of each
chunk that allow editors to reconstruct the full report; but these
cues tend to rely heavily on editorial skills -- it is taken for
granted by each news service that the stories will be reassembled
manually as needed -- so the process of combining the pieces into a
full story is not amenable to an algorithmic solution, and no attempt
has been made to do this. Also, some sources (especially NYT and LTW)
include advisory annotations in the longer stories, providing
guidance on how such stories can be abridged (e.g. "(STORY CAN END
HERE, OPTIONAL MATERIAL FOLLOWS)" and other such phrases, typically
parenthesized and in all caps).

The following sections explain data properties that are particular to
each source.

AFP_ENG:

There is a gap of 54 months in the AFP_ENG collection (about four and
a half years), spanning from May 1997 to December 2001; the LDC had
discontinued its subscription to the AFP English wire service during
this period, and at the point where we restored the subscription near
the end of 2001, there was no practical means for recovering the
portion that was missed. There are also shorter gaps from September
20, 2002 to October 2, 2002, from August 6, 2003 to September 10,
2003, and from February 13, 2008 through February 27, 2008.

During 2007, LDC's AFP feed switched to a new delivery method.
Although the data content appears to be fairly consistent with
previously collected content, LDC has not done detailed analysis to
determine the level of consistency.

Apart from these issues, the AFP_ENG content shows a high degree of
internal consistency (relative to APW_ENG and NYT_ENG), in terms of
day-to-day content and typographic conventions.

A bug was discovered in LDC's script that processes data from AFP;
as a consequence, a number of documents received duplicate IDs. In
this publication, documents were automatically re-assigned IDs to
ensure uniqueness. The file afp_eng_reassigned_ids.tab is a
tab-delimited file indicating the documents which had their IDs
reassigned.
The first line of the file contains a header row with the field name
labels:

  original_id    - The original ID that was assigned to the document
  replacement_id - The automatically generated replacement ID that
                   was assigned to the document

APW_ENG:

This service provides up to six other languages besides English on
the same modem connection, with DOCs in all languages interleaved at
random; of course, we have extracted just the English content for
publication here. The service draws news from quasi-independent
offices around the world, so there tends to be more variability here
in terms of typographic conventions; there is also a noticeably
higher percentage of non-story content, especially in the "other"
category: tables of sports results, stocks, weather, etc.

During the period between August 1999 and August 2001, the modem
service failed to deliver English content, while data in other
languages continued to flow in. (LDC was spooling the data
automatically, and during this period, alarms would be raised only if
the data flow stopped completely -- so the absence of English went
unnoticed.) On learning of this gap in the data, we were able to
recover much of the missing content with help from AP's New York City
office and from Richard Sproat at AT&T Labs -- we gratefully
acknowledge their assistance. Both were able to supply bulk archives
that covered most of the period that we had missed. In particular,
August - November 1999 and January - September 2000 were retrieved
from USENET/ClariNet and web archives that AT&T had collected for its
own research use, while the October 2000 - August 2001 data were
supplied by AP directly from their own web service archive. As a
result of the varying sources, these sub-parts of the APW_ENG data
tend to differ from the rest of the collection (and from each other)
in terms of daily quantity, extent of typographic variance, and
possibly the breadth of subject matter being reported.

Among the data added in this edition, the data from January 2004 were
particularly noisy due to transmission errors. We have removed
documents containing explicit noise from this month.

Starting in May 2004, APW switched to a dedicated internet delivery
system, eliminating the problems of modem noise and also creating a
much better environment for limiting or avoiding duplicate content in
stories. This system of collection continued to operate until the end
of August 2006. At that point, there was a brief lapse in the
collection (roughly the first half of September 2006 is missing from
our archives), and then data reception switched to the "Network News
Transfer Protocol" (NNTP, related to Usenet transmission). Under this
delivery method, we found that many stories were being delivered two
or three times each, but it has proven fairly easy to remove these
duplications.

CNA_ENG:

The amount of data for this source is relatively small compared to
the other sources. This data set has been delivered to the LDC via
internet transfer. As a result, we avoided many of the problems that
commonly afflict newswire data collected over modems. There is a
large gap of 16 months, from April 2002 to July 2003, in this data
set.

When this source was first released in Gigaword English II, the data
had been incorrectly assumed to be ASCII only, and when non-ASCII
bytes were found, they were simply removed.

In preparing the current release, we found that the CNA source data
actually used the Big-5 ("Traditional Chinese") character set in
various irregular ways, usually to render "full-width" variants of
ASCII letters, digits and punctuation. The approach taken in the
previous release caused many of these "wide" characters to end up as
data corruption, particularly when the second byte of the Big-5 wide
character happened to fall in the ASCII range (which is common for
the Big-5 "full-width" versions of ASCII characters). For the current
release, all the CNA data has been reprocessed from original sources
and correctly converted from Big-5 to UTF-8; where appropriate, we
have normalized the "full-width" character variants to their
corresponding ASCII equivalents.

LTW_ENG:

There is a gap of about 62 months (mid-June 1998 through early August
2003) during which the LDC had dropped its subscription. The data
were collected via dedicated modem up until March 2004, at which
point the delivery was switched to e-mail transmission, eliminating
data loss due to modem noise. The effect of the transmission change
on duplicated material has not been determined, but this source has
tended to show a relatively low degree of duplication.

LTW provides not only the content that is specific to the daily
newspapers published in Los Angeles and Washington, D.C., but also a
sampling of newspaper content from other papers in other cities.

Please be aware that, in order to ensure that the corpus fits onto
DVDs, the contents of LTW_ENG were split across discs 1 and 2. The
first disc contains data through 1998; the second disc contains data
from 2003 onward.

Finally, please note that the LTW wire service ceased to exist at the
end of 2009 and was replaced with Washington Post/Bloomberg
(WPB_ENG), which is described below.

WPB_ENG:

The Washington Post/Bloomberg source is a successor to LTW_ENG,
described above. The content is limited to stories published by the
Washington Post or on Bloomberg's news service.

NYT_ENG:

Prior to 2003, there had been only a few scattered service
interruptions for NYT_ENG, and these typically involved gaps of a few
days (the longest was about two weeks). However, there was a time
period, from February 2003 to June 2004, in which pervasive modem
noise induced a significant amount of character data corruption,
affecting the control-character story-boundary markers as well as the
text content of the stories themselves. We have filtered out
documents that showed explicit evidence of corruption. As a result,
there is a smaller number of documents in this time period. In
particular, no data from June 2004, and very little data from May
2004, are included in this release. Also, even after filtering out
stories that showed explicit evidence of corruption (invalid
sequences of story-boundary control codes, occurrences of
inappropriate byte values), there are still likely to be
"non-explicit" cases of data corruption in the stories that remain
for this time period.

Despite a shift to internet-based delivery, NYT_ENG continues to
experience corruption similar to modem line noise. For example, this
paragraph from NYT_ENG_20091101.0031:

The filing marks the culmination of months of bargaining among CIT, its creditors and the federal government over the company's fate. Bank regulators concluded over the summer that even though CIT was vital to many small businesses that needed financing, the company's proble?; "1? ??\ Z ] \ ?¢ ?\?]Y[?ê\?? Z [ Yâ? Z H??Ö\??\Ü] ê£escuesofCitigroupandBankofAmerica.

Analysis of the NYT data from 2009 and 2010 produced a set of
characters which appear only in the kind of noise described above:

  Codepoint  Name
  ---------  -------------------------------------
  U+00A5     YEN SIGN
  U+00C9     LATIN CAPITAL LETTER E WITH ACUTE
  U+00D1     LATIN CAPITAL LETTER N WITH TILDE
  U+00D6     LATIN CAPITAL LETTER O WITH DIAERESIS
  U+00DC     LATIN CAPITAL LETTER U WITH DIAERESIS
  U+00E0     LATIN SMALL LETTER A WITH GRAVE
  U+00E5     LATIN SMALL LETTER A WITH RING ABOVE
  U+00EA     LATIN SMALL LETTER E WITH CIRCUMFLEX
  U+00EF     LATIN SMALL LETTER I WITH DIAERESIS
  U+00FB     LATIN SMALL LETTER U WITH CIRCUMFLEX
  U+00FF     LATIN SMALL LETTER Y WITH DIAERESIS

To assist Gigaword users in avoiding documents with the type of
corruption described above, LDC has provided a list of DOC IDs and
paragraph numbers containing one or more of these characters. This
list is in nyt_noisy_paragraphs.tab in the docs directory. The file
is tab-delimited, and the first row contains column headers;
paragraph numbers are relative to a particular document and are
1-based (i.e., the first paragraph in a document is 1, the second is
2, and so on). Be aware that the list and the problematic characters
described above likely do not cover all instances of corrupt data in
NYT_ENG.

During the preparation of the corpus, we found a number of what
appear to be incorrectly encoded characters, typically outside of the
ASCII range. Where these errors were systematic, we corrected the
data by substituting the appropriate characters. Unfortunately, in
many instances, the substitutions are non-systematic (the most common
replacement being an ASCII question mark, "?"), and automatic
replacement was not practical.

It should be noted that NYT_ENG documents from 16 days in July 2002
-- all odd-numbered days -- have been intentionally excluded from
this collection in order to satisfy a contractual agreement with a
partner site.

The NYT_ENG service provides not only the content that is specific to
the New York Times daily newspaper publication, but also a wide and
varied sampling of news and features from other urban and regional
newspapers around the U.S., including:

  Albany Times Union
  Arizona Republic
  Atlanta Constitution
  Bloomberg Business News
  Boston Globe
  Casper (Wyo.) Star-Tribune
  Chicago Sun-Times
  Columbia News Service
  Cox News Service
  Fort Worth Star-Telegram
  Hearst Newspapers
  Houston Chronicle
  International Herald Tribune
  Kansas City Star
  Los Angeles Daily News
  San Antonio Express-News
  San Francisco Chronicle
  Seattle Post-Intelligencer
  States News Service

Typically, the actual source of a given DOC was indicated in the raw
data via an abbreviation (e.g. AZR, BLOOM, COX, LADN, NYT, SPI, etc.)
at the end of the "slug" line that accompanies every story. (The
"slug" is a short string, usually less than 40 characters, that news
editors use to tag and sort stories and topics over the course of a
day.) Because this feature of the NYT_ENG slug lines is quite
consistent and informative, the markup strategy was adapted to make
sure that the full slug line would be included as part of the content
of the "DATELINE" tag whenever possible. (Slugs were either not
present or not retained in the other newswire sources.) Some
examples:

  TEMPE, Ariz. (BC-FIESTA-BLOCK-AZR)
  LOS ANGELES (BC-BKN-LAKERS-ONEAL-LADN)
  NEW YORK (BC-NY-NEWYEAR-ART-1STLD-WRITETHRU-675&ADD-NYT)
  (BC-OBIT-KENNEDY-NYT)

The first three examples are cases where the opening paragraph had a
dateline string; in the fourth, the opening paragraph had no
dateline. The slug is normally ALL-CAPS-AND-HYPHENS (this is how it
is presented by the newswire service -- there are some exceptions, of
course, and the occasional glitch); it is always preceded by a space
and an open parenthesis, and always followed by a close parenthesis.
Meanwhile, the dateline string taken from the first paragraph (when
present) is always presented first on the line, with no initial
space; it can be mixed-case, may have multiple word tokens, and may
have punctuation.

Features of text formatting, style and subject matter may vary
somewhat according to the original source. Overall, NYT_ENG shows the
largest amount of "advisory" content, both in terms of how many DOCs
are addressed specifically to the receiving news editors, and in
terms of additional "advice" included within regular news stories,
e.g. "(STORY CAN END HERE. OPTIONAL MATERIAL FOLLOWS)".

In 2008, it was brought to our attention that the New York Times
newswire text archive in this corpus contained some articles in
Spanish. After doing a complete scan of the 149 monthly data files
under "nyt_eng", we identified 2517 DOC elements with the
'type="story"' attribute where the story content was in Spanish. In
the process, we also found 421 DOC elements with the 'type="story"'
attribute where the text content was in fact not a news story. We
have added two additional files to the LDC's Online Documentation set
for this corpus (available from the "Online documentation: yes" link
on the catalog web page for LDC2011T07):

  other.file-doc.map
  spanish.file-doc.map

The first map file lists the file names and DOC ID strings for the
421 DOC elements that were incorrectly labeled as 'type="story"'; the
second lists the file names and DOC ID strings for the 2517 DOC
elements containing Spanish text. Users of the corpus who focus on
the "story" classification for their work may want to eliminate the
listed DOC IDs from consideration in their processing of the data.

The affected DOC elements were also present in previous releases of
Gigaword English, to the extent that the dates of the affected DOCs
fall within the time spans covered by the earlier versions of the
corpus.

We would like to express our gratitude to Paul Cook of the Department
of Computer Science at the University of Toronto for bringing this
problem to our attention.

XIN_ENG:

The Xinhua English news archive provided fairly consistent formatting
and coverage spanning 1995 through 2004, making it fairly easy to
prepare for research use. Many stories have the distinct flavor of an
official government information source, in contrast to the other news
services represented here. The material is otherwise unremarkable.
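
(To make use of the supplementary files described above, the
following minimal, unofficial sketch loads afp_eng_reassigned_ids.tab
and nyt_noisy_paragraphs.tab and checks a paragraph against the noise
characters listed in the NYT_ENG section; the paths shown are
hypothetical, and the only column names assumed are those documented
in this README.)

    import csv

    # Characters observed only in corrupted NYT_ENG paragraphs (see
    # the codepoint table in the NYT_ENG section above).
    NYT_NOISE_CHARS = set("\u00A5\u00C9\u00D1\u00D6\u00DC\u00E0"
                          "\u00E5\u00EA\u00EF\u00FB\u00FF")

    def looks_noisy(paragraph):
        """True if a paragraph contains any character associated above
        with NYT_ENG transmission corruption."""
        return any(ch in NYT_NOISE_CHARS for ch in paragraph)

    def load_afp_id_map(path):
        """Read afp_eng_reassigned_ids.tab (tab-delimited, header row
        with 'original_id' and 'replacement_id' fields) into a dict
        mapping original IDs to their replacements."""
        with open(path, newline="", encoding="utf-8") as f:
            rows = csv.DictReader(f, delimiter="\t")
            return {r["original_id"]: r["replacement_id"] for r in rows}

    def load_noisy_paragraphs(path):
        """Read nyt_noisy_paragraphs.tab (tab-delimited, header row;
        each row gives a DOC ID and a 1-based paragraph number) as a
        list of row dictionaries keyed by the file's own headers."""
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f, delimiter="\t"))

    # Example usage (hypothetical paths under the "docs" directory):
    #   id_map = load_afp_id_map("docs/afp_eng_reassigned_ids.tab")
    #   noisy  = load_noisy_paragraphs("docs/nyt_noisy_paragraphs.tab")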

ADDITIONAL INFORMATION AND UPDATES
----------------------------------

Additional information, updates, and bug fixes may be available in
the LDC catalog entry for this corpus (LDC2011T07) at:

  http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07

--------------------------------------------------------------------------
Original README file written by David Graff, January 2003
Updated by Junbo Kong and Kazuaki Maeda for the Second Edition, June 2005
Updated by David Graff for the Third Edition, May 2007
Updated by Robert Parker for the Fourth Edition, April 2009
Updated by Robert Parker for the Fifth Edition, April 2011

Linguistic Data Consortium