README File for the GIGAWORD CHINESE TEXT CORPUS
================================================
                 Fourth Edition
                 =============
                   LDC2009T27
                 =============

INTRODUCTION
------------

The Gigaword Chinese Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC), at the University of Pennsylvania. This is the fourth edition of the Gigaword Chinese Corpus. This edition includes all of the contents of the previous edition of the Chinese Gigaword corpus (LDC2007T38) as well as new data collected after the publication of that edition. In addition, four new sources have been added: Central News Service, Guangming Daily, People's Daily, and People's Liberation Army Daily.

The eight distinct international sources of Chinese newswire included in this edition are the following:

  - Agence France Presse (afp_cmn)
  - Central News Agency, Taiwan (cna_cmn)
  - Central News Service (cns_cmn)
  - Guangming Daily (gmw_cmn)
  - People's Daily (pda_cmn)
  - People's Liberation Army Daily (pla_cmn)
  - Xinhua News Agency (xin_cmn)
  - Zaobao Newspaper (zbn_cmn)

The seven-character codes in the parentheses above are used for the directory names and data files for each source, and are also used (in ALL_CAPS) as part of the unique DOC "id" string assigned to each news article.

WHAT'S NEW IN THE FOURTH EDITION
--------------------------------

Two years' worth of new articles (January 2007 through December 2008) have been added to the Xinhua, Agence France Presse, and CNA data sets.

Four new data sources have been added: Guangming Daily, Central News Service, People's Daily, and People's Liberation Army Daily, covering a timespan from November 2006 through December 2008.

CHARACTER ENCODING
------------------

The original data received by the LDC from AFP, People's Liberation Army Daily, Xinhua, and Zaobao were encoded in GB-2312, those from CNA were in Big-5, and those from GMW, CNS, and People's Daily were in a combination of GB-2312 and GB-18030. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding.

Researchers who have concerns about the comparability and compatibility of text data from GB and Big-5 sources should consult The Unicode Standard (published by the Unicode Consortium, http://www.unicode.org), paying special attention to Chapter 10, "East Asian Scripts", and Appendix A, "Han Unification History".

Owing to the use of UTF-8, the SGML tagging within each file (described in detail in the next section) shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters.

Both Big-5 and GB are designed to support ASCII single-byte character data as well as 2-byte Chinese characters; in addition, each of these coding standards has a section of the 2-byte character space devoted to "full-width" renderings of the printable ASCII characters. For example, the digits 0-9 can be presented as either single-byte ASCII codes or as 2-byte full-width codes, as shown in the following table:

    Digit       ASCII     GB 2-byte     Big-5 2-byte
    Character   byte      code-point    code-point
    --------------------------------------------------
        0       0x30      0xA3B0        0xA2AF
        1       0x31      0xA3B1        0xA2B0
        2       0x32      0xA3B2        0xA2B1
        3       0x33      0xA3B3        0xA2B2
        4       0x34      0xA3B4        0xA2B3
        5       0x35      0xA3B5        0xA2B4
        6       0x36      0xA3B6        0xA2B5
        7       0x37      0xA3B7        0xA2B6
        8       0x38      0xA3B8        0xA2B7
        9       0x39      0xA3B9        0xA2B8

and similarly for the upper- and lower-case alphabet characters, brackets, quotation marks and punctuation.
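
As an illustration only, the full-width byte values in the two legacy encodings can be checked with the codecs built into Python, and full-width forms can be folded to ASCII with a fixed offset in Unicode (the block U+FF01 through U+FF5E mirrors ASCII 0x21 through 0x7E). This is a minimal sketch, not the conversion code that was used to build the corpus; the function names "legacy_bytes" and "fold_fullwidth" are hypothetical.

```python
def legacy_bytes(ch, codec):
    """Return the bytes of one character in a legacy codec, as hex text."""
    return ch.encode(codec).hex().upper()


def fold_fullwidth(text):
    """Replace full-width forms (U+FF01..U+FF5E) with their ASCII equivalents.

    Assumption: only this Unicode block is folded; every other character,
    including all Chinese text, is passed through unchanged.
    """
    return "".join(
        chr(ord(c) - 0xFEE0) if 0xFF01 <= ord(c) <= 0xFF5E else c
        for c in text
    )


if __name__ == "__main__":
    fullwidth_zero = "\uff10"  # full-width digit zero
    print(legacy_bytes(fullwidth_zero, "gb2312"))
    print(legacy_bytes(fullwidth_zero, "big5"))
    # Four full-width digits followed by the Chinese character for "year".
    print(fold_fullwidth("\uff12\uff10\uff10\uff18\u5e74"))
```

Folding by code-point offset keeps the sketch independent of the source encoding: the data only need to be decoded to Unicode first.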

We found that some sources showed evidence of somewhat free variation between single- and two-byte forms when presenting alphanumerics, etc., within the text data. Although the Unicode Standard devotes an analogous portion of its code table to these full-width characters, we decided instead to eliminate this form of variation in the data: wherever the original data contained 2-byte versions of characters having exact correlates in the single-byte ASCII table, we replaced the 2-byte character with the single-byte ASCII equivalent.

As a result, many lines of text data contain a mix of multi-byte Chinese and single-byte ASCII content. Of course, since all the data is now presented in UTF-8 encoding, this mixture is a natural property of the data, which any UTF-8-aware process will handle without difficulty.

We also found that all sources use a handful of "accented" alphabetics and other special characters common to European character sets. When converted to UTF-8, these characters assume their "normal" places in the Unicode table -- e.g. the "raised circle", used as a "degrees" mark in temperatures or latitude/longitude coordinates, can be found in the Xinhua data rendered as U+00B0 (which in UTF-8 form comes out as the two-byte sequence 0xC2 0xB0). Apart from these rare cases, all characters in the text are either single-byte ASCII or multi-byte Chinese.

DATA FORMAT AND SGML MARKUP
---------------------------

Each data file name consists of the 7-character prefix (e.g., xin_cmn), an underscore character ('_'), a 6-digit date (representing the year and month during which the file contents were originally published by the respective news source), and a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
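
As an illustration only, the file naming convention can be handled in Python roughly as follows. This is a sketch under stated assumptions: the function names "parse_file_name" and "read_lines" are hypothetical, the 6-digit date is assumed to be laid out as YYYYMM, and the sample file name merely follows the pattern described here.

```python
import gzip
import os
import re

# Assumed pattern: three lower-case letters, "_cmn", "_", YYYYMM, ".gz".
FILE_NAME = re.compile(r"^([a-z]{3}_cmn)_(\d{4})(\d{2})\.gz$")


def parse_file_name(path):
    """Split a data file name into (source, year, month)."""
    match = FILE_NAME.match(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected file name: %r" % path)
    source, year, month = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError("month out of range in file name: %r" % path)
    return source, year, month


def read_lines(path):
    """Yield the lines of one gzip-compressed UTF-8 file, without newlines."""
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n")


if __name__ == "__main__":
    print(parse_file_name("xin_cmn_200301.gz"))
```

Reading through a generator keeps memory use flat, which matters because a single month of data can be tens of megabytes when uncompressed.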

All text data are presented in SGML form, using a very simple, minimal markup structure. The file "gigaword_c.dtd" in the "docs" directory provides the formal "Document Type Declaration" for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.

The markup structure, common to all data files, can be summarized as follows:

  <DOC id="SRC_CMN_YYYYMMDD.NNNN" type="string">
  <HEADLINE>
  The Headline Element is Optional -- not all DOCs have one
  </HEADLINE>
  <DATELINE>
  The Dateline Element is Optional -- not all DOCs have one
  </DATELINE>
  <TEXT>
  <P>
  Paragraph tags are only used if the 'type' attribute of the DOC happens to be "story" -- more on the 'type' attribute below...
  </P>
  <P>
  Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 40 characters or less.
  </P>
  </TEXT>
  </DOC>
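
As an illustration only, the markup can be read with a simple line-based loop, relying on properties this README states about the data: every tag sits alone on its own line, DOC attribute values are double-quoted, and only three entity references occur. This is a minimal sketch, not the LDC extraction software. The names "iter_docs", "parse_doc_id" and "unescape" are hypothetical, and the sketch assumes that "id" precedes "type" in the opening DOC tag.

```python
import re

# Assumption: "id" comes before "type" in the opening DOC tag.
DOC_OPEN = re.compile(r'^<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>$')
# Source ID, 8-digit date (YYYYMMDD), period, 4-digit sequence number.
DOC_ID = re.compile(r"^([A-Z]{3}_CMN)_(\d{4})(\d{2})(\d{2})\.(\d{4})$")
CLOSERS = ("</HEADLINE>", "</DATELINE>", "</TEXT>", "</P>")


def unescape(line):
    """Decode the three entity references; "&amp;" is decoded last."""
    return line.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")


def parse_doc_id(doc_id):
    """Split a DOC id into (source, year, month, day, sequence number)."""
    match = DOC_ID.match(doc_id)
    if match is None:
        raise ValueError("unexpected DOC id: %r" % doc_id)
    source, year, month, day, seq = match.groups()
    return source, int(year), int(month), int(day), int(seq)


def iter_docs(lines):
    """Yield one dict per DOC from an iterable of text lines."""
    doc, target = None, None
    for raw in lines:
        line = raw.rstrip("\n")
        opened = DOC_OPEN.match(line)
        if opened:
            doc = {"id": opened.group(1), "type": opened.group(2),
                   "headline": [], "dateline": [], "paragraphs": []}
            target = None
        elif doc is None:
            continue  # ignore anything outside a DOC element
        elif line == "</DOC>":
            # Drop the empty placeholder created by <TEXT> in "story" DOCs.
            doc["paragraphs"] = [p for p in doc["paragraphs"] if p]
            yield doc
            doc, target = None, None
        elif line == "<HEADLINE>":
            target = doc["headline"]
        elif line == "<DATELINE>":
            target = doc["dateline"]
        elif line in ("<TEXT>", "<P>"):
            # Non-story DOCs have no <P> tags, so <TEXT> opens a block too.
            doc["paragraphs"].append([])
            target = doc["paragraphs"][-1]
        elif line in CLOSERS:
            target = None
        elif target is not None:
            target.append(unescape(line))
```

Each yielded dict keeps text as lists of lines rather than joined strings, so the caller decides how wrapped lines should be rejoined.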

For every "opening" tag (DOC, HEADLINE, DATELINE, TEXT, P), there is a corresponding "closing" tag -- always. The attribute values in the DOC tag are always presented within double-quotes; the "id=" attribute of DOC consists of the 7-character source and language ID (in CAPS), an underscore, an 8-digit date string representing the date of the story (YYYYMMDD), a period, and a 4-digit sequence number starting at "0001" for each date (e.g. "XIN_CMN_19950101.0001"); in this way, every DOC in the corpus is uniquely identifiable by the id string.

Every SGML tag is presented alone on one line, separate from other tags, and from the text content (so a simple process like the UNIX "grep -v '<'" will eliminate all tags, and retain all the text content).

The structure shown above represents some notable differences relative to the markup strategy employed in previous LDC text corpora; these are intended to facilitate bulk processing of the present corpus. The major differences are:

  - Earlier corpora usually organized the data as one file per day, or limited the average file size to one megabyte (MB). Typical compressed file sizes in the current corpus range from about 500 KB (2003 CNA data) to about 10 MB (2001-2 CNA data); this equates to a range of about 1.5 to 27 MB when the data are uncompressed. In general, these files are not intended for use with interactive text editors or word processing software (though many such programs are likely to work reasonably well with these files). Rather, it's expected that the files will be used as input to programs that are geared to dealing with data in such quantities, for filtering, conditioning, indexing, statistical summary, etc. (The LDC can provide open source software, mostly written in Perl, for extracting DOCs from such data files, using the "id" string or other search criteria for story selection; see http://www.ldc.upenn.edu/Using/ .)

  - Earlier corpora tended to use different markup outlines (different tag sets) depending on the source of the data, because different sources came to us with different structural properties, and we had chosen to preserve these as much as possible (even though many elements of the delivered structure may have been meaningless for research use). The present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs). The "dateline" is a brief string typically found at the beginning of the first paragraph in each news story, giving the location the report is coming from, and sometimes the news service and/or date; since this content is not part of the initial sentence, we separate it from the first paragraph (this was not done in previous corpora).

  - Earlier corpora tended to include "custom" SGML entity references, which were intended to preserve things like special punctuation or typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc). The present corpus uses only three SGML entity references:

      - ``&amp;'' represents the literal ampersand "&" character
      - ``&lt;'' represents the literal open-angle bracket "<"
      - ``&gt;'' represents the literal close-angle bracket ">"

    All other specialized control characters have been filtered out.

  - In earlier corpora, newswire data were presented as streams of undifferentiated "DOC" units; depending on the source and corpus, varying amounts of quality checking and filtering were done to eliminate noisy or unsuitable content (e.g. test messages). For this release, all sources have received a uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types". The classification is indicated by the `` type="string" '' attribute that is included in each opening ``DOC'' tag.

    The four types are:

    * story : This is by far the most frequent type, and it represents the most typical newswire item: a coherent report on a particular topic or event, consisting of paragraphs and full sentences. As indicated above, the paragraph tag "<P>" is found only in DOCs of this type; in the other types described below, the text content is rendered with no additional tags or special characters -- just lines of ASCII tokens separated by whitespace.

    * multi : This type of DOC contains a series of unrelated "blurbs", each of which briefly describes a particular topic or event; this is typically applied to DOCs that contain "summaries of today's news", "news briefs in ... (some general area like finance or sports)", and so on. Each paragraph-like blurb by itself is coherent, but it does not bear any necessary relation of topicality or continuity relative to its neighbors.

    * advis : (short for "advisory") These are DOCs which the news service addresses to news editors -- they are not intended for publication to the "end users" (the populations who read the news); as a result, DOCs of this type tend to contain obscure abbreviations and phrases, which are familiar to news editors, but may be meaningless to the general public. We also find a lot of formulaic, repetitive content in DOCs of this type (contact phone numbers, etc).

    * other : This represents DOCs that clearly do not fall into any of the above types -- in general, items of this type are intended for broad circulation (they are not advisories), they may be topically coherent (unlike "multi" type DOCs), and they typically do not contain paragraphs or sentences (they aren't really "stories"); these are things like lists of sports scores, stock prices, temperatures around the world, and so on.

The general strategy for categorizing DOCs into these four classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the three "non-story" types, and to apply the appropriate label for the ``type=...'' attribute whenever the DOC displayed one of these specific clues. When none of the known clues was in evidence, the DOC was classified as a "story".
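
As an illustration only: because the type is recorded as an attribute of each opening DOC tag, and every tag sits alone on its own line, the number of DOCs of each type in a file can be tallied by scanning those lines alone. The function name "count_types" is hypothetical, and this sketch is not the procedure used to produce the statistics distributed with the corpus.

```python
import re
from collections import Counter

# Matches the type attribute anywhere inside an opening DOC tag.
TYPE_ATTR = re.compile(r'^<DOC\b[^>]*\btype="([^"]+)"')


def count_types(lines):
    """Count DOCs per 'type' value from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        match = TYPE_ATTR.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A caller interested only in ordinary news reports could use the same pattern to skip every DOC whose type is not "story".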

This means that the most frequent classification error will tend to be the use of `` type="story" '' on DOCs that are actually some other type. But the number of such errors should be fairly small, compared to the number of "non-story" DOCs that are correctly tagged as such.

Note that the markup was applied algorithmically, using logic that was based on less-than-complete knowledge of the data. For the most part, the HEADLINE, DATELINE and TEXT tags have their intended content; but due to the inherent variability (and the inevitable source errors) in the data, users may find occasional mishaps where the headline and/or dateline were not successfully identified (hence show up within TEXT), or where an initial sentence or paragraph has been mistakenly tagged as the headline or dateline.

DATA QUANTITIES
---------------

The "docs" directory contains a set of plain-text tables (datastats_*) that describe the quantities of data by source and month (i.e. by file), broken down according to the four "type" categories. The overall totals for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are uncompressed; the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are actually the number of Chinese characters (there is no notion of "space separated word tokens" in Chinese, and for these tallies, we are not counting ASCII or other non-Chinese characters in the data):

  Source    #Files  Gzip-MB  Totl-MB     K-wrds      #DOCs
  afp_cmn       99       72      187      51590     134293
  cna_cmn      188     1394     3077    1014539    2284971
  cns_cmn       26      233      525     164474     236745
  gmw_cmn       26       59      121      41886      45791
  pda_cmn       26      253      312     109657     160098
  pla_cmn       26       82      183      58735      90019
  xin_cmn      192      920     2212     647735    1373448
  zbn_cmn       13       44      100      30006      45235
  TOTAL        596     3057     6717    2118622    4370600

The following tables present "K-wrds" (i.e. thousands of Chinese characters) and "#DOCS" broken down by source and DOC type:

                # DOCS    Kwords    TextKB
  advis
    afp_cmn          0         0         0
    cna_cmn       9545       801      2957
    cns_cmn          0         0         0
    gmw_cmn          0         0         0
    pda_cmn          0         0         0
    pla_cmn          0         0         0
    xin_cmn       6971       726      2391
    zbn_cmn          0         0         0
    TOTAL        16516      1527      5348

  multi
    afp_cmn          0         0         0
    cna_cmn      40338     31131    102843
    cns_cmn          0         0         0
    gmw_cmn          0         0         0
    pda_cmn          0         0         0
    pla_cmn          0         0         0
    xin_cmn      11428      7518     23818
    zbn_cmn        105       186       596
    TOTAL        51871     38835    127257

  other
    afp_cmn          0         0         0
    cna_cmn     123631     48446    182043
    cns_cmn          0         0         0
    gmw_cmn          0         0         0
    pda_cmn          0         0         0
    pla_cmn          0         0         0
    xin_cmn      40851     14068     51463
    zbn_cmn        279       128       443
    TOTAL       164761     62642    233949

  story
    afp_cmn     134293     51590    163396
    cna_cmn    2111457    934161   2935979
    cns_cmn     236745    164474    523425
    gmw_cmn      45791     41886    132566
    pda_cmn     160098    109657    347947
    pla_cmn      90019     58735    187552
    xin_cmn    1314198    625423   1972065
    zbn_cmn      44851     29692     92456
    TOTAL      4137452   2015618   6355386

GENERAL PROPERTIES OF THE DATA
------------------------------

All of the data sets have been produced from bulk archives that were delivered to the LDC via internet transfer. As a result, we avoided many of the problems that commonly afflict newswire data that has been transmitted over modems. Still, some sources contained noticeable amounts of "noise" (unusable characters, null bytes, etc) which had to be filtered out for research use.

Two of the corpus authors at the LDC, Ke Chen and Junbo Kong, are native speakers of Mandarin Chinese, and did extensive diagnosis to identify and eliminate unsuitable content in the original archival data. To some extent, this is an open-ended problem, and there may be kinds of error conditions that have gone unnoticed or untreated -- this is true of any large text collection -- but we have striven to assure that the characters presented in all files are in fact valid and displayable, and that the markup is fully SGML compliant.

It is often the case that data from a given source contains duplicate copies of documents.
In this version of Gigaword, MD5 sums were generated for the bodies of the newly added documents for each source. Where multiple documents had the same MD5 sum, only the first document (by collection date) was included and the remainder discarded. This has the effect of producing a large number of document ID gaps in some sources.

A file called "removed_duplicates.tab" is included in the documentation for this package. The file is tab-delimited; its first row contains headers labeling the columns:

  removed_id  - the document ID that was removed from the corpus
  retained_id - the document ID that was left in the corpus

Note that for a given retained_id, there may be multiple removed_ids. The removed duplicates table contains 1118544 rows, with the majority of the documents from CNS, GMW and PLA.

SOURCE-SPECIFIC PROPERTIES
--------------------------

- AFP

  For this initial release of AFP Chinese news data, the attempt to classify articles into "story", "multi", "advis" and "other" did not receive as much attention as was given to other sources in the earlier releases. A rapid inspection of the data indicated that AFP does not publish "tabular" articles (listings of weather, stocks, sports scores, etc), so the "other" category is essentially non-existent; also, since the data are conveyed via the web, we do not see the kind of content that would fall under the "advis" category. It is likely that a number of stories should really be called "multi" but have not been identified as such.

- CNA

  In the previous release, there were about 165 empty DOC elements (having no content within the TEXT tags). These DOC elements have been removed; 56 of the cna_cmn files from Edition 2 were affected.

- CNS

  All CNS documents are assumed to be of type story. LDC collects data from CNS' website; owing to flaws in the collection methodology, a significant number of documents were duplicated in the original data pool. 558268 duplicate documents were removed from this release.

- GMW

  All GMW documents are assumed to be of type story. LDC collects data from GMW's website; owing to flaws in the collection methodology, a significant number of documents were duplicated in the original data pool. 193208 duplicate documents were removed from this release.

- PDA

  All PDA documents are assumed to be of type story.

- PLA

  All PLA documents are assumed to be of type story. LDC collects data from PLA's website; owing to flaws in the collection methodology, a significant number of documents were duplicated in the original data pool. 356668 duplicate documents were removed from this release.

- Xinhua

  A bug was discovered in LDC's script that processes data from Xinhua. Consequently, a number of documents received duplicate IDs. In this publication, documents were automatically re-assigned IDs to ensure uniqueness. The file xin_cmn_reassigned_ids.tab is a tab-delimited file indicating the documents which had their IDs reassigned. The first line of the file contains a header row with the field name labels:

    original_id    - the original ID that was assigned to the document
    replacement_id - the automatically generated replacement ID that was assigned to the document

README file written by David Graff and Ke Chen, January 2003
Updated for the Second Edition by Junbo Kong and Kazuaki Maeda, June 2005.
Updated for the Third Edition by Dave Graff, June 2007.
Updated for the Fourth Edition by Robert Parker, August 2009.
Linguistic Data Consortium