README File for the GIGAWORD CHINESE TEXT CORPUS
================================================
Third Edition
=============
INTRODUCTION
------------
The Gigaword Chinese Corpus is a comprehensive archive of newswire
text data that has been acquired over several years by the Linguistic
Data Consortium (LDC), at the University of Pennsylvania. This is the
third edition of the Gigaword Chinese Corpus.
This edition includes all of the contents in the previous edition of
the Chinese Gigaword corpus (LDC2005T14) as well as new data collected
after the publication of that edition. Also, an archive of articles
from a new newswire source (Agence France Presse) has been added in
this edition.
The four distinct international sources of Chinese newswire included
in this edition are the following:
- Agence France Presse (afp_cmn)
- Central News Agency, Taiwan (cna_cmn)
- Xinhua News Agency (xin_cmn)
- Zaobao Newspaper (zbn_cmn)
The seven-character codes in the parentheses above are used for the
directory names and data files for each source, and are also used (in
ALL_CAPS) as part of the unique DOC "id" string assigned to each news
article.
WHAT'S NEW IN THE THIRD EDITION
-------------------------------
Over six years' worth of articles (October 2000 through December 2006)
from Agence France Presse are being released for the first time.
Two years' worth of new articles (January 2005 through December 2006)
have been added to the Xinhua data set.
Nearly two years' worth of content was added to the CNA data set.
There was a gap in the LDC's collection from this source during 2006:
no CNA Chinese content was collected between July 27 and December 17
2006, inclusive, so there are no data files for August through
November of that year, and the December data file is about half its
expected size.
A small set of older stories (October through December 1998) has been
added from Zaobao -- these were previously published by LDC as part of
the TDT3 Multilanguage Text corpus, and are being included in Gigaword
for the first time.
CHARACTER ENCODING
------------------
The original data archives received by the LDC from AFP, Xinhua and
Zaobao were encoded in GB-2312, whereas those from CNA were encoded in
Big-5. To avoid the problems and confusion that could result from
differences in character-set specifications, all text files in this
corpus have been converted to UTF-8 character encoding.
Researchers who have concerns about the comparability and
compatibility of text data from GB and Big-5 sources should consult
The Unicode Standard (published by the Unicode Consortium,
http://www.unicode.org), paying special attention to Chapter 10, "East
Asian Scripts", and Appendix A, "Han Unification History".
Owing to the use of UTF-8, the SGML tagging within each file
(described in detail in the next section) shows up as lines of
single-byte-per-character (ASCII) text, whereas lines of actual text
data, including article headlines and datelines, contain a mixture of
single-byte and multi-byte characters.
Both Big-5 and GB are designed to support ASCII single-byte character
data as well as 2-byte Chinese characters; in addition, each of these
coding standards has a section of the 2-byte character space devoted
to "full-width" renderings of the printable ASCII characters. For
example, the digits 0-9 can be presented as either single-byte ASCII
codes or as 2-byte full-width codes, as shown in the following table:
    Digit       ASCII     GB 2-byte     Big-5 2-byte
  Character     byte      code-point    code-point
  --------------------------------------------------
      0         0x30       0xA3B0        0xA2AF
      1         0x31       0xA3B1        0xA2B0
      2         0x32       0xA3B2        0xA2B1
      3         0x33       0xA3B3        0xA2B2
      4         0x34       0xA3B4        0xA2B3
      5         0x35       0xA3B5        0xA2B4
      6         0x36       0xA3B6        0xA2B5
      7         0x37       0xA3B7        0xA2B6
      8         0x38       0xA3B8        0xA2B7
      9         0x39       0xA3B9        0xA2B8
and similarly for the upper- and lower-case alphabet characters,
brackets, quotation marks and punctuation. We found that both
archives showed evidence of somewhat free variation between single-
and two-byte forms when presenting alphanumerics, etc., within the text
data. Although the Unicode Standard provides an analogous portion of
its code table to these full-width characters, we decided instead to
eliminate this form of variation in the data: wherever the original
data contained 2-byte versions of characters having exact correlates
in the single-byte ASCII table, we replaced the 2-byte character with
the single-byte ASCII equivalent. As a result, many lines of text
data contain a mix of multi-byte Chinese and single-byte ASCII
content. Of course, since all the data is now presented in UTF-8
encoding, this mixture is a natural property of the data, which any
UTF-8-aware process will handle without difficulty.
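
The folding of full-width characters described above can be sketched
in Python. This is only an illustration under stated assumptions, not
the procedure actually used to build the corpus: it assumes the text
has already been decoded to Unicode, where the full-width renderings
of printable ASCII occupy U+FF01 through U+FF5E, each exactly 0xFEE0
above its ASCII counterpart. The name fold_fullwidth is hypothetical.

```python
# Sketch only: fold full-width ASCII variants (U+FF01..U+FF5E) to ASCII.
# fold_fullwidth is a hypothetical helper, not part of any corpus tool.
FULLWIDTH_FIRST = 0xFF01   # FULLWIDTH EXCLAMATION MARK
FULLWIDTH_LAST = 0xFF5E    # FULLWIDTH TILDE
OFFSET = 0xFEE0            # distance from a full-width form to its ASCII twin

# Translation table: full-width code point -> ASCII code point.
_FOLD_TABLE = {
    cp: cp - OFFSET for cp in range(FULLWIDTH_FIRST, FULLWIDTH_LAST + 1)
}


def fold_fullwidth(text: str) -> str:
    """Replace full-width ASCII variants with their single-byte forms."""
    return text.translate(_FOLD_TABLE)


if __name__ == "__main__":
    # Full-width "2006" followed by a Han character, which is left alone.
    sample = "\uff12\uff10\uff10\uff16\u5e74"
    print(fold_fullwidth(sample))
```

Characters outside the full-width range, including all Chinese
characters, pass through unchanged.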
We also found that all sources use a handful of "accented" alphabetics
and other special characters common to European character sets. When
converted to UTF-8, these characters assume their "normal" places in
the Unicode table -- e.g. the "raised circle", used as a "degrees"
mark in temperatures or latitude/longitude coordinates, can be found
in the Xinhua data rendered as U+00B0 (which in UTF-8 form comes out as
the two-byte sequence 0xC2 0xB0). Apart from these rare cases, all
characters in the text are either single-byte ASCII or multi-byte
Chinese.
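
The byte lengths mentioned above are easy to confirm. The short Python
sketch below is illustrative only; the helper name utf8_len and the
sample characters are our own choices, not anything taken from the
corpus.

```python
# Sketch only: show how many UTF-8 bytes each kind of character occupies.
def utf8_len(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


degree = "\u00b0"  # DEGREE SIGN, the "raised circle"

# The degree sign encodes to the two-byte sequence 0xC2 0xB0.
assert degree.encode("utf-8") == b"\xc2\xb0"

# An ASCII letter, the degree sign, and one Han character (U+4E2D).
for ch in ("A", degree, "\u4e2d"):
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```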
DATA FORMAT AND SGML MARKUP
---------------------------
Each data file name consists of the 7-character prefix (e.g., xin_cmn)
and an underscore character ('_') followed by a 6-digit date
(representing the year and month during which the file contents were
originally published by the respective news source), followed by a
".gz" file extension, indicating that the file contents have been
compressed using the GNU "gzip" compression utility (RFC 1952). So,
each file contains all the usable data received by LDC for the given
month from the given news source.
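
The naming rule above can be checked mechanically. The Python sketch
below is a minimal illustration: the function and class names are
hypothetical, the list of prefixes is simply the four source codes
given in this README, and read_text assumes the file is on local disk.

```python
import gzip
import re
from typing import NamedTuple

# Pattern follows the naming rule: source prefix, underscore, YYYYMM, ".gz".
_NAME_RE = re.compile(r"^(afp_cmn|cna_cmn|xin_cmn|zbn_cmn)_(\d{4})(\d{2})\.gz$")


class DataFileName(NamedTuple):
    source: str
    year: int
    month: int


def parse_data_file_name(name: str) -> DataFileName:
    """Split a data file name into source code, year and month."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a corpus data file name: {name!r}")
    source = match.group(1)
    year, month = int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return DataFileName(source, year, month)


def read_text(path: str) -> str:
    """Decompress one data file and decode its contents as UTF-8."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

For example, parse_data_file_name("xin_cmn_200302.gz") yields the
source "xin_cmn", the year 2003 and the month 2.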
All text data are presented in SGML form, using a very simple, minimal
markup structure. The file "gigaword_c.dtd" in the "docs" directory
provides the formal "Document Type Declaration" for parsing the SGML
content. The corpus has been fully validated by a standard SGML
parser utility (nsgmls), using this DTD file.
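
Because the markup is so minimal, a line-oriented reader is often
enough for quick inspection. The Python sketch below is an
illustration under assumptions, not a substitute for validating
against the DTD with an SGML parser: it assumes that each DOC start
tag sits on its own line with "id" and "type" attributes in that
order, and that each element ends with a line containing only the DOC
end tag. The function name and the sample DOC are invented.

```python
import re
from typing import Dict, Iterable, Iterator

# Assumed shape of the opening tag: <DOC id="..." type="...">
_DOC_OPEN = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>')


def iter_docs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per DOC element with its id, type and raw body."""
    current = None
    for line in lines:
        line = line.rstrip("\n")
        opened = _DOC_OPEN.match(line)
        if opened:
            # Start collecting a new DOC element.
            current = {"id": opened.group(1), "type": opened.group(2), "body": []}
        elif line.strip() == "</DOC>" and current is not None:
            yield {
                "id": current["id"],
                "type": current["type"],
                "body": "\n".join(current["body"]),
            }
            current = None
        elif current is not None:
            current["body"].append(line)


# Invented sample; the id and text are placeholders, not corpus content.
SAMPLE = """<DOC id="XXX_CMN_00000000.0000" type="story">
<TEXT>
<P>
placeholder text
</P>
</TEXT>
</DOC>
""".splitlines()

for doc in iter_docs(SAMPLE):
    print(doc["id"], doc["type"])
```

The same generator can be fed the lines of a decompressed data file
in place of SAMPLE.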
The markup structure, common to all data files, can be summarized as
follows:

  <DOC id="..." type="story">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  Paragraph tags are only used if the 'type' attribute of the DOC
  happens to be "story" -- more on the 'type' attribute below...
  </P>
  <P>
  Note that all data files use the UNIX-standard "\n" form of line
  termination, and text lines are generally wrapped to a width of 40
  characters or less.
  </P>
  </TEXT>
  </DOC>

The "type" attribute of each DOC has one of four values:

  * story : ... The paragraph tag "<P>" is found only in DOCs of this
    type; in the other types described below, the text content is
    rendered with no additional tags or special characters -- just
    lines of ASCII tokens separated by whitespace.

  * multi : This type of DOC contains a series of unrelated "blurbs",
    each of which briefly describes a particular topic or event; this
    is typically applied to DOCs that contain "summaries of today's
    news", "news briefs in ... (some general area like finance or
    sports)", and so on. Each paragraph-like blurb by itself is
    coherent, but it does not bear any necessary relation of
    topicality or continuity relative to its neighbors.

  * advis : (short for "advisory") These are DOCs which the news
    service addresses to news editors -- they are not intended for
    publication to the "end users" (the populations who read the
    news); as a result, DOCs of this type tend to contain obscure
    abbreviations and phrases, which are familiar to news editors,
    but may be meaningless to the general public. We also find a lot
    of formulaic, repetitive content in DOCs of this type (contact
    phone numbers, etc.).

  * other : This represents DOCs that clearly do not fall into any of
    the above types -- in general, items of this type are intended
    for broad circulation (they are not advisories), they may be
    topically coherent (unlike "multi" type DOCs), and they typically
    do not contain paragraphs or sentences (they aren't really
    "stories"); these are things like lists of sports scores, stock
    prices, temperatures around the world, and so on.

The general strategy for categorizing DOCs into these four classes
was, for each source, to discover the most common and frequent clues
in the text stream that correlated with the three "non-story" types,
and to apply the appropriate label for the "type=..." attribute
whenever the DOC displayed one of these specific clues. When none of
the known clues was in evidence, the DOC was classified as a "story".
This means that the most frequent classification error will tend to be
the use of type="story" on DOCs that are actually some other type.
But the number of such errors should be fairly small, compared to the
number of "non-story" DOCs that are correctly tagged as such.

Note that the markup was applied algorithmically, using logic that was
based on less-than-complete knowledge of the data. For the most part,
the HEADLINE, DATELINE and TEXT tags have their intended content; but
due to the inherent variability (and the inevitable source errors) in
the data, users may find occasional mishaps where the headline and/or
dateline were not successfully identified (hence show up within TEXT),
or where an initial sentence or paragraph has been mistakenly tagged
as the headline or dateline.

DATA QUANTITIES
---------------
The "docs" directory contains a set of plain-text tables (datastats_*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the four "type" categories. The
overall totals for each source are summarized below. Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed; the "Gzip-MB" column shows totals for compressed file
sizes as stored on the DVD-ROM; the "K-wrds" numbers are actually
thousands of Chinese characters (there is no notion of "space
separated word tokens" in Chinese, and for these tallies, we are not
counting ASCII or other non-Chinese characters in the data):

  Source     #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
  afp_cmn        75       55      143    39180   101341
  cna_cmn       188     1209     3077
  xin_cmn       192      757     1847   540636  1158221
  zbn_cmn        13       44      100    30006    45235
  TOTAL

The following tables present "K-wrds" (i.e. thousands of Chinese
characters) and "#DOCS" broken down by source and DOC type:

                 #DOCS    Kwords    TextKB
  advis
    afp_cmn          0         0         0
    cna_cmn       8725       762      2788
    xin_cmn       6553       690      2279
    zbn_cmn          0         0         0
    TOTAL        15278      1452      5067
  multi
    afp_cmn          0         0         0
    cna_cmn      34926     26228     86873
    xin_cmn      11329      7337     23251
    zbn_cmn        105       186       596
    TOTAL        46360     33751    110720
  other
    afp_cmn          0         0         0
    cna_cmn     106825     41895    154420
    xin_cmn      36900     12831     46218
    zbn_cmn        279       128       443
    TOTAL       144004     54854    201081
  story
    afp_cmn     101341     39180    124503
    cna_cmn    1864129    816280   2558442
    xin_cmn    1103439    519775   1636650
    zbn_cmn      44851     29692     92456
    TOTAL      3113760   1404927   4412051

GENERAL PROPERTIES OF THE DATA
------------------------------
All of the data sets have been produced from bulk archives that were
delivered to the LDC via internet transfer. As a result, we avoided
many of the problems that commonly afflict newswire data that has been
transmitted over modems. Still, both archives contained noticeable
amounts of "noise" (unusable characters, null bytes, etc.) which had to
be filtered out for research use.

Two of the corpus authors at the LDC, Ke Chen and Junbo Kong, are
native speakers of Mandarin Chinese, and did extensive diagnosis to
identify and eliminate unsuitable content in the original archival
data. To some extent, this is an open-ended problem, and there may be
kinds of error conditions that have gone unnoticed or untreated --
this is true of any large text collection -- but we have striven to
assure that the characters presented in all files are in fact valid
and displayable, and that the markup is fully SGML compliant.

SOURCE-SPECIFIC PROPERTIES
--------------------------
- AFP

For this initial release of AFP Chinese news data, the attempt to
classify articles into "story", "multi", "advis" and "other" did not
receive as much attention as was given to other sources in the earlier
releases.
A rapid inspection of the data indicated that AFP does not publish
"tabular" articles (listings of weather, stocks, sports scores, etc.),
so the "other" category is essentially non-existent; also, since the
data are conveyed via the web, we do not see the kind of content that
would fall under the "advis" category. It is likely that a number of
stories should really be called "multi" but have not been identified
as such.

- CNA

In the previous release, there were about 165 empty DOC elements
(having no content within the TEXT tags). These DOC elements have been
removed; 56 of the cna_cmn files from Edition 2 were affected.

- Xinhua

In preparing the new Xinhua content for this release, we noticed a
single story being repeated verbatim at least once per day (with
differences in date numerics only); on discovering that this repeated
content first appeared in the xin_cmn_200302 data file (and occurred
every day from that point on), we kept the first instance of the
article, and removed all subsequent instances. This is the only case
where content from the previous release has been removed from the
current one.

README file written by David Graff and Ke Chen, January 2003
Updated for the Second Edition by Junbo Kong and Kazuaki Maeda, June 2005.
Linguistic Data Consortium