Documentation for TREC Mandarin

Introduction

This publication contains the TREC ("Text REtrieval Conference") Mandarin Corpus produced by the Linguistic Data Consortium (LDC); catalog number LDC2000T52, ISBN 1-58563-178-7. These documents were used for the Chinese task in TRECs 5-6 and consist of approximately 170 megabytes of articles drawn from the People's Daily newspaper and the Xinhua newswire, formatted to include TREC document ids. The text is Mandarin Chinese and is encoded using the GB encoding scheme. The topics (questions) and relevance judgments (right answers) that complete the test collections can be downloaded from the Data/Non-English section of the TREC web site.

The Mandarin Chinese text data is from the Xinhua News Agency and the People's Daily News Service (both from mainland China). This collection of text was originally gathered by the Linguistic Data Consortium (LDC), and then adapted by the National Institute of Standards and Technology (NIST) for use in the TREC Mandarin evaluation program.

Additional information, updates, and other addenda may be available at the LDC catalog entry for this corpus at LDC2000T52.

Data Structure and Problems

The peoples-daily data consists of 36 data files, covering the period January 1991 through December 1993 (inclusive; one data file per month). The xinhua content consists of 41 files, covering the period April 1994 through September 1995 (normally between one and three files per month; however, three months are missing: November 1994, December 1994, and June 1995).

Please look at file.tbl for the directory structure of this publication, as well as a complete list of files.

All data are provided in the form of SGML-tagged text files, where the SGML tags are presented as 8-bit ASCII-encoded text strings, and the Mandarin Chinese text content is presented as 16-bit GB-encoded text. Two validation files, pd.dtd and xinhua.dtd, are provided; they define the SGML tagging for each source and can be used with an SGML parser to process the data. (Note that the SGML tagging is fairly simple in design, so that formal SGML parsing is not necessary; a wide range of general-purpose text processing methods can also be easily applied to these files.)
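As one illustration of such general-purpose processing, the minimal Python sketch below (not an official tool) splits a single data file into <DOC> units with plain pattern matching. The file name "x9501003" is taken from an example later in this documentation, and a closing </TEXT> tag is assumed, as defined by the dtd files.

import re

# Read raw bytes; the Chinese content is GB-encoded, the tags are ASCII.
with open("x9501003", "rb") as f:
    raw = f.read()

# Split the file into <DOC> units and pull out each DOCNO.
for doc in re.findall(rb"<DOC>(.*?)</DOC>", raw, re.DOTALL):
    m = re.search(rb"<DOCNO>\s*(.*?)\s*</DOCNO>", doc, re.DOTALL)
    docno = m.group(1).decode("ascii", "replace") if m else "UNKNOWN"
    text = re.search(rb"<TEXT>(.*?)</TEXT>", doc, re.DOTALL)
    body = text.group(1) if text else b""
    # errors="replace" guards against the invalid byte values
    # described later in this documentation.
    print(docno, len(body.decode("gb18030", "replace")))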

In the process of creating this publication, a number of problems came to light involving the data content. The nature and causes of the problems were found to vary according to the source (Xinhua or People's Daily) and according to particular file sets within each source. These problems are discussed in detail below, and are tabulated in:

pd-char-err.summary
pd-missed-boundary.summary
pd-char-err.log

xh-char-err.summary
xh-char-err.log

Because TREC evaluations had already been done using the text data as distributed by NIST (prior to the creation of this publication), it was decided to keep the data in its original form, despite the problems with data content, so that earlier test runs could be replicated consistently using this publication and future benchmark test results on this distribution would be fully comparable to earlier results based on the ftp data set. (Ideally, a completely fresh data set, without content problems, should be made available to establish a new set of benchmark performance standards, and perhaps this will happen, but this goes beyond the goals of the present publication.)

In the most general terms, there were two types of problems in the original ftp distribution:

incorrect placement of some SGML tags
invalid (meaningless) character values in the text content

The SGML tag errors occurred only in the Xinhua data. We determined that these could be corrected without altering the actual text content of the affected files, and we therefore decided to fix these problems in the present publication. These changes to the Xinhua files should have no effect on TREC evaluation metrics.

The presence of invalid characters in the text data affects both Xinhua and People's Daily, but in different ways. Nothing has been done to correct or eliminate the bad character data; instead, we are providing a complete tabulation of these problems, as well as an overall description in the following sections.

Description of SGML tag errors in Xinhua data

In the original ftp distribution, 36 of the 41 Xinhua data files contained tagging errors of the following general form:

<DOC>
<DOCID> ... </DOCID>
<DOCNO> ... </DOCNO>
<DATE> ... </DATE>
<TEXT>
[text string containing ASCII and/or GB characters]
</p>
...

This pattern affected a total of about 440 stories in the 36 files. When the line of text immediately following the <TEXT> tag contained ASCII strings, these strings should have been included within the content of the preceding <DATE> tag. When the line included GB characters, this string of GB text should have received one of the following correct treatments:

<TEXT>
<headline> [GB character string] </headline>
...

or

<TEXT>
<p>
<s> [GB character string] </s>
...

For this release of the TREC Mandarin data, the appropriate repairs have been made in all these cases. None of the actual text content has been removed or altered in making these repairs; rather, we have altered the placement of some text relative to existing SGML tags, and have added some SGML tags as needed, so that the files conform to the tagging syntax defined by the xinhua.dtd.

Description of invalid character data in Xinhua files

The original (pre-SGML) form of Xinhua data was received by the LDC via a low-speed (1200 or 2400 baud) modem over a standard phone line. There was occasional noise interference and intermittent disruption of the modem connection, which caused undefined byte values to be included in the text capture; also, the host computer at the Xinhua news service sometimes produced unexpected characters as part of the data stream being transmitted.

As a result, 20 of the Xinhua files contain a total of 208 bytes that are not interpretable either as printable ASCII or as part of a valid 16-bit GB character code -- these bad byte values affect a total of 65 stories. A byte value of 0xFF (decimal 255) appears most frequently (44 occurrences, affecting 8 files), and it appears that this byte value should be treated as if it were part of a 16-bit GB code, even though it cannot be part of a valid GB character. In other words, if we remove just the individual 0xFF byte from within a GB text string, the subsequent bytes would be paired incorrectly for interpretation as GB text; to remove instances of 0xFF bytes with minimal damage to subsequent text, the line containing 0xFF must first be parsed into 2-byte GB characters, and the byte-pair containing 0xFF should be deleted from the string.
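The following Python sketch illustrates that repair strategy (it is not applied in this release, and the function name is hypothetical): the line is walked as 2-byte GB characters, and the pair containing the stray 0xFF is dropped so that the following characters stay correctly aligned.

def drop_ff_pairs(line: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(line):
        if line[i] < 0x80:
            # single-byte ASCII passes through unchanged
            out.append(line[i])
            i += 1
        else:
            # treat this byte as the start of a 2-byte GB character;
            # delete the whole pair if it contains the stray 0xFF
            pair = line[i:i + 2]
            if 0xFF not in pair:
                out.extend(pair)
            i += 2
    return bytes(out)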

The next most frequent case is the ASCII "bell" byte code (0x07, 18 occurrences affecting 5 files); these were found to occur in groups (e.g. three bell characters in a row), and were usually caused by system operator events on the Xinhua host computer (e.g. "Broadcast Message from root...") -- and the text of the broadcast is part of the captured data, interspersed with the Chinese news text.

The remaining anomalous bytes are scattered fairly evenly in the ranges of 0x01 - 0x1E (the ASCII "control" codes) and 0x7F - 0x9F. A complete sgml-parser error log, xh-char-err.log, lists all the anomalous bytes, giving the file, line number, byte offset on the line, and the DOCNO of the affected story. A summary of the errors, xh-char-err.summary, is also provided, listing the DOCNOs of stories that contain one or more errors, along with the total number of bad bytes in each story, and the particular byte values that occurred. For example, this line from the summary:

x9501003 CB016023-BFJ-372-225 4 12 90 FF

refers to a story identified by the DOCNO tag "CB016023-BFJ-372-225", which is found in data file "x9501003"; this story contains a total of four unusable bytes, having the hexadecimal values 12, 90 and FF (one of these values shows up twice in the story -- the log file contains complete information on the locations of these bytes in the data).
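A hypothetical helper for reading such summary lines, assuming whitespace-separated fields in the order shown above (data file, DOCNO, count of bad bytes, then the distinct hexadecimal byte values), might look like this:

def parse_summary_line(line: str):
    fields = line.split()
    data_file, docno, count = fields[0], fields[1], int(fields[2])
    bad_byte_values = [int(v, 16) for v in fields[3:]]
    return data_file, docno, count, bad_byte_values

# parse_summary_line("x9501003 CB016023-BFJ-372-225 4 12 90 FF")
# returns ("x9501003", "CB016023-BFJ-372-225", 4, [0x12, 0x90, 0xFF])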

Description of invalid character data in People's Daily files

The original (pre-SGML) form of People's Daily data was received by the LDC via a large number of MS-DOS floppy disks. It appears that the SGML tagging in the TREC collection was produced by a format conversion program that had been developed on the basis of inspecting the first month's worth of data, but its performance was not checked when it was applied to all the data that followed.

Based on the distribution of invalid character data in the resulting SGML files, it seems that there were some minor fluctuations in the format of the pre-SGML files during the 1991-1992 collection period; roughly half of these files have seven or fewer character errors per file, and the other half (mostly in 1992) have a few dozen character errors (up to 75 errors in one file). Note that nine of the twelve "pd92*" files contain null bytes -- the number of null bytes per file ranges from 5 (in "pd9207.sgml") to 51 (in "pd9204.sgml").

The pre-SGML files from 1993 apparently introduced some important format differences that went untreated in the conversion to SGML, with the result that there are, on average, about 3000 character errors in each of the "pd93*.sgml" files -- at least one unusable byte code in almost every story in this portion of the collection. (Actually, only 414 stories out of 36924 in the 1993 data files contain no character errors.) We have provided both a complete log of all the invalid character occurrences and a summary by file and DOCNO (pd-char-err.log and pd-char-err.summary, respectively).

The vast majority of 1993 stories contain exactly one unusable byte, and the value of this byte is usually 0xFF or 0x7F; unlike the Xinhua data, the 0xFF byte value in People's Daily files appears to have been used originally as a story-boundary marker, and is therefore isolated from (not part of) GB text strings. In other words, each individual 0xFF byte should be deleted or ignored by itself -- not as part of a GB byte-pair -- and this will not misalign the subsequent GB data. (Actually, the 0xFF bytes are typically found outside the <TEXT> portion of the <DOC> unit, e.g. in the content of the <HL> tag; but see below regarding other occurrences.)
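For the People's Daily case, then, a sketch of the corresponding cleanup (again, not applied in this release, and the function name is hypothetical) is simply a byte-level deletion:

def strip_isolated_ff(data: bytes) -> bytes:
    # 0xFF is an isolated leftover boundary marker here, so removing it
    # byte-by-byte does not misalign the surrounding GB byte pairs.
    return data.replace(b"\xff", b"")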

Missed story boundaries in People's Daily files

There is one other aspect of the 1993 format differences that may have a more serious impact on the quality of the data (and its suitability for information-retrieval research). After finding that the 0xFF bytes were used in the original raw data as part of the story boundary demarcation, we discovered that these bytes sometimes appeared one or more times within the <TEXT> portion of stories. Looking more closely at these instances, we found that other story boundary cues were present within the <TEXT> portion of many stories as well. In other words, the SGML conversion, when applied to 1993 People's Daily data, missed a number of actual story boundaries, and combined two or more distinct stories together within the <TEXT> portion of a single <DOC> unit.

This problem affects 1540 (about 4%) of the <DOC> units in the 1993 files; in these problem <DOC> units, a total of 4700 story boundaries are missed. In the majority of cases, only one or two boundaries are missed within a single <DOC>; there are 26 <DOC> units that each contain 10 or more missed boundaries; the largest number of missed boundaries in a single <DOC> is 17.

The visible (and most reliable) indication of a missed story boundary is the ASCII string "#11993" occurring within the <TEXT> portion of a story. (Note, however, that some programs for displaying GB characters, such as "cxterm" in X-Windows, will not display the initial "#" character, because it is preceded by 0xFF.) We have provided a table, pd-missed-boundary.summary, that lists the file names and DOCNOs for the stories that show this symptom, along with the number of times this pattern appears within the <TEXT> portion of a given story.
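Counts comparable to that table can be reproduced with a short sketch like the one below; the file name "pd9301.sgml" is assumed purely for illustration, following the "pd93*.sgml" naming pattern noted above, and the same <DOCNO> tagging shown for the Xinhua data is assumed.

import re

with open("pd9301.sgml", "rb") as f:
    raw = f.read()

for doc in re.findall(rb"<DOC>(.*?)</DOC>", raw, re.DOTALL):
    docno = re.search(rb"<DOCNO>\s*(.*?)\s*</DOCNO>", doc, re.DOTALL)
    text = re.search(rb"<TEXT>(.*?)</TEXT>", doc, re.DOTALL)
    if not text:
        continue
    # Count the missed-boundary marker inside the <TEXT> portion only.
    missed = text.group(1).count(b"#11993")
    if missed:
        print(docno.group(1).decode("ascii", "replace") if docno else "?", missed)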

Content Copyright

Portions © 1991-93 People's Daily
Portions © 1994-95 Xinhua


Contact: ldc@ldc.upenn.edu
© 1996-2000 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.