tags, only and
* the tags are always assigned sequential numeric IDs
starting at 1 for the first of each file, are always
placed on the same line with their contents, and are always
separated from the contents by a space.
The content of the translation files is identical to the source files
except:
* the initial tag contains an additional attribute:
where the sys_id string enclosed in double quotes matches the name
of the directory containing the file
* the contents of the tags are plain ASCII English text,
although most of the automatic MT systems included some strings of
untranslated GB character data in their output, and these are
retained as-is (see "trans_team.info" for more details)
To verify that all the files conformed to this SGML specification, and
were fully alignable at the level of segments, a custom validation
script (validate.perl) was written to perform a rigorous check across
the entire corpus. The script produced four output listings:
* filelist.source: lists source files and segments per file
* filelist.translation: lists translation files and segments per file
* validate.log: complete tabulation of segment sizes
* validate.err: lists empty translation segments (path/doc_id,seg_id)
and files containing any untranslated Chinese text
Each line of the validate.log file represents one segment in the set
of 105 stories (there are a total of 993 segments). The columns
provide the file name (doc_id), the segment number (seg_id), and for
each version of the file (source and 17 translations), the number of
bytes and number of space-separated tokens found in that version of
the segment. Column headings are provided in the first line of the
log file, and each line is about 192 characters wide.
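
As a rough illustration of that layout, the following Python sketch splits
one data line of the log into its fields. It assumes that the columns are
separated by whitespace and that each version contributes a byte count
followed by a token count; neither detail is stated above. The function
name is hypothetical.

```python
from typing import List, Tuple

N_VERSIONS = 18  # source plus 17 translations, per the description above


def parse_log_line(line: str) -> Tuple[str, int, List[Tuple[int, int]]]:
    """Split one validate.log data line into (doc_id, seg_id, sizes).

    Assumption: whitespace-separated columns, with a (bytes, tokens)
    pair for each version after the doc_id and seg_id columns.
    """
    fields = line.split()
    expected = 2 + 2 * N_VERSIONS
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    doc_id, seg_id = fields[0], int(fields[1])
    numbers = [int(f) for f in fields[2:]]
    # Pair up consecutive numbers: (bytes, tokens) for each version.
    sizes = [(numbers[i], numbers[i + 1]) for i in range(0, len(numbers), 2)]
    return doc_id, seg_id, sizes
```

Under these assumptions, a version whose pair is (0, 0) would correspond
to an empty segment. The header line would need to be skipped before
calling the function.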
The validate.err file reports two kinds of problems in the translated
files:
* no text in a segment
* one or more segments with untranslated Chinese/GB content in a file
There are 22 occurrences of the former problem and 466 occurrences of
the latter; most of these occur in the outputs of the machine-translation
systems.
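
The two checks can be sketched in Python as follows. This is not the
logic of validate.perl, which is not reproduced here. It assumes that
untranslated GB text can be recognized by any byte outside the 7-bit
ASCII range, since the translations are otherwise plain ASCII. The
function name and the labels it returns are hypothetical.

```python
from typing import List


def segment_problems(segment: bytes) -> List[str]:
    """Return the problems found in one translated segment's raw bytes."""
    problems = []
    # Problem 1: no text at all (nothing, or only whitespace).
    if not segment.strip():
        problems.append("empty")
    # Problem 2: bytes >= 0x80 cannot be ASCII; assumed to be GB residue.
    if any(b >= 0x80 for b in segment):
        problems.append("non-ascii")
    return problems
```

A segment is read as bytes rather than decoded text so that the check
does not depend on guessing the encoding of the untranslated material.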
Ranking of Manual Translations:
At the point when only the Xinhua translations had been received from
the various translation services, an initial ranking was performed by
two LDC personnel, one a Chinese-dominant bilingual and the other an
English-dominant bilingual. The two were in overall agreement on the
ranking, and minor discrepancies were resolved through discussion and
comparison of additional files. This initial ranking among the manual
translations is:

  best                                                        worst
  ta0 > ta4 > ta1 > ta2 > ta3 > ta5 > ta6 > ta7 > ta8 > ta9 > tb0

The ranking method was unstructured and somewhat casual -- it is not
intended to be definitive, or even accountable. A more systematic
assessment of translation quality, using 10 judges and a formal
protocol, is to begin as this data set is published, and the results
of that assessment will be released subsequently.
------------------------------------
David Graff, graff@ldc.upenn.edu
Shudong Huang, shudong@ldc.upenn.edu
January 24, 2002