Multiple-Translation Chinese (MTC) Part 4
|Item Name:||Multiple-Translation Chinese (MTC) Part 4|
|LDC Catalog No.:||LDC2006T04|
|Release Date:||January 15, 2006|
|Application(s):||machine translation, natural language processing, standards|
|Language(s):||English, Mandarin Chinese|
|Language ID(s):||eng, cmn|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2006T04 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Ma, Xiaoyi. Multiple-Translation Chinese (MTC) Part 4 LDC2006T04. Web Download. Philadelphia: Linguistic Data Consortium, 2006.|
Multiple-Translation Chinese (MTC) Part 4 was developed by the Linguistic Data Consortium (LDC). It contains 100 Chinese newswire source files and their translations by four human translation teams and 11 machine translation (MT) systems, for a total of 1,500 translation files (100 source files × 15 translation sources), together with human assessments of more than 11,000 segments of the MT output. Of the 11 MT systems, five were commercial off-the-shelf (COTS) systems and six were participants in the TIDES 2003 MT Evaluation. Of the COTS systems, two were free web-based services and three were commercial software packages.
To determine whether automatic evaluation metrics, such as BLEU, track human judgments, LDC performed human assessments on the output of one of the five COTS systems and all six TIDES research systems. The corpus includes these assessment results, along with the specifications used to conduct the assessments.
The table below breaks down the source files by news agency:
|Xinhua News Agency||50||19,650|
|Agence France Presse||50||22,450|
The Chinese data contains approximately 21K words, while the English translations total 396K words, of which 16K are unique.
The original source files used GB-2312 encoding for the Chinese characters and SGML tags to mark sentence and paragraph boundaries and other information about each story. The character encoding is unaltered. To facilitate translation, nearly all SGML tags were removed or replaced by "plain text" markers. The markers were intended to ensure that the resulting translations could be easily aligned with the source texts, so extra care was taken to ensure that they were kept intact and properly oriented. Some normalization was performed on all files to conform to this format, including splitting long segments into smaller chunks and adding segment markers.
As a last step, all files were converted from UNIX-style line termination (new-line only) to MS-DOS-style (carriage-return plus line-feed) on the assumption that most (possibly all) translators would use MS-Windows-based editors.
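The encoding and line-termination choices described above matter when reading the files on a non-Windows system. A minimal sketch in Python, assuming a Chinese source file has been read as raw bytes; the function name `decode_mtc_source` is hypothetical and not part of the corpus or its documentation:

```python
def decode_mtc_source(raw: bytes) -> str:
    """Decode a Chinese source file and normalize its line endings.

    Assumes the bytes are GB-2312 encoded with MS-DOS-style (CRLF)
    line termination, as the corpus description states.
    """
    # Python's built-in "gb2312" codec handles the character encoding.
    text = raw.decode("gb2312")
    # Convert carriage-return plus line-feed back to UNIX-style new-lines.
    return text.replace("\r\n", "\n")


if __name__ == "__main__":
    # Illustrative bytes only, not taken from the corpus.
    sample = "新华社\r\n北京".encode("gb2312")
    print(decode_mtc_source(sample))
```

Working on bytes keeps the decoding step explicit, which avoids relying on the platform's default encoding or automatic newline translation.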
Human Translation: Each human translation team was required to submit an initial set of five stories for quality evaluation and, after receiving feedback, continued with the rest of its assigned stories. The remaining translations were monitored continuously for quality and adherence to the guidelines.
Machine Translation: Starting from the original SGML text format, special alterations were made to the files on an as-needed basis, so that they would be accepted and handled correctly by the various systems. Also, the systems differed in terms of the input and retrieval methods required to submit the source data for translation and to save the translated text in alignable form.
Human Assessment: The goal of this effort was to evaluate the quality of the output of TIDES research systems, human translation teams, and COTS systems. Translations were evaluated on the basis of adequacy and fluency. Adequacy refers to the degree to which the translation communicates the information present in the original source-language text. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language.
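Segment-level adequacy and fluency judgments are typically summarized per system before being compared with an automatic metric. The sketch below shows one way to do that, assuming each judgment has already been parsed into a `(system_id, adequacy, fluency)` tuple with numeric scores. That record layout, the system names, and the score values are hypothetical; the actual assessment format is defined by the specifications included in the corpus.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(judgments):
    """Average adequacy and fluency per system.

    `judgments` is an iterable of (system_id, adequacy, fluency) tuples.
    This layout is an assumption made for illustration only.
    """
    by_system = defaultdict(lambda: {"adequacy": [], "fluency": []})
    for system_id, adequacy, fluency in judgments:
        by_system[system_id]["adequacy"].append(adequacy)
        by_system[system_id]["fluency"].append(fluency)
    # Collapse each list of segment scores into a single mean.
    return {
        system_id: {name: mean(values) for name, values in scores.items()}
        for system_id, scores in by_system.items()
    }


if __name__ == "__main__":
    # Toy values, not taken from the corpus.
    toy = [("sysA", 4, 3), ("sysA", 2, 5), ("sysB", 5, 5)]
    print(mean_scores(toy))
```

Per-system means of this kind are what would be set against system-level BLEU scores to check whether the automatic metric tracks the human judgments.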
For an example of the data provided in this corpus, please review the following samples:
None at this time.