Multiple-Translation Chinese (MTC) Part 3
Item Name: | Multiple-Translation Chinese (MTC) Part 3 |
Author(s): | Xiaoyi Ma |
LDC Catalog No.: | LDC2004T07 |
ISBN: | 1-58563-289-9 |
ISLRN: | 026-006-085-012-3 |
DOI: | https://doi.org/10.35111/9nxq-9e06 |
Release Date: | July 12, 2004 |
Member Year(s): | 2004 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Project(s): | GALE, TIDES |
Application(s): | cross-lingual information retrieval, language teaching, machine translation |
Language(s): | English, Mandarin Chinese |
Language ID(s): | eng, cmn |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2004T07 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Ma, Xiaoyi. Multiple-Translation Chinese (MTC) Part 3 LDC2004T07. Web Download. Philadelphia: Linguistic Data Consortium, 2004. |
Related Works: | View |
Introduction
Multiple-Translation Chinese (MTC) Part 3 was developed by the Linguistic Data Consortium (LDC). It contains approximately 21,000 words of Chinese newswire text along with translations by four different translation teams, totaling approximately 100,000 English words.
This corpus is the third in a series of corpora created to support the development of automatic means for evaluating translation quality. The other corpora in this series are:
- Multiple-Translation Chinese Corpus (LDC2002T01)
- Multiple-Translation Chinese (MTC) Part 2 (LDC2003T17)
- Multiple-Translation Chinese (MTC) Part 4 (LDC2006T04)
All four parts contain unique source texts. The first part contains multiple human translations and machine translations (MT) of the source text, and Parts 2 and 4 contain multiple human and machine translations along with MT assessment. This corpus, Part 3, contains only source text and four sets of human translation. For the first part, 11 translation teams were selected to create the human translations, and for the rest of the parts, the four best teams from the original 11 were selected to create translations.
To support the development of automatic means for evaluating translation quality, LDC was sponsored to solicit four sets of human translations for a single set of Mandarin Chinese source materials.
Data
The data was drawn from two sources of journalistic Mandarin Chinese text, AFP News Service and Xinhua News Service. The text was drawn from the May and June 2002 collection of both sources.
The story selection from the two newswire collections was controlled by story length: all selected stories contain between about 230 and 564 Chinese characters. The overall count of Chinese characters by source is shown in the following table:
Source | Stories | Chinese Characters |
---|---|---|
AFP | 50 | 22,135 |
Xinhua | 50 | 20,321 |
Total | 100 | 42,456 |
The Chinese source data contains approximately 21,000 words. The English translations contain approximately 100,000 words in total, with approximately 12,000 unique words.
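The length-controlled story selection described above can be illustrated with a short sketch. This is not LDC's actual selection procedure: the function names, the dictionary-based input, and the choice to count only code points in the basic CJK Unified Ideographs range as "Chinese characters" are all assumptions made for illustration. The bounds are the approximate figures quoted above.

```python
# Approximate bounds quoted in the text; treated here as inclusive (an assumption).
MIN_CHARS = 230
MAX_CHARS = 564


def count_chinese_chars(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs block (assumed definition)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def select_stories(stories: dict, lo: int = MIN_CHARS, hi: int = MAX_CHARS) -> dict:
    """Return {story_id: char_count} for stories whose length falls within [lo, hi].

    `stories` is a hypothetical mapping from story ID to raw story text.
    """
    selected = {}
    for story_id, text in stories.items():
        n = count_chinese_chars(text)
        if lo <= n <= hi:
            selected[story_id] = n
    return selected
```

Summing the returned counts per source would give per-source character totals comparable in form to those in the table above.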
In accordance with the guidelines, each translation team was asked to return the first 10 Xinhua stories for quality checking. This was to ensure that each team had understood and was following the guidelines, and that the translation quality was acceptable. When LDC detected deviations from the guidelines or quality issues, it sent the translations back to the translation team.
Subsequent translation submissions were continuously monitored for conformance and quality. Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure that segments could be aligned and to convert the translated texts into SGML format.
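One way to picture a segment alignability check is the sketch below. It does not reproduce LDC's validation tooling or the corpus's SGML markup, neither of which is described here. It assumes, hypothetically, that each document has already been reduced to an ordered list of segment IDs for the source and for each translation team, and it flags any team whose segment sequence differs from the source.

```python
def alignment_problems(source_ids, translations):
    """Report teams whose segment IDs do not match the source sequence.

    `source_ids` is the ordered list of segment IDs in the source document.
    `translations` maps a team label to that team's ordered segment IDs.
    Both structures are assumptions made for this illustration.
    """
    expected = list(source_ids)
    expected_set = set(expected)
    problems = {}
    for team, seg_ids in translations.items():
        seg_ids = list(seg_ids)
        if seg_ids == expected:
            continue  # identical sequence: every segment can be aligned one to one
        seen = set(seg_ids)
        problems[team] = {
            "missing": [s for s in expected if s not in seen],
            "extra": [s for s in seg_ids if s not in expected_set],
        }
        # If both lists are empty, the team has reordered or duplicated segments.
    return problems
```

An empty result means every translation has exactly the same segments as the source, in the same order.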
Each translation team was also asked to fill out and return a questionnaire to describe their procedures and professional background.
Samples
For examples of the data in this corpus, please view these Chinese (CHN) and English (ENG) samples.
Updates
There are no updates available at this time.