Multiple-Translation Chinese (MTC) Part 3

Item Name: Multiple-Translation Chinese (MTC) Part 3
Author(s): Xiaoyi Ma
LDC Catalog No.: LDC2004T07
ISBN: 1-58563-289-9
ISLRN: 026-006-085-012-3
DOI: https://doi.org/10.35111/9nxq-9e06
Release Date: July 12, 2004
Member Year(s): 2004
DCMI Type(s): Text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): cross-lingual information retrieval, language teaching, machine translation
Language(s): English, Mandarin Chinese
Language ID(s): eng, cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2004T07 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Ma, Xiaoyi. Multiple-Translation Chinese (MTC) Part 3 LDC2004T07. Web Download. Philadelphia: Linguistic Data Consortium, 2004.

Introduction

Multiple-Translation Chinese (MTC) Part 3 was developed by the Linguistic Data Consortium (LDC) and contains approximately 21,000 words of Chinese newswire with their translations by four different translation teams, totaling approximately 100,000 English words.

This corpus is the third in a series of four corpora created to support the development of automatic means for evaluating translation quality. The other corpora in the series are Parts 1, 2 and 4.

All four parts contain unique source texts. Part 1 contains multiple human translations and machine translations (MT) of the source text, and Parts 2 and 4 contain multiple human and machine translations along with MT assessment. This corpus, Part 3, contains only source text and four sets of human translations. For Part 1, 11 translation teams were selected to create the human translations; for the remaining parts, the four best teams from the original 11 were selected.

To support the development of automatic means for evaluating translation quality, LDC was sponsored to solicit four sets of human translations for a single set of Mandarin Chinese source materials.
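As an illustration of why several independent human translations are useful for automatic evaluation, the sketch below computes clipped unigram precision of a candidate translation against multiple references. This is the kind of building block used by BLEU-style metrics; the source does not name a specific metric, so the choice of technique, the function name and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def clipped_unigram_precision(candidate: list[str], references: list[list[str]]) -> float:
    """Fraction of candidate tokens that are matched by at least one reference.

    Each candidate token is credited at most as many times as it occurs in
    the single reference where it is most frequent (clipping).
    """
    if not candidate:
        return 0.0

    # Highest count of each token across all references.
    max_ref_counts: Counter = Counter()
    for ref in references:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)

    # Clip candidate counts by the reference maximum.
    matched = sum(
        min(count, max_ref_counts[token])
        for token, count in Counter(candidate).items()
    )
    return matched / len(candidate)


# Hypothetical toy data, not taken from the corpus.
candidate = "the the the cat".split()
references = ["the cat sat".split(), "a cat is on the mat".split()]
score = clipped_unigram_precision(candidate, references)
```

With more references, a legitimate wording choice in the candidate is more likely to be matched, which is one reason a corpus with four reference translations is helpful for this kind of work.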

Data

The data was drawn from two sources of journalistic Mandarin Chinese text, AFP News Service and Xinhua News Service. The text was drawn from the May and June 2002 collection of both sources.

The story selection from the two newswire collections was controlled by story length: all selected stories contain between about 230 and 564 Chinese characters. The overall count of Chinese characters by source is shown in the following table:

Source    Stories    Chinese Characters
AFP       50         22,135
Xinhua    50         20,321
Total     100        42,456
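The length-based story selection described above can be sketched as a simple filter. This is a minimal illustration, not LDC's actual procedure: the function names are hypothetical, and counting only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is an assumption, since the source does not say how Chinese characters were counted.

```python
def count_chinese_chars(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs block.

    Assumption: punctuation, digits and Latin letters are not counted.
    """
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def select_by_length(
    stories: dict[str, str],
    min_chars: int = 230,
    max_chars: int = 564,
) -> list[str]:
    """Return IDs of stories whose character count lies within the bounds (inclusive)."""
    return [
        story_id
        for story_id, text in stories.items()
        if min_chars <= count_chinese_chars(text) <= max_chars
    ]


# Hypothetical stories built from a repeated character, for illustration only.
stories = {
    "too_short": "中" * 229,
    "lower_bound": "中" * 230 + "abc, 。",
    "upper_bound": "中" * 564,
    "too_long": "中" * 565,
}
selected = select_by_length(stories)
```

The default bounds are the 230 and 564 character limits stated above; the source describes them as approximate.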

The Chinese data contains approximately 21,000 words. The English translations contain approximately 100,000 words in total, of which about 12,000 are unique.

In accordance with the guidelines, each translation team was asked to return the first 10 Xinhua stories for quality checking. This was to ensure that each team had understood and was following the guidelines, and that the translation quality was acceptable. LDC returned the translations to the team whenever deviations from the guidelines or quality issues were detected.

Subsequent translation submissions were continuously monitored for conformance and quality. Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure that segments could be aligned and to convert the translated texts into SGML format.
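One simple form such an alignment check could take is comparing segment counts per document between the source and each team's translation. The sketch below assumes the counts have already been extracted from the files; the data layout, team labels and document IDs are hypothetical, and the source does not describe how the validation was actually implemented.

```python
def find_misaligned(
    source_segments: dict[str, int],
    translations: dict[str, dict[str, int]],
) -> list[tuple[str, str]]:
    """Return (team, doc_id) pairs whose segment count differs from the source.

    A document missing from a team's submission is also reported.
    """
    problems = []
    for team, docs in translations.items():
        for doc_id, n_source in source_segments.items():
            if docs.get(doc_id) != n_source:
                problems.append((team, doc_id))
    return problems


# Hypothetical segment counts, for illustration only.
source_segments = {"doc1": 5, "doc2": 3}
translations = {
    "team_a": {"doc1": 5, "doc2": 3},  # fully aligned
    "team_b": {"doc1": 4, "doc2": 3},  # doc1 has one segment too few
    "team_c": {"doc1": 5},             # doc2 is missing
}
problems = find_misaligned(source_segments, translations)
```

Matching counts are a necessary condition for segment-level alignment, not a sufficient one; checking that segment contents correspond would need further review.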

Each translation team was also asked to fill out and return a questionnaire to describe their procedures and professional background.

Samples

For examples of the data in this corpus, please view the Chinese (CHN) and English (ENG) samples.

Updates

There are no updates available at this time.
