Chinese News Translation Text Part 1
Item Name: | Chinese News Translation Text Part 1 |
Author(s): | Xiaoyi Ma |
LDC Catalog No.: | LDC2005T06 |
ISBN: | 1-58563-329-1 |
ISLRN: | 008-710-816-829-0 |
DOI: | https://doi.org/10.35111/9n1n-0q43 |
Release Date: | March 15, 2005 |
Member Year(s): | 2005 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Project(s): | GALE, TIDES |
Application(s): | cross-lingual information retrieval, language teaching, machine translation |
Language(s): | English, Mandarin Chinese |
Language ID(s): | eng, cmn |
License(s): | LDC User Agreement for Non-Members |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Ma, Xiaoyi. Chinese News Translation Text Part 1 LDC2005T06. Web Download. Philadelphia: Linguistic Data Consortium, 2005. |
Introduction
Chinese News Translation Text Part 1 was developed by the Linguistic Data Consortium (LDC) and contains approximately 474,000 characters of Chinese text and corresponding English translations, totalling approximately 285,000 words.
All stories in this corpus were collected, and all translations produced, as Machine Translation (MT) training data for DARPA's Translingual Information Detection, Extraction and Summarization (TIDES) program. The stories were selected and translated in several LDC projects between February 2003 and January 2005. Seven translation agencies provided the translations, following roughly the same guidelines and procedures, and each Chinese news story was translated once.
Data
The Chinese material was drawn from two sources of journalistic text, collected from July 2002 to September 2002 and from April 2004 to August 2004:
- Agence France-Presse News Service: 580 news stories
- Xinhua News Service: 421 news stories
- Total: 1,001 stories
The original source files used GB encoding for the Chinese characters. They also used SGML tags for marking sentence and paragraph boundaries and other information about each story. To make things easier for translators, nearly all SGML tags were removed, or replaced by "plain text" markers.
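As a rough illustration of handling text of this kind, the sketch below decodes GB-encoded bytes and strips SGML-style tags. It assumes Python's `gb18030` codec (a superset of GB2312 and GBK) suits the files at hand. The function name `decode_and_strip` and the sample `<P>` tag are hypothetical and are not taken from the corpus documentation.

```python
import re

# Matches any SGML-style tag such as <P> or </DOC>; illustrative only.
TAG_RE = re.compile(r"<[^>]+>")


def decode_and_strip(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode GB-encoded bytes and remove SGML-style tags.

    The default encoding is an assumption: gb18030 is a superset of
    GB2312 and GBK, so it decodes text written in either.
    """
    text = raw.decode(encoding)
    return TAG_RE.sub("", text).strip()


# Hypothetical sample, not taken from the corpus.
sample = "<P>中文新闻</P>".encode("gb2312")
print(decode_and_strip(sample))
```

A real pipeline would probably use the tags for segmentation rather than discard them, since they mark sentence and paragraph boundaries.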
Each translation team was provided with translation guidelines, which were modified several times during the development of these data. Each team began with five stories, which were checked for quality before the team took on larger amounts of data. Subsequent translation submissions were continuously monitored for conformance and quality.
For the present release, the corpus content is organized into "source" and "translation" directories. The source directory and each of the human translation subdirectories contain 1,001 files, one news story per file. Each translation file has the same name as its corresponding source file. The source and translation files are offered in SGML format.
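Because source and translation files share file names, aligned story pairs can be built by matching names across directories. The following is a minimal sketch under stated assumptions: the helper `pair_files`, the directory names, and the `.sgm` file names in the demo are hypothetical, and the actual names of the translation subdirectories should be checked against the release itself.

```python
import tempfile
from pathlib import Path


def pair_files(source_dir, translation_dir):
    """Pair files in two directories that share the same file name.

    Returns (pairs, unmatched): a sorted list of (source, translation)
    path tuples, and the sorted names of source files with no match.
    """
    source_dir, translation_dir = Path(source_dir), Path(translation_dir)
    pairs, unmatched = [], []
    for src in sorted(p for p in source_dir.iterdir() if p.is_file()):
        candidate = translation_dir / src.name
        if candidate.is_file():
            pairs.append((src, candidate))
        else:
            unmatched.append(src.name)
    return pairs, unmatched


# Demo on a throwaway directory tree with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    src_dir = Path(root) / "source"
    tgt_dir = Path(root) / "translation"
    src_dir.mkdir()
    tgt_dir.mkdir()
    for name in ("story_a.sgm", "story_b.sgm"):
        (src_dir / name).write_text("source text", encoding="utf-8")
    (tgt_dir / "story_a.sgm").write_text("translated text", encoding="utf-8")

    pairs, unmatched = pair_files(src_dir, tgt_dir)
    matched_names = [src.name for src, _ in pairs]

print(matched_names)
print(unmatched)
```

On the real corpus, `pair_files` would be called once per translation subdirectory. An empty `unmatched` list and 1,001 pairs would agree with the file counts stated above.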
Samples
For an example of the data in this corpus, please examine this translation sample (TXT).
Updates
None at this time.