Chinese News Translation Text Part 1

Item Name: Chinese News Translation Text Part 1
Author(s): Xiaoyi Ma
LDC Catalog No.: LDC2005T06
ISBN: 1-58563-329-1
ISLRN: 008-710-816-829-0
DOI: https://doi.org/10.35111/9n1n-0q43
Release Date: March 15, 2005
Member Year(s): 2005
DCMI Type(s): Text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): cross-lingual information retrieval, language teaching, machine translation
Language(s): English, Mandarin Chinese
Language ID(s): eng, cmn
License(s): LDC User Agreement for Non-Members
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Ma, Xiaoyi. Chinese News Translation Text Part 1 LDC2005T06. Web Download. Philadelphia: Linguistic Data Consortium, 2005.

Introduction

Chinese News Translation Text Part 1 was developed by the Linguistic Data Consortium (LDC) and contains approximately 474,000 characters of Chinese text and corresponding English translations, totalling approximately 285,000 words.

All the stories in this corpus were collected, and all translations made, as Machine Translation (MT) training data for DARPA's Translingual Information Detection, Extraction and Summarization (TIDES) program. They were selected and translated in different LDC projects between February 2003 and January 2005. Translation services were provided by seven translation agencies following roughly the same guidelines and procedures, and each Chinese news story was translated once.

Data

The Chinese material was drawn from two newswire sources, with stories collected from July 2002 to September 2002 and from April 2004 to August 2004:

  • Agence France-Presse News Service: 580 news stories
  • Xinhua News Service: 421 news stories
  • Total: 1,001 stories

The original source files used GB encoding for the Chinese characters. They also used SGML tags to mark sentence and paragraph boundaries and other information about each story. To make things easier for translators, nearly all SGML tags were removed or replaced with "plain text" markers.
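As an illustration of handling GB-encoded text, the following is a minimal sketch in Python. It assumes the bytes being read are in a GB-family encoding (GB2312 or GBK); the encoding of the released files is not stated here and should be checked before relying on this. The function name decode_gb is hypothetical. The gb18030 codec is used because it is backward compatible with GB2312 and GBK.

```python
def decode_gb(raw: bytes) -> str:
    """Decode GB-family bytes (GB2312, GBK, GB18030) into a Unicode string.

    Assumption: the input really is GB-encoded. Undecodable bytes are
    replaced with U+FFFD rather than raising, so damaged files still load.
    """
    return raw.decode("gb18030", errors="replace")


if __name__ == "__main__":
    # Round-trip a short string through GB2312 to show the decoding step.
    sample = "新华社".encode("gb2312")
    print(decode_gb(sample))
```

To read a file, open it in binary mode and pass the bytes to decode_gb, or pass encoding="gb18030" to open() directly.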

Each translation team was provided with translation guidelines, which were modified several times during the development of these data. Each team began with five stories, which were checked for quality before the team took on larger amounts of data. Subsequent translation submissions were continuously monitored for conformance and quality.

For the present release, the corpus content is organized into "source" and "translation" directories. The source directory and each of the human translation subdirectories contain 1,001 files, one news story per file. Each source file and its translation share the same file name. The source and translation files are offered in SGML format.
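Because source and translation files share file names, the two sides can be aligned by name. The following is a minimal sketch in Python, not a description of any LDC tooling. The function name pair_files is hypothetical, and the exact directory layout (one or more subdirectories under the translation directory) is an assumption.

```python
from pathlib import Path


def pair_files(source_dir, translation_dir):
    """Return (source_path, translation_path) pairs matched by file name.

    Assumption: translations may sit in subdirectories of translation_dir,
    so that directory is searched recursively. If the same file name occurs
    in several subdirectories, the last one found is used.
    """
    source_dir = Path(source_dir)
    translation_dir = Path(translation_dir)

    # Index every translation file by its bare file name.
    translations = {p.name: p for p in translation_dir.rglob("*") if p.is_file()}

    pairs = []
    for src in sorted(p for p in source_dir.iterdir() if p.is_file()):
        tgt = translations.get(src.name)
        if tgt is not None:  # skip source files with no translation
            pairs.append((src, tgt))
    return pairs
```

Calling pair_files("source", "translation") from the corpus root would then yield one pair per story, provided the layout matches the assumption above.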

Samples

For an example of the data in this corpus, please examine this translation sample (TXT).

Updates

None at this time.
