Chinese English News Magazine Parallel Text
Item Name: | Chinese English News Magazine Parallel Text |
Author(s): | Xiaoyi Ma |
LDC Catalog No.: | LDC2005T10 |
ISBN: | 1-58563-333-X |
ISLRN: | 629-451-208-314-7 |
DOI: | https://doi.org/10.35111/28bx-hc14 |
Release Date: | June 15, 2005 |
Member Year(s): | 2005 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Project(s): | GALE, TIDES |
Application(s): | machine translation |
Language(s): | Mandarin Chinese |
Language ID(s): | cmn |
License(s): | LDC User Agreement for Non-Members |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Ma, Xiaoyi. Chinese English News Magazine Parallel Text LDC2005T10. Web Download. Philadelphia: Linguistic Data Consortium, 2005. |
Introduction
Chinese English News Magazine Parallel Text was developed by the Linguistic Data Consortium (LDC). It contains Chinese news stories (20 million characters) and their English translations (9 million words), aligned at the sentence level.
The data, collected by LDC, consists of content published in Sinorama Magazine (Taiwan) from 1976 to 2004. It totals 6,366 story pairs and 365,568 sentence pairs.
Data
Sinorama Magazine is published monthly in several languages, including Chinese, English, and Japanese. LDC received its 1976 to 2000 publications on a single CD and obtained its 2001 to 2004 publications via Sinorama's website.
The Sinorama Chinese text was encoded in Big5. The data arrived aligned by story but without sentence-level alignment; LDC produced the sentence alignment using Champollion version 1.1.
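For readers who need to work with Big5 text in a UTF-8 pipeline, a conversion along the following lines may help. This is a minimal sketch: the function name `big5_to_utf8` is hypothetical, and whether the released files are still in Big5 is an assumption that should be checked against the corpus documentation before use.

```python
def big5_to_utf8(raw: bytes) -> bytes:
    """Re-encode Big5 bytes as UTF-8.

    Assumption: the input really is Big5. Bytes that cannot be decoded
    are replaced with U+FFFD rather than raising an error.
    """
    text = raw.decode("big5", errors="replace")
    return text.encode("utf-8")


# Usage: round-trip a short Chinese string through the converter.
sample = "中文".encode("big5")
converted = big5_to_utf8(sample)
```

If some characters fail to decode, Python's `cp950` or `big5hkscs` codecs cover vendor extensions to Big5 and may be worth trying instead.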
The data directory contains subdirectories for the Chinese documents, the English documents, and the sentence-level alignment. Each English or Chinese file may contain one or more documents, and each document is formatted in SGML. Documents are tagged with DOCIDs, and each segment (generally a sentence) within a document is given a numerical SEG ID, starting at one for each document. The alignment files contain SGML-formatted lines that map the English translations to their Chinese counterparts by specifying segment IDs in the form of EnglishSegId= "#" ChineseSegId= "#". Either field, EnglishSegId or ChineseSegId, may hold no segment ID, one segment ID, or more than one segment ID.
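The alignment format described above could be read with a small parser such as the one below. It is a sketch under stated assumptions, not the corpus's official tooling: the function name `parse_alignment_line` and the `<alignment ...>` tag in the usage example are hypothetical, and the separator between multiple segment IDs (commas or whitespace) is a guess, since the description here does not specify it.

```python
import re

# Matches EnglishSegId / ChineseSegId attributes, tolerating optional
# whitespace around "=" as in the form shown in the description.
_ATTR = re.compile(r'(EnglishSegId|ChineseSegId)\s*=\s*"([^"]*)"')


def _split_ids(value):
    # Assumption: multiple IDs are separated by commas and/or whitespace.
    return [int(tok) for tok in re.split(r"[,\s]+", value.strip()) if tok]


def parse_alignment_line(line):
    """Return (english_ids, chinese_ids) as lists of ints.

    Returns None when the line does not carry both attributes.
    An empty attribute value yields an empty list (no aligned segment).
    """
    attrs = dict(_ATTR.findall(line))
    if "EnglishSegId" not in attrs or "ChineseSegId" not in attrs:
        return None
    return _split_ids(attrs["EnglishSegId"]), _split_ids(attrs["ChineseSegId"])


# Usage: the tag name "alignment" is hypothetical, for illustration only.
one_to_many = parse_alignment_line('<alignment EnglishSegId= "1" ChineseSegId= "1,2">')
unaligned = parse_alignment_line('<alignment EnglishSegId="" ChineseSegId="3">')
```

Returning lists for both sides keeps the none, one, and many cases uniform, so downstream code can treat an empty list as an unaligned segment.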
Samples
The following files provide an example of this corpus:
Updates
None at this time.
Copyright
Portions © 1976-2004 Sinorama Magazine
Portions © 2005 Trustees of the University of Pennsylvania