ISI Chinese-English Automatically Extracted Parallel Corpus

This distribution contains a corpus of Chinese-English parallel sentences, which were extracted automatically from two monolingual corpora: Chinese Gigaword 2 (LDC catalog number LDC2006T02) and English Gigaword 2 (LDC catalog number LDC2005T12). The data was extracted from news articles published by the Xinhua News Agency and was obtained using the automatic parallel sentence identification method described in the following publication:

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving Machine Translation Performance by Exploiting Non-parallel Corpora. Computational Linguistics, 31(4):477-504. (A preliminary version can be found at http://www.isi.edu/~dragos/Docs/MunteanuMarcu_CL_2005.pdf, and the final version at http://portal.acm.org/citation.cfm?id=1110825.1110828.)

The corpus contains 558,567 sentence pairs; the word count on the English side is approximately 16M words. The sentences in the parallel corpus preserve the form and encoding of the texts in the original Gigaword corpora.

For each sentence pair in the corpus we provide the names of the documents from which the two sentences were extracted, as well as a confidence score (between 0.5 and 1.0) that indicates their degree of parallelism. Scores are negated for pairs that overlap with standard MT test sets, as described below. The parallel sentence identification approach is designed to judge sentence pairs in isolation from their contexts, and can therefore find parallel sentences within document pairs that are not themselves parallel. The fact that two documents share several parallel sentences does not necessarily mean the documents are parallel.

To make this resource useful for research in Machine Translation, we made efforts to detect potential overlaps between this data and the standard test and development data sets used by the MT community. The NIST 2002-2005 MT evaluation data sets contain several articles from Xinhua News. A sentence pair in our distribution is marked with a negative confidence score if it has a 7-gram overlap with a sentence pair in a NIST MT evaluation set, or if it comes from a document whose name is similar to one in the NIST MT sets.
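
The 7-gram overlap criterion can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the procedure actually used to build this distribution: the whitespace tokenization and the function names ngrams and has_ngram_overlap are hypothetical, and only one side of a sentence pair is compared at a time.

```python
def ngrams(tokens, n=7):
    """Return the set of all contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(sentence, test_sentences, n=7):
    """Return True if `sentence` shares at least one n-gram with any test sentence.

    Assumption: tokens are obtained by whitespace splitting. Text that is not
    whitespace-segmented (such as the Chinese side) would first need a
    segmenter or a character-level tokenization.
    """
    sentence_grams = ngrams(sentence.split(), n)
    if not sentence_grams:
        # Sentences shorter than n tokens cannot have an n-gram overlap.
        return False
    return any(sentence_grams & ngrams(other.split(), n) for other in test_sentences)
```

A pair flagged this way would then have its score negated rather than being dropped, which matches how overlapping pairs are marked in the score file.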

The distribution consists of 5 files:
- ISI_chi_eng_parallel_corpus.chi, ISI_chi_eng_parallel_corpus.eng: files containing the parallel text.
- ISI_chi_eng_parallel_corpus.chi.doc, ISI_chi_eng_parallel_corpus.eng.doc: metadata files indicating, for each sentence in the parallel corpus, the ID of the document from which it originated (the IDs are those used in the Gigaword 2 corpora).
- ISI_chi_eng_parallel_corpus.score: metadata file indicating a confidence score for each sentence pair in the corpus. For sentence pairs that overlap with the standard MT test sets, the scores are negative numbers (that is, zero minus the original score).
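
As a minimal sketch of how the five files might be read together, the Python code below walks them in lockstep and recovers the original confidence from a negated score. It assumes that line i of every file describes the same sentence pair (as the file excerpts in this document suggest) and that the text can be decoded with the given encoding; "utf-8" is only a placeholder default and should be replaced with the encoding of the Gigaword source data. The names SentencePair, parse_score, read_pairs, and training_pairs are hypothetical and not part of the distribution.

```python
from contextlib import ExitStack
from typing import Iterator, NamedTuple, Tuple

# File suffixes as listed in the distribution, in the order they are zipped.
SUFFIXES = (".chi", ".eng", ".chi.doc", ".eng.doc", ".score")


class SentencePair(NamedTuple):
    chinese: str
    english: str
    chi_doc: str
    eng_doc: str
    confidence: float        # original score, always non-negative
    overlaps_test_set: bool  # True if the stored score was negative


def parse_score(raw: str) -> Tuple[float, bool]:
    """Split a stored score into (original confidence, overlap flag)."""
    value = float(raw.strip())
    return abs(value), value < 0


def read_pairs(prefix: str, encoding: str = "utf-8") -> Iterator[SentencePair]:
    """Yield one SentencePair per line, reading all five files in parallel."""
    with ExitStack() as stack:
        handles = [stack.enter_context(open(prefix + suffix, encoding=encoding))
                   for suffix in SUFFIXES]
        for chi, eng, chi_doc, eng_doc, score in zip(*handles):
            confidence, overlaps = parse_score(score)
            yield SentencePair(chi.rstrip("\n"), eng.rstrip("\n"),
                               chi_doc.strip(), eng_doc.strip(),
                               confidence, overlaps)


def training_pairs(prefix: str, min_confidence: float = 0.5,
                   encoding: str = "utf-8") -> Iterator[SentencePair]:
    """Yield only pairs that do not overlap with the MT test sets."""
    for pair in read_pairs(prefix, encoding):
        if not pair.overlaps_test_set and pair.confidence >= min_confidence:
            yield pair
```

For example, training_pairs("ISI_chi_eng_parallel_corpus", min_confidence=0.8) would skip every negatively scored pair and keep only the remaining pairs whose confidence is at least 0.8. The threshold value here is arbitrary and chosen only for illustration.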


Below are several example sentence pairs from the corpus, together with their confidence scores.

他指责中国至今不愿采取必要措施保护美国的知识产权,并称中国拒绝对美国电影、录相及音像产品开放市场
Kantor is quoted as saying that "to date, China has been unwilling to take the necessary steps to protect American intellectual property rights."
(Confidence: 0.934114)

经贸部发言人就此向记者发表谈话指出,美方这一做法是不顾中国在保护知识产权方面所取得的重大进展以及中国政府在双边磋商中表现出的诚意和灵活性的行为
With such an intention in mind, the United States is trying to negate the great efforts and marked progress made by China in protecting intellectual property rights.
(Confidence: 0.819396)

国家旅游局局长刘毅在致辞中介绍说,改革开放的中国,旅游业象朝阳一样,较快地发展成为新兴的重点产业
Liu noted that China has become one of the important tourism destinations in the world and its tourism industry flourished over the past years "like the rising sun."
(Confidence: 0.748064)

(记者余瑛瑞)以厦门高崎机场为核心组建的厦门国际航空港集团有限公司今天正式挂牌,这是继深圳之后中国大陆第二个国际航空港集团公司
(Group) Ltd., the second of its kind in China after Shenzhen Airport, was established today in Xiamen, a coastal city in east China's Fujian Province.
(Confidence: 0.675459)

美国提出的报复清单是中国政府绝对不能接受的
And the Chinese side would certainly not accept the unreasonable demands put forward by the Americans concerning the protection of intellectual property rights.
(Confidence: 0.530623)


To illustrate the file format, we list below the first 5 lines from each file in the distribution:

ISI_chi_eng_parallel_corpus.eng
Trade Representative Mickey Kantor announced Saturday that Washington would impose trade sanctions on Chinese goods worth 2.8 billion U.S. dollars if China could not meet the demands raised by the U.S. for the protection of intellectual property rights.
Kantor is quoted as saying that "to date, China has been unwilling to take the necessary steps to protect American intellectual property rights."
With such an intention in mind, the United States is trying to negate the great efforts and marked progress made by China in protecting intellectual property rights.
And the Chinese side would certainly not accept the unreasonable demands put forward by the Americans concerning the protection of intellectual property rights.
In fact, China has completed its legislation on the protection of intellectual property rights within a dozen years, which had actually taken some developed countries several decades--or even a century--to do so.

ISI_chi_eng_parallel_corpus.eng.doc
XIN_ENG_19950101.0059
XIN_ENG_19950101.0066
XIN_ENG_19950101.0066
XIN_ENG_19950101.0066
XIN_ENG_19950101.0066

ISI_chi_eng_parallel_corpus.chi
(记者应谦)美国贸易代表坎特31单方面宣布,如果中国在明年24日之前不能满足美方提出的有关保护美国知识产权的要求,美国将对中国实行贸易制裁
他指责中国至今不愿采取必要措施保护美国的知识产权,并称中国拒绝对美国电影、录相及音像产品开放市场
经贸部发言人就此向记者发表谈话指出,美方这一做法是不顾中国在保护知识产权方面所取得的重大进展以及中国政府在双边磋商中表现出的诚意和灵活性的行为
美国提出的报复清单是中国政府绝对不能接受的
,中国用短短十几年时间,完成了一些发达国家通常需要几十年甚至上百年才能完成的立法路程,建立了比较完整的知识产权保护体系

ISI_chi_eng_parallel_corpus.chi.doc
XIN_CMN_19950101.0001
XIN_CMN_19950101.0001
XIN_CMN_19950101.0002
XIN_CMN_19950101.0002
XIN_CMN_19950101.0002

ISI_chi_eng_parallel_corpus.score
0.807777
0.934114
0.819396
0.530623
-0.969935