ISI Chinese-English Automatically Extracted Parallel Corpus
This distribution contains a corpus of Chinese-English parallel sentences,
which were extracted automatically from two monolingual corpora: Chinese
Gigaword 2 (LDC catalog number LDC2006T02) and English Gigaword 2 (LDC catalog
number LDC2005T12). The data was extracted from news articles published by the
Xinhua News Agency and was obtained using the automatic parallel sentence
identification method described in the following publication:
Dragos Stefan Munteanu, Daniel Marcu, 2005. Improving
Machine Translation Performance by Exploiting Non-parallel Corpora,
Computational Linguistics, 31(4):477-504 (a preliminary version can be found at
http://www.isi.edu/~dragos/Docs/MunteanuMarcu_CL_2005.pdf, and the final
version at http://portal.acm.org/citation.cfm?id=1110825.1110828)
The corpus contains 558,567 sentence pairs; the word count on the English side
is approximately 16M words. The sentences in the parallel corpus preserve the
form and encoding of the texts in the original Gigaword
corpora.
For each sentence pair in the corpus we provide the names of the documents from
which the two sentences were extracted, as well as a confidence score (between
0.5 and 1.0) that indicates their degree of parallelism. The parallel sentence
identification approach is designed to judge sentence pairs in isolation from
their contexts, and can therefore find parallel sentences within document pairs
which are not parallel. The fact that two documents share several parallel
sentences does not necessarily mean the documents are parallel.
In order to make this resource useful for research in Machine Translation, we
made efforts to detect potential overlaps between this data and the standard
test and development data sets used by the MT community. The NIST 2002-2005 MT
evaluation data sets contain several articles from Xinhua
News. A sentence pair in our distribution is marked with a negative confidence
score if it has a 7-gram overlap with a sentence pair in a NIST MT evaluation
set, or if it comes from documents whose names are similar to those in the NIST
MT sets.
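
The 7-gram overlap check can be illustrated with a short Python sketch. It is
not the procedure actually used to build this distribution: the tokenization,
the side(s) of the pair that were compared, and the document-name similarity
test are not specified here. The sketch assumes whitespace-tokenized English
text, and all function names (ngrams, build_index, overlaps, mark) are
hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int = 7) -> Set[NGram]:
    """Return the set of all contiguous n-token sequences in `tokens`."""
    # For sentences shorter than n tokens the range is empty, so the set is too.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(test_sentences: Iterable[str], n: int = 7) -> Set[NGram]:
    """Collect every n-gram that occurs in the held-out test sentences."""
    index: Set[NGram] = set()
    for sentence in test_sentences:
        # Assumption: whitespace tokenization (reasonable for English only).
        index |= ngrams(sentence.split(), n)
    return index


def overlaps(sentence: str, index: Set[NGram], n: int = 7) -> bool:
    """True if the sentence shares at least one n-gram with the index."""
    return not ngrams(sentence.split(), n).isdisjoint(index)


def mark(score: float, is_overlap: bool) -> float:
    """Negate the confidence score of a pair that overlaps with a test set."""
    return -score if is_overlap else score
```

For the Chinese side, which has no whitespace between words, one would have to
choose a different unit (for example characters or segmented words); that
choice is left open here.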
The distribution consists of 5 files:
- ISI_chi_eng_parallel_corpus.chi, ISI_chi_eng_parallel_corpus.eng: files containing the
parallel text.
- ISI_chi_eng_parallel_corpus.chi.doc, ISI_chi_eng_parallel_corpus.eng.doc:
metadata files indicating, for each sentence in the parallel corpus, the ID of
the document from which it originated (the IDs are those used in the Gigaword 2 corpora).
- ISI_chi_eng_parallel_corpus.score: metadata file indicating a confidence
score for each sentence pair in the corpus. For sentence pairs that overlap
with the standard MT test sets, the score is stored negated (that is, zero
minus the original score).
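
As an illustration of how the five files fit together, the following Python
sketch reads them in lockstep and yields one record per sentence pair. It
assumes that each file holds one entry per line and that line i of every file
describes the same sentence pair; the encoding parameter defaults to UTF-8 as
an assumption and should be set to whatever encoding the Gigaword source texts
use. The names SentencePair and read_corpus are hypothetical.

```python
from dataclasses import dataclass
from itertools import zip_longest
from pathlib import Path
from typing import Iterator

PREFIX = "ISI_chi_eng_parallel_corpus"
SUFFIXES = (".chi", ".eng", ".chi.doc", ".eng.doc", ".score")


@dataclass(frozen=True)
class SentencePair:
    chi: str
    eng: str
    chi_doc: str
    eng_doc: str
    stored_score: float  # negative if the pair overlaps with an MT test set

    @property
    def overlaps_mt_test_set(self) -> bool:
        return self.stored_score < 0

    @property
    def confidence(self) -> float:
        # The stored value is "zero minus the original score" for overlaps.
        return abs(self.stored_score)


def read_corpus(directory: str, encoding: str = "utf-8") -> Iterator[SentencePair]:
    """Yield sentence pairs, assuming the five files are line-aligned."""
    paths = [Path(directory) / (PREFIX + suffix) for suffix in SUFFIXES]
    handles = [open(path, encoding=encoding) for path in paths]
    try:
        for row in zip_longest(*handles):
            if any(field is None for field in row):
                raise ValueError("the five files have different line counts")
            chi, eng, chi_doc, eng_doc, score = (f.rstrip("\r\n") for f in row)
            yield SentencePair(chi, eng, chi_doc, eng_doc, float(score))
    finally:
        for handle in handles:
            handle.close()
```

A training set that excludes the marked pairs could then be built with, for
example, [p for p in read_corpus("path/to/corpus") if not p.overlaps_mt_test_set],
where the directory path is a placeholder.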
Below are several example sentence pairs from the corpus, together with their
confidence scores.
他指责中国“至今不愿采取必要措施保护美国的知识产权”,并称中国拒绝对美国电影、录相及音像产品开放市场。
Kantor is quoted as saying that "to date,
(Confidence: 0.934114)
外经贸部发言人就此向记者发表谈话指出,美方这一做法是不顾中国在保护知识产权方面所取得的重大进展以及中国政府在双边磋商中表现出的诚意和灵活性的行为。
With such an intention in mind, the
(Confidence: 0.819396)
国家旅游局局长刘毅在致辞中介绍说,改革开放的中国,旅游业象朝阳一样,较快地发展成为新兴的重点产业。
Liu noted that
(Confidence: 0.748064)
(记者余瑛瑞)以厦门高崎机场为核心组建的厦门国际航空港集团有限公司今天正式挂牌,这是继深圳之后中国大陆第二个国际航空港集团公司。
(Group) Ltd., the second of its kind in
(Confidence: 0.675459)
美国提出的报复清单是中国政府绝对不能接受的。
And the Chinese side would certainly not accept the unreasonable demands put
forward by the Americans concerning the protection of intellectual property
rights.
(Confidence: 0.530623)
To illustrate the file format, we list below the first 5 lines from each file
in the distribution:
ISI_chi_eng_parallel_corpus.eng
Trade Representative Mickey Kantor announced
Saturday that
Kantor is quoted as saying that "to date,
With such an intention in mind, the
And the Chinese side would certainly not accept the unreasonable demands put
forward by the Americans concerning the protection of intellectual property
rights.
In fact,
ISI_chi_eng_parallel_corpus.eng.doc
XIN_ENG_19950101.0059
XIN_ENG_19950101.0066
XIN_ENG_19950101.0066
XIN_ENG_19950101.0066
XIN_ENG_19950101.0066
ISI_chi_eng_parallel_corpus.chi
(记者应谦)美国贸易代表坎特31日单方面宣布,如果中国在明年2月4日之前不能满足美方提出的有关保护美国知识产权的要求,美国将对中国实行贸易制裁。
他指责中国“至今不愿采取必要措施保护美国的知识产权”,并称中国拒绝对美国电影、录相及音像产品开放市场。
外经贸部发言人就此向记者发表谈话指出,美方这一做法是不顾中国在保护知识产权方面所取得的重大进展以及中国政府在双边磋商中表现出的诚意和灵活性的行为。
美国提出的报复清单是中国政府绝对不能接受的。
他说,中国用短短十几年时间,完成了一些发达国家通常需要几十年甚至上百年才能完成的立法路程,建立了比较完整的知识产权保护体系。
ISI_chi_eng_parallel_corpus.chi.doc
XIN_CMN_19950101.0001
XIN_CMN_19950101.0001
XIN_CMN_19950101.0002
XIN_CMN_19950101.0002
XIN_CMN_19950101.0002
ISI_chi_eng_parallel_corpus.score
0.807777
0.934114
0.819396
0.530623
-0.969935
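
The sample lines above can be combined as follows. This Python sketch assumes
that line i of the .chi.doc, .eng.doc and .score files refers to the same
sentence pair; it separates the pairs marked as overlapping with the MT test
sets (negative score) from the rest, and counts how many sentence pairs each
document pair contributes. The variable names are illustrative only.

```python
from collections import Counter

# The five sample lines listed above, copied verbatim.
chi_docs = [
    "XIN_CMN_19950101.0001",
    "XIN_CMN_19950101.0001",
    "XIN_CMN_19950101.0002",
    "XIN_CMN_19950101.0002",
    "XIN_CMN_19950101.0002",
]
eng_docs = [
    "XIN_ENG_19950101.0059",
    "XIN_ENG_19950101.0066",
    "XIN_ENG_19950101.0066",
    "XIN_ENG_19950101.0066",
    "XIN_ENG_19950101.0066",
]
scores = [0.807777, 0.934114, 0.819396, 0.530623, -0.969935]

# Assumption: the i-th entries of the three lists describe the same pair.
rows = list(zip(chi_docs, eng_docs, scores))

# A negative stored score marks an overlap with the standard MT test sets.
held_out = [row for row in rows if row[2] < 0]
usable = [row for row in rows if row[2] >= 0]

# Number of extracted sentence pairs per (Chinese doc, English doc) pair.
doc_pair_counts = Counter((chi, eng) for chi, eng, _ in rows)
```

A high count for a document pair means only that several sentence pairs were
extracted from it; it does not by itself show that the two documents are
parallel.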