ISI Arabic-English Automatically Extracted Parallel Corpus
This distribution contains a corpus of Arabic-English parallel sentences, which
were extracted automatically from two monolingual corpora: Arabic Gigaword 2
(LDC catalog number LDC2006T02) and English Gigaword 2 (LDC catalog number
LDC2005T12). The data was extracted from news articles published by the Xinhua
News and Agence France Presse news agencies and was obtained using the
automatic parallel sentence identification method described in the following
publication:
Dragos Stefan Munteanu, Daniel Marcu, 2005. Improving
Machine Translation Performance by Exploiting Non-parallel Corpora,
Computational Linguistics, 31(4):477-504 (a preliminary version can be found at
http://www.isi.edu/~dragos/Docs/MunteanuMarcu_CL_2005.pdf, and the final
version at http://portal.acm.org/citation.cfm?id=1110825.1110828)
The corpus contains 1,124,609 sentence pairs; the word count on the English
side is approximately 31M words. The sentences in the parallel corpus preserve
the form and encoding of the texts in the original Gigaword
corpora.
For each sentence pair in the corpus we provide the names of the documents from
which the two sentences were extracted, as well as a confidence score (between
0.5 and 1.0) that indicates their degree of parallelism. The parallel sentence
identification approach is designed to judge sentence pairs in isolation from
their contexts, and can therefore find parallel sentences within document pairs
which are not parallel. The fact that two documents share several parallel
sentences does not necessarily mean the documents are parallel.
In order to make this resource useful for research in Machine Translation, we
made efforts to detect potential overlaps between this data and the standard
test and development data sets used by the MT community. The NIST 2002-2005 MT
evaluation data sets contain several articles from Xinhua
News and Agence France Presse.
Sentence pairs in our distribution that have a 7-gram overlap with a sentence
pair in a NIST MT evaluation set, or that come from documents whose names are
similar to those in the NIST MT sets, are marked with a negative confidence
score.
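The 7-gram overlap test can be sketched roughly as follows. This is only an
illustration, not the procedure actually used to build the distribution:
whitespace tokenization, checking a single side of the pair, and the function
names ngrams, overlaps_test_set and mark_score are all assumptions.

```python
def ngrams(tokens, n=7):
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlaps_test_set(sentence, test_sentences, n=7):
    """True if the sentence shares at least one n-gram with any test sentence.

    Assumption: plain whitespace tokenization; the tokenization used for the
    actual corpus is not specified in this document.
    """
    test_ngrams = set()
    for test_sentence in test_sentences:
        test_ngrams |= ngrams(test_sentence.split(), n)
    return not ngrams(sentence.split(), n).isdisjoint(test_ngrams)


def mark_score(score, is_overlap):
    """Negate the confidence score of a pair flagged as overlapping."""
    return -score if is_overlap else score
```

A sentence shorter than n tokens has no n-grams and is therefore never flagged
by this sketch.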
The distribution consists of 5 files:
- ISI_ara_eng_parallel_corpus.ara, ISI_ara_eng_parallel_corpus.eng: files containing the
parallel text.
- ISI_ara_eng_parallel_corpus.ara.doc, ISI_ara_eng_parallel_corpus.eng.doc:
metadata files indicating, for each sentence in the parallel corpus, the ID of
the document from which it originated (the IDs are those used in the Gigaword 2 corpora).
- ISI_ara_eng_parallel_corpus.score: metadata file indicating a confidence score
for each sentence pair in the corpus. For sentence pairs that overlap with the
standard MT test sets, the scores are negative numbers (that is, zero minus the
original score).
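As a rough sketch of how the five files might be read together, the Python
below walks them in lockstep and separates the sign of the score from its
magnitude. It assumes that each file holds one entry per line and that line i
of every file refers to the same sentence pair, and it assumes UTF-8 encoding;
the function names read_pairs and training_pairs are hypothetical.

```python
from contextlib import ExitStack

SUFFIXES = (".ara", ".eng", ".ara.doc", ".eng.doc", ".score")


def read_pairs(prefix="ISI_ara_eng_parallel_corpus", encoding="utf-8"):
    """Yield one dict per sentence pair, reading the five files in lockstep.

    Assumptions: one entry per line, line-aligned across files, UTF-8 text.
    """
    with ExitStack() as stack:
        handles = [stack.enter_context(open(prefix + s, encoding=encoding))
                   for s in SUFFIXES]
        for ara, eng, ara_doc, eng_doc, score in zip(*handles):
            value = float(score)
            yield {
                "ara": ara.rstrip("\n"),
                "eng": eng.rstrip("\n"),
                "ara_doc": ara_doc.strip(),
                "eng_doc": eng_doc.strip(),
                # A negative score is zero minus the original score.
                "confidence": abs(value),
                "overlaps_mt_test": value < 0,
            }


def training_pairs(pairs, min_confidence=0.5):
    """Keep pairs that do not overlap MT test sets and meet the threshold."""
    return [p for p in pairs
            if not p["overlaps_mt_test"] and p["confidence"] >= min_confidence]
```

For example, training_pairs(read_pairs(), min_confidence=0.8) would keep only
the higher-confidence pairs that are not flagged as overlapping with the MT
test sets.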
Below are several example sentence pairs from the corpus, together with their
confidence scores.
وقال الوزراء ان توقف العمليات العسكرية يجب ان يرافقه فصل للقوات وسحب الاسلحة الثقيلة وانتشار قوة فصل من جنود الامم المتحدة.
"The agreement on cessation of hostilities must include the
separation of forces, the withdrawal of heavy weapons and the interposition of
UNPROFOR troops.
(Confidence: 0.973045)
ويشار الى ان خطة التسوية الاوروبية تنص على منح المسلمين والكروات 51% من اراضي البلاد ومنح الصرب 49% من الاراضي مع اعلم انهم يسيطرون حاليا على 70% من اراضي البلاد.
An existing European plan gives 51 percent of Bosnia-Hercegovina
to the Croats and Moslems, and 49 percent for the Serbs, who through their
war-gains currently control about 70 percent.
(Confidence: 0.840605)
واكد المتحدث ان "لا بد من محاسبة كل من البيض وهيثم قاسم طاهر او مغادرتهم البلاد" موضحا ان "لا حاجة الى تشكيل حكومة انقاذ وطني لان هناك حكومة شرعية قائمة".
But the northern spokesman said there was "no need to form a national
unity government, since the legitimate government is already in place."
(Confidence: 0.773351)
ودعا الحزب الاشتراكي اليمني في مبادرته ايضا الى "الفصل بين القوات المتواجهة وسحب القوات الى مواقعها السابقة قبل الحرب" بهدف "صيانة ما تبقى من القوات المسلحة".
Baid's Yemen Socialist Party (YSP) proposed the two
armies should separate and be "withdrawn to the positions they held before
the war."
(Confidence: 0.679145)
يلتقي وفد من جامعة الدول العربية في صنعاء اليوم الجمعة الرئيس اليمني علي عبد الله صالح لمحاولة اقناعه بالموافقة على وقف لاطلاق النار في المعارك مع خصومه الجنوبيين.
Meanwhile, the Arab League held a meeting with Yemeni leaders in Sanaa to try to broker a ceasefire between Saleh and Baid.
(Confidence: 0.5093)
To illustrate the file formats, we list below the first 5 lines from each file
in the distribution:
ISI_ara_eng_parallel_corpus.eng
"It means that we are in mourning, or that we have given up land to the enemy."
An existing European plan gives 51 percent of Bosnia-Hercegovina to the Croats and Moslems, and 49 percent for the Serbs, who through their war-gains currently control about 70 percent.
He spoke after representatives of the 51-state Organization of the Islamic Conference (OIC) met in urgent session in
One-third of the load was allegedly delivered to
"This is the first weapons-related convoy from the highest level since the war began.
ISI_ara_eng_parallel_corpus.eng.doc
AFP_ENG_19940512.0212
AFP_ENG_19940513.0090
AFP_ENG_19940513.0090
AFP_ENG_19940514.0033
AFP_ENG_19940512.0138
ISI_ara_eng_parallel_corpus.ara
وقال احد المستوطنين الثلاثة "اننا نمزق ثيابنا احتراما لتقاليدنا وهذا يعني اننا في حداد او اننا تخلينا عن ارضنا للعدو".
ويشار الى ان خطة التسوية الاوروبية تنص على منح المسلمين والكروات 51% من اراضي البلاد ومنح الصرب 49% من الاراضي مع اعلم انهم يسيطرون حاليا على 70% من اراضي البلاد.
وقبيل افتتاح الاجتماع اعرب سفراء 51 دولة في منظمة الموءتمر الاسلامي عن اسفهم "لابعاد" المنظمة عن المفاوضات الجارية حاليا واكدوا من جديد على مبدأ سلامة ووحدة اراضي البوسنة والهرسك.
واوضحت الصحيفة ان ثلث الحمولة كان من حصة كرواتيا والثلثين الباقيين نقلا الى المسلمين فى شاحنات حكومية.
ونقلت الصحيفة عن مصدر عسكري بوسني "انها اول قافلة اسلحة معروفة بهذه الاهمية منذ بدء الحرب.
ISI_ara_eng_parallel_corpus.ara.doc
AFP_ARB_19940513.0001
AFP_ARB_19940513.0003
AFP_ARB_19940513.0003
AFP_ARB_19940513.0004
AFP_ARB_19940513.0004
ISI_ara_eng_parallel_corpus.score
0.977116
0.840605
0.691467
0.904404
0.851603