ISI Arabic-English Automatically Extracted Parallel Corpus

This distribution contains a corpus of Arabic-English parallel sentences extracted automatically from two monolingual corpora: Arabic Gigaword 2 (LDC catalog number LDC2006T02) and English Gigaword 2 (LDC catalog number LDC2005T12). The data comes from news articles published by the Xinhua News Agency and Agence France Presse, and was obtained using the automatic parallel sentence identification method described in the following publication:
Dragos Stefan Munteanu, Daniel Marcu, 2005. Improving Machine Translation Performance by Exploiting Non-parallel Corpora, Computational Linguistics, 31(4):477-504 (a preliminary version can be found at http://www.isi.edu/~dragos/Docs/MunteanuMarcu_CL_2005.pdf, and the final version at http://portal.acm.org/citation.cfm?id=1110825.1110828)

The corpus contains 1,124,609 sentence pairs; the word count on the English side is approximately 31M words. The sentences in the parallel corpus preserve the form and encoding of the texts in the original Gigaword corpora.

For each sentence pair in the corpus we provide the names of the documents from which the two sentences were extracted, as well as a confidence score (between 0.5 and 1.0) that indicates their degree of parallelism. The parallel sentence identification approach judges sentence pairs in isolation from their contexts, and can therefore find parallel sentences within document pairs that are not parallel. Consequently, the fact that two documents share several parallel sentences does not necessarily mean the documents themselves are parallel.

To make this resource useful for Machine Translation research, we tried to detect potential overlaps between this data and the standard test and development data sets used by the MT community. The NIST 2002-2005 MT evaluation data sets contain several articles from Xinhua News and Agence France Presse. A sentence pair in our distribution is marked with a negative confidence score if it has a 7-gram overlap with a sentence pair in a NIST MT evaluation set, or if it comes from a document whose name is similar to one in the NIST MT sets.
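
The 7-gram overlap test can be illustrated with the short Python sketch below. It is not the procedure actually used to build this distribution: the tokenization (whitespace splitting), the exact-match criterion, and the function names ngrams, build_index and overlaps are all assumptions made for illustration, and the check is shown for one side of the text only.

```python
def ngrams(tokens, n=7):
    """Return the set of all contiguous n-token tuples in a token list."""
    # For sentences shorter than n tokens the range is empty, so the set is empty.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(test_sentences, n=7):
    """Collect every n-gram that occurs in any test-set sentence."""
    index = set()
    for sentence in test_sentences:
        index |= ngrams(sentence.split(), n)  # assumption: whitespace tokens
    return index


def overlaps(sentence, index, n=7):
    """True if the sentence shares at least one n-gram with the index."""
    return not ngrams(sentence.split(), n).isdisjoint(index)
```

A sentence shorter than seven tokens can never be flagged by this check, which is one reason a document-name comparison is a useful complement.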

The distribution consists of 5 files:
- ISI_ara_eng_parallel_corpus.ara, ISI_ara_eng_parallel_corpus.eng: files containing the parallel text.
- ISI_ara_eng_parallel_corpus.ara.doc, ISI_ara_eng_parallel_corpus.eng.doc: metadata files indicating, for each sentence in the parallel corpus, the ID of the document from which it originated (the IDs are those used in the Gigaword 2 corpora).
- ISI_ara_eng_parallel_corpus.score: metadata file indicating a confidence score for each sentence pair in the corpus. For sentence pairs that overlap with the standard MT test sets, the score is negated (that is, stored as zero minus the original score).
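
Because the five files are aligned line by line, they can be read in parallel. The Python sketch below shows one way to do this and to decode the negative-score convention. The names Pair, parse_score and read_corpus are hypothetical, and the UTF-8 default is an assumption: the texts keep the encoding of the original Gigaword corpora, so adjust the encoding argument if your copy differs.

```python
from contextlib import ExitStack
from typing import NamedTuple

SUFFIXES = (".ara", ".eng", ".ara.doc", ".eng.doc", ".score")


class Pair(NamedTuple):
    arabic: str
    english: str
    arabic_doc: str
    english_doc: str
    confidence: float     # original (positive) score
    overlaps_test: bool   # True if the stored score was negative


def parse_score(text):
    """Split a stored score into (original score, overlap flag)."""
    value = float(text)
    return abs(value), value < 0


def read_corpus(prefix="ISI_ara_eng_parallel_corpus", encoding="utf-8"):
    """Yield one Pair per line, reading the five files in lockstep."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(prefix + s, encoding=encoding))
                 for s in SUFFIXES]
        # zip stops at the shortest file; the files are expected to have
        # the same number of lines.
        for ara, eng, ara_doc, eng_doc, score in zip(*files):
            confidence, flagged = parse_score(score)
            yield Pair(ara.rstrip("\n"), eng.rstrip("\n"),
                       ara_doc.strip(), eng_doc.strip(),
                       confidence, flagged)
```

For example, [p for p in read_corpus() if not p.overlaps_test and p.confidence >= 0.8] would keep only high-confidence pairs that were not flagged as overlapping with the MT test sets; the 0.8 threshold is arbitrary.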



Below are several example sentence pairs from the corpus, together with their confidence scores.

وقال الوزراء ان توقف العمليات العسكرية يجب ان يرافقه فصل للقوات وسحب الاسلحة الثقيلة وانتشار قوة فصل من جنود الامم المتحدة.
The agreement on cessation of hostilities must include the separation of forces, the withdrawal of heavy weapons and the interposition of UNPROFOR troops.
(Confidence: 0.973045)
 
ويشار الى ان خطة التسوية الاوروبية تنص على منح المسلمين والكروات 51% من اراضي البلاد ومنح الصرب 49% من الاراضي مع اعلم انهم يسيطرون حاليا على 70% من اراضي البلاد.
An existing European plan gives 51 percent of Bosnia-Hercegovina to the Croats and Moslems, and 49 percent for the Serbs, who through their war-gains currently control about 70 percent.
(Confidence: 0.840605)

واكد المتحدث ان "لا بد من محاسبة كل من البيض وهيثم قاسم طاهر او مغادرتهم البلاد" موضحا ان "لا حاجة الى تشكيل حكومة انقاذ وطني لان هناك حكومة شرعية قائمة".
But the northern spokesman said there was "no need to form a national unity government, since the legitimate government is already in place."
(Confidence: 0.773351)

ودعا الحزب الاشتراكي اليمني في مبادرته ايضا الى "الفصل بين القوات المتواجهة وسحب القوات الى مواقعها السابقة قبل الحرب" بهدف "صيانة ما تبقى من القوات المسلحة".
Baid's Yemen Socialist Party (YSP) proposed the two armies should separate and be "withdrawn to the positions they held before the war."
(Confidence: 0.679145)

يلتقي وفد من جامعة الدول العربية في صنعاء اليوم الجمعة الرئيس اليمني علي عبد الله صالح لمحاولة اقناعه بالموافقة على وقف لاطلاق النار في المعارك مع خصومه الجنوبيين.
Meanwhile, the Arab League held a meeting with Yemeni leaders in Sanaa to try to broker a ceasefire between Saleh and Baid.
(Confidence: 0.5093)


To illustrate the file formats, we list below the first 5 lines from each file in the distribution:

ISI_ara_eng_parallel_corpus.eng
"It means that we are in mourning, or that we have given up land to the enemy."
An existing European plan gives 51 percent of Bosnia-Hercegovina to the Croats and Moslems, and 49 percent for the Serbs, who through their war-gains currently control about 70 percent.
He spoke after representatives of the 51-state Organization of the Islamic Conference (OIC) met in urgent session in Geneva and expressed "deep concerns" at the state of peace negotiations for Bosnia.
One-third of the load was allegedly delivered to Croatia and the rest transported by government trucks to the Moslems in Bosnia.
"This is the first weapons-related convoy from the highest level since the war began.

ISI_ara_eng_parallel_corpus.eng.doc
AFP_ENG_19940512.0212
AFP_ENG_19940513.0090
AFP_ENG_19940513.0090
AFP_ENG_19940514.0033
AFP_ENG_19940512.0138

ISI_ara_eng_parallel_corpus.ara
وقال احد المستوطنين الثلاثة "اننا نمزق ثيابنا احتراما لتقاليدنا وهذا يعني اننا في حداد او اننا تخلينا عن ارضنا للعدو".
ويشار الى ان خطة التسوية الاوروبية تنص على منح المسلمين والكروات 51% من اراضي البلاد ومنح الصرب 49% من الاراضي مع اعلم انهم يسيطرون حاليا على 70% من اراضي البلاد.
وقبيل افتتاح الاجتماع اعرب سفراء 51 دولة في منظمة الموءتمر الاسلامي عن اسفهم "لابعاد" المنظمة عن المفاوضات الجارية حاليا واكدوا من جديد على مبدأ سلامة ووحدة اراضي البوسنة والهرسك.
واوضحت الصحيفة ان ثلث الحمولة كان من حصة كرواتيا والثلثين الباقيين نقلا الى المسلمين فى شاحنات حكومية.
ونقلت الصحيفة عن مصدر عسكري بوسني "انها اول قافلة اسلحة معروفة بهذه الاهمية منذ بدء الحرب.

ISI_ara_eng_parallel_corpus.ara.doc
AFP_ARB_19940513.0001
AFP_ARB_19940513.0003
AFP_ARB_19940513.0003
AFP_ARB_19940513.0004
AFP_ARB_19940513.0004

ISI_ara_eng_parallel_corpus.score
0.977116
0.840605
0.691467
0.904404
0.851603