Corpus Documentation for Arabic Treebank: Part 1 - 10K-word English Translation

Project Goal:

To support the development of data-driven approaches to natural language processing (NLP), machine translation, human language technologies, cross-lingual information retrieval, and other forms of linguistic research on Modern Standard Arabic, the LDC was sponsored to develop this corpus of 10K Arabic words translated into English.

Source Data:

The project targets the translation of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire archives for July 2000 (files dated 20000715). The corpus consists of 49 source stories, a subset of the 734 stories in Arabic Treebank: Part 1 v 2.0 (LDC catalog number LDC2003T06).

Summary of source data (including headlines):

    49 files
    418 paragraphs
    9981 words

Data Format:

The source data and the translation are both stored in SGML format. For more details, please consult
http://www.ldc.upenn.edu/Projects/TIDES/Translation/Arabic/Final_Data_Format_MTA_Corpora.pdf

All files have been validated using the DTD provided in docs/mt.dtd. The only file that is not compliant with this DTD is 20000715_AFP_ARB.0034.sgm, because it doesn't have a deadline.

Human Translation Procedure:

All 49 stories in this corpus were selected from the 734 stories in the Arabic Treebank: Part 1 v 2.0 (LDC2003T06) corpus. The stories were translated at the paragraph level and then verified and corrected by different annotators.

In general, the Arabic and English texts are aligned sentence to sentence. However, in some cases one Arabic sentence was translated into multiple English sentences (16 occurrences), and in others two Arabic sentences were translated into a single English sentence (2 occurrences). For the 18 affected paragraphs, out of the 418 in the corpus, only paragraph-to-paragraph alignment is provided. Details on the 18 occurrences can be found in docs/record.
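The alignment figures in the Human Translation Procedure section can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function name alignment_summary and its dictionary keys are hypothetical, and it assumes that each of the 18 reported occurrences falls in a different paragraph, which is what the matching counts (18 occurrences, 18 paragraphs) suggest.

```python
# Figures taken from the corpus documentation.
TOTAL_PARAGRAPHS = 418
ONE_TO_MANY = 16  # one Arabic sentence -> multiple English sentences
MANY_TO_ONE = 2   # two Arabic sentences -> one English sentence


def alignment_summary(total, one_to_many, many_to_one):
    """Summarize how many paragraphs have each kind of alignment.

    Assumption (not stated outright in the documentation): every
    occurrence falls in a different paragraph.
    """
    paragraph_only = one_to_many + many_to_one
    sentence_aligned = total - paragraph_only
    return {
        "paragraph_only": paragraph_only,
        "sentence_aligned": sentence_aligned,
        "paragraph_only_share": paragraph_only / total,
    }


summary = alignment_summary(TOTAL_PARAGRAPHS, ONE_TO_MANY, MANY_TO_ONE)
print(summary["paragraph_only"], summary["sentence_aligned"])
print(f"{summary['paragraph_only_share']:.1%}")
```

Under that assumption, 18 paragraphs carry paragraph-level alignment only and the remaining 400 are aligned sentence to sentence, so roughly 4.3% of the corpus lacks sentence-level alignment.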

Ann Bies, bies@ldc.upenn.edu
Moussa Bamba, mbamba@ldc.upenn.edu
Hubert Jin, hubertj@ldc.upenn.edu
Mohamed Maamouri, maamouri@ldc.upenn.edu
Xiao Ma, xma@ldc.upenn.edu

January 30, 2003