Authors: Mohamed Maamouri (Project head), Ann Bies, Tim Buckwalter, Hubert Jin

Annotators: Zohra Bentaouit, Noureddine Bessaidi, Fatima Chebchoub, Rachida Fathallah, Niama Laadioui

PROJECT GOAL

To support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general, the LDC was sponsored to develop an Arabic POS corpus and Treebank of 1,000,000 words. This corpus is part three of that project. This release provides annotation for part of speech (POS), gloss, and word segmentation.

SOURCE DATA

For this "Arabic Treebank: Part 3 v 1.0" corpus (ATB3), we selected text from An Nahar News Agency in the GIGAWORD ARABIC text corpus published by LDC in 2003 (LDC2003T12). For more details, please see: readme-arabic-gigaword.txt.

There are 600 stories (specified by the DOC Id) in this corpus, dated the 15th of each month from January through December 2002. The average number of words per story is around 567, and there are a total of 340,281 words/tokens.

LEXICON

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.

As in the past, we used Tim Buckwalter's morphological analyzer to generate the candidate list of POS values for each word/token, and our annotators manually picked the appropriate one. The coverage of Tim Buckwalter's morphological analyzer on this corpus is listed in ATB3_Coverage_Statistics.txt.

ANNOTATION PROCEDURE

We did stand-off annotation on the data. The sgm files are read-only after collection and processing. POS annotation is done only on the text under the

tag. Unlike the previous Arabic treebank releases "Arabic Treebank: Part 1 v 2.0" [LDC2003T06] and "Arabic Treebank: Part 2 v 2.0" [LDC2004T02], headlines are annotated in this corpus (ATB3).

DIRECTORY STRUCTURE

In the data/ directory, you will find the following:

  sgm - Processed source files in sgml format. Please note that there is a parallel text corpus being developed at LDC for these same 600 source files.
  xml - The AG xml files containing the POS annotation. The dtd files for the AG format are also included there. The xml files are compressed.
  pos - POS annotation output in plain text.

For each of the files in docs/doclist.txt, there are:

  *.sgm files in data/sgm
      Arabic in utf-8.
  *.xml.gz files in data/xml
      Annotation Graph (AG) based annotation xml files with Tim Buckwalter's lexicon. POS annotators worked on the xml files using LDC-developed tools.
  *.xml.txt files in data/pos
      POS output in ASCII, except for the Arabic words, which are in utf-8.
      Note: This output is from tokens before clitic separation.

Ann Bies, bies@ldc.upenn.edu
Tim Buckwalter, timbuck2@ldc.upenn.edu
Hubert Jin, hubertj@ldc.upenn.edu
Mohamed Maamouri, maamouri@ldc.upenn.edu

April 20, 2004
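
The layout described under DIRECTORY STRUCTURE can be sketched in Python as follows. This is a minimal illustration, not an LDC tool. It assumes that docs/doclist.txt holds one document name per line, that each name is the shared basename of the three per-document files, and that the compressed xml files are gzip archives of utf-8 text. None of these details is confirmed by this README beyond the *.sgm, *.xml.gz, and *.xml.txt patterns, and the function names are hypothetical.

```python
import gzip
from pathlib import Path


def read_doclist(doclist_path):
    """Return the document names listed in doclist.txt.

    Assumption: one name per line; blank lines are skipped.
    """
    with open(doclist_path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def annotation_paths(data_dir, doc_id):
    """Build the three per-document paths under data/.

    Assumption: doc_id is the basename shared by all three files.
    """
    data_dir = Path(data_dir)
    return {
        "sgm": data_dir / "sgm" / f"{doc_id}.sgm",      # source text, Arabic in utf-8
        "xml": data_dir / "xml" / f"{doc_id}.xml.gz",   # compressed AG annotation
        "pos": data_dir / "pos" / f"{doc_id}.xml.txt",  # plain-text POS output
    }


def read_xml_text(xml_gz_path):
    """Decompress one annotation file and return its contents as a string.

    Assumption: the file is gzip-compressed utf-8 text.
    """
    with gzip.open(xml_gz_path, "rt", encoding="utf-8") as fh:
        return fh.read()
```

Under those assumptions, calling annotation_paths("data", name) for each name returned by read_doclist("docs/doclist.txt") gives the source, annotation, and POS output files for every story.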