TITLE: Arabic Treebank: Part 4 v1.0 (MPG Annotation)

Authors: Mohamed Maamouri (Project head), Ann Bies, Tim Buckwalter, Hubert Jin
Annotators: Fatima Chebchoub, Rachida Fathallah, Dalila Tabessi

PROJECT GOAL

To support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general, the LDC was sponsored to develop an Arabic POS and Treebank corpus of 1,000,000 words. This corpus is part four of that project. In this release, we provide annotation for part of speech (POS), gloss, and word segmentation.

SOURCE DATA

For this "Arabic Treebank: Part 4 v 1.0" corpus (ATB4), we selected text from Assabah, a Tunisian daily newspaper written in Modern Standard Arabic. It is published in Tunis, Tunisia, by the "Assabah Press Group", which is owned by the Cheikhrouhou family. There are 397 stories (identified by DOC Id) in this corpus, dated from September to November 2004. The average number of words per story is slightly above 400, and the corpus contains a total of 161,915 words/tokens. Files relating to sports, financial data, and other domains such as horoscopes were not kept in the corpus.

LEXICON

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html. As in the past, we used Tim Buckwalter's morphological analyzer to generate a candidate list of POS values for each word/token, and our annotators manually picked the appropriate one. The coverage of Tim Buckwalter's morphological analyzer on this corpus is reported in ATB4_Coverage_Statistics.doc.

ANNOTATION PROCEDURE

We did stand-off annotation on the data. The sgm files are read-only after collection and processing. POS annotation is done only on the text under the

tag. As in "Arabic Treebank: Part 3 v 1.0" [LDC2004T11], headlines are also annotated in this corpus. Tim Buckwalter's lexicon and morphological analyzer were used to generate a candidate list of POS tags for each word. (Please note that some words do not exist in this lexicon.) The POS task is just to select the correct POS tag.

DIRECTORY STRUCTURE

In the data/ directory, you will find the following:

  sgm - Processed source files in SGML format. Please note that there is a parallel text corpus being developed at LDC for these same 397 source files.
  xml - The AG xml files containing the POS annotation. The dtd files for the AG format are also included there. The xml files are compressed.
  pos - POS annotation output in plain text.

For each of the files in docs/doclist, there are:

  *.sgm file in data/sgm
      Arabic in utf-8.
  *.xml.gz file in data/AG_xml
      Annotation Graph (AG) based annotation xml file with Tim Buckwalter's lexicon. POS annotators worked on the xml files using LDC-developed tools.
  *.txt file in data/pos
      POS output in ASCII, except for the Arabic words, which are in utf-8.
      Note: This output is from tokens before clitic separation.

Ann Bies, bies@ldc.upenn.edu
Tim Buckwalter, timbuck2@ldc.upenn.edu
Hubert Jin, hubertj@ldc.upenn.edu
Mohamed Maamouri, maamouri@ldc.upenn.edu

MAY 20, 2005
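
The per-document layout described under DIRECTORY STRUCTURE can be sketched in Python as follows. This is a minimal illustration and not part of the LDC tooling. The function names are hypothetical. The sketch assumes that docs/doclist holds one document id per line, that the compressed annotation files live in data/AG_xml (the directory named in the per-file listing), and that the xml files are utf-8 encoded. None of these assumptions is confirmed by this README.

```python
import gzip
from pathlib import Path


def expected_files(doc_id, root="."):
    """Build the three per-document paths listed in DIRECTORY STRUCTURE.

    Assumption: the xml files live in data/AG_xml, as in the per-file listing.
    """
    data = Path(root) / "data"
    return {
        "sgm": data / "sgm" / f"{doc_id}.sgm",         # source text, Arabic in utf-8
        "xml": data / "AG_xml" / f"{doc_id}.xml.gz",   # compressed AG annotation
        "pos": data / "pos" / f"{doc_id}.txt",         # plain-text POS output
    }


def read_doclist(path):
    """Read document ids, skipping blank lines.

    Assumption: one id per line; the real doclist format is not documented here.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def read_annotation_xml(path):
    """Decompress a *.xml.gz file and return its content as text.

    Assumption: the xml is utf-8 encoded.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

A typical use would be to loop over `read_doclist("docs/doclist")`, call `expected_files` for each id, and check that all three paths exist before processing a document.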