Authors: Mohamed Maamouri (Project head), Ann Bies, Tim Buckwalter, Hubert Jin
Annotators: Zohra Bentaouit, Noureddine Bessaidi, Fatima Chebchoub, 
            Rachida Fathallah, Niama Laadioui

PROJECT GOAL

The LDC was sponsored to develop an Arabic POS and Treebank corpus of
1,000,000 words. The goal is to support the development of data-driven
approaches to natural language processing (NLP), human language
technologies, automatic content extraction (topic extraction and/or grammar
extraction), cross-lingual information retrieval, information detection,
and other forms of linguistic research on Modern Standard Arabic in
general. This corpus is part three of that project. In this release, we
provide annotation for part of speech (POS), gloss, and word segmentation.


SOURCE DATA

For this "Arabic Treebank: Part 3 v 1.0" corpus (ATB3), we selected text 
from An Nahar News Agency in the GIGAWORD ARABIC text corpus published by 
LDC in 2003 (LDC2003T12). For more details, please see: readme-arabic-gigaword.txt.

There are 600 stories (identified by DOC Id) in this corpus, each dated
the 15th day of a month from January through December 2002.  The average
number of words per story is around 567, and there are a total of
340,281 words/tokens.

LEXICON

Tim Buckwalter's transliteration system, which we use for this corpus, is
described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.
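
As a small illustration of how the transliteration maps ASCII symbols to
Arabic letters, the sketch below converts a Buckwalter string to Arabic
script. It covers only a handful of symbols, not the full table at the URL
above; the names BUCKWALTER_SUBSET and buckwalter_to_arabic are made up for
this example and are not part of the release.

```python
# Illustrative subset of the Buckwalter transliteration table.
# Assumption: this is NOT the complete table; see the URL above for all symbols.
BUCKWALTER_SUBSET = {
    "A": "\u0627",  # alif
    "b": "\u0628",  # ba
    "t": "\u062A",  # ta
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # mim
    "n": "\u0646",  # nun
    "w": "\u0648",  # waw
    "y": "\u064A",  # ya
    "a": "\u064E",  # fatha (short a)
    "u": "\u064F",  # damma (short u)
    "i": "\u0650",  # kasra (short i)
}


def buckwalter_to_arabic(text):
    """Map each character through the subset table.

    Characters that are not in the subset are passed through unchanged.
    """
    return "".join(BUCKWALTER_SUBSET.get(ch, ch) for ch in text)
```

For example, buckwalter_to_arabic("ktAb") yields the four letters kaf, ta,
alif, ba (the unvocalized word for "book").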

As in previous releases, we used Tim Buckwalter's morphological analyzer to
generate the list of candidate POS values for each word/token, and our
annotators manually selected the appropriate one. The coverage of Tim
Buckwalter's morphological analyzer on this corpus is listed in
ATB3_Coverage_Statistics.txt.

ANNOTATION PROCEDURE

We used stand-off annotation for the data.  The sgm files are read-only
after collection and processing. POS annotation is done only on the text
under the <P> tag. Unlike the previous Arabic Treebank releases, "Arabic
Treebank: Part 1 v 2.0" [LDC2003T06] and "Arabic Treebank: Part 2 v 2.0"
[LDC2004T02], headlines are annotated in this corpus (ATB3).
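
For readers who want to look at the paragraph text that was annotated, the
following is a minimal sketch of pulling <P> blocks out of the content of
an sgm file. It assumes each paragraph is closed by a matching </P> tag,
which this README does not state; the function name paragraphs is
hypothetical. The sketch collapses whitespace, so it is suitable for
reading the text only, not for aligning with offsets in the stand-off
annotation.

```python
import re

# Assumption: paragraphs are delimited as <P> ... </P> in the sgm files.
P_BLOCK = re.compile(r"<P>(.*?)</P>", re.DOTALL | re.IGNORECASE)


def paragraphs(sgml_text):
    """Return the text of each <P> block with runs of whitespace collapsed."""
    return [" ".join(block.split()) for block in P_BLOCK.findall(sgml_text)]
```

The input is the decoded content of one sgm file, for example the result of
open(path, encoding="utf-8").read().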

DIRECTORY STRUCTURE

In the data/ directory, you will find the following:

    sgm - Processed source files in SGML format. Please note that there is a 
          parallel text corpus being developed at LDC for these same 600 
          source files.

    xml - The AG XML files containing the POS annotation. The DTD files for
          the AG format are also included there. The XML files are compressed.

    pos - POS annotation output in plain text.


For each of the files in docs/doclist.txt, there are:

*.sgm files in data/sgm
        Arabic in utf-8

*.xml.gz files in data/xml
        Annotation Graph (AG) based annotation XML files with Tim
        Buckwalter's lexicon. POS annotators worked on the XML
        files using LDC-developed tools.

*.xml.txt files in data/pos
        POS output in ASCII, except for the Arabic words, which are in UTF-8.
        Note: This output is from tokens before clitic separation.
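
The layout above can be walked with a short script. The sketch below is
only a suggestion: it assumes that docs/doclist.txt holds one file stem per
line and that the same stem is shared by the three files of a document,
neither of which is stated in this README. The function names are
hypothetical. The compressed XML is returned as raw bytes so that no
character encoding has to be assumed.

```python
import gzip
from pathlib import Path


def read_doclist(path="docs/doclist.txt"):
    """Return the non-empty lines of the document list (assumed one stem per line)."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def paths_for(stem, data_dir="data"):
    """Build the three per-document paths described in DIRECTORY STRUCTURE."""
    root = Path(data_dir)
    return {
        "sgm": root / "sgm" / f"{stem}.sgm",
        "xml": root / "xml" / f"{stem}.xml.gz",
        "pos": root / "pos" / f"{stem}.xml.txt",
    }


def read_ag_xml(path):
    """Decompress one *.xml.gz file and return its content as bytes."""
    with gzip.open(path, "rb") as fh:
        return fh.read()
```

A typical loop would call read_doclist(), then paths_for(stem) for each
entry, and pass the "xml" path to read_ag_xml() before handing the bytes to
an XML parser.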


Ann Bies, bies@ldc.upenn.edu
Tim Buckwalter, timbuck2@ldc.upenn.edu
Hubert Jin, hubertj@ldc.upenn.edu
Mohamed Maamouri, maamouri@ldc.upenn.edu
April 20, 2004