TITLE: Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis)

Authors: Mohamed Maamouri (Project head), Ann Bies, Tim Buckwalter, 
         Hubert Jin, Wigdan Mekki
Annotators: Wigdan Mekki (Lead Annotator), Tasneem Ghandour, Fatma Gaddeche

PROJECT GOAL

To support the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research on
Modern Standard Arabic in general, the LDC was sponsored to develop an
Arabic POS and Treebank corpus of 1,000,000 words.  This corpus is part
three of that project. In this release, we provide syntactic (treebank)
annotation in addition to annotation of part of speech (POS), gloss, and
word segmentation.


PROJECT AND CORPUS DESCRIPTION: Penn Arabic Treebank (ATB)

Treebanks are language resources that provide annotations of natural
languages at various levels of structure: at the word level, the phrase
level, and the sentence level. Treebanks have become crucially important
for the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research in
general. 

This corpus is designed for those who study and use languages either
professionally or academically, and who need text corpora in their
work. The Penn Arabic Treebank is particularly suitable for language
developers, computational linguists and computer scientists who are
interested in various aspects of natural language processing. 

The Penn Arabic Treebank, which is part of the DARPA TIDES project, started
in the Fall of 2001 with the objective of annotating a large machine-readable
Arabic text corpus through a combination of manual and automatic processing
(see project background at the following URL:
http://www.ldc.upenn.edu/Projects/TIDES/Arabic/data/POS/POStest.html).  As
in previous Penn Treebanks, two different kinds of information need to be
produced by two different (human and computer) processes. The Arabic
Treebank project consists therefore of two distinct phases: (a)
Part-of-Speech (=POS) tagging which divides the text into lexical tokens,
and gives relevant information about each token such as lexical category,
inflectional features, and a gloss (referred to as POS for convenience,
although it includes morphological and gloss information not traditionally
included with part-of-speech annotation), and (b) Arabic Treebanking
(=ArabicTB) which characterizes the constituent structures of word
sequences, provides categories for each non-terminal node, and identifies
null elements, co-reference, traces, etc.  

Both tasks started in November 2001 with an initial pilot consisting of 734
files representing roughly 166K words of written Modern Standard Arabic
newswire from the Agence France Presse corpus, which has since been
released as "Arabic Treebank: Part 1 v 3.0", LDC Catalog No. LDC2005T02.
After that, we finished and released the 168K-word corpus named "Arabic
Treebank: Part 2 v 2.0", LDC Catalog No. LDC2004T02.

The current Arabic Treebank: Part 3 corpus consists of 600 stories from
An Nahar News Agency. This corpus is also referred to as ANNAHAR. The new 
features include complete vocalization of all Imperfect Verb mood endings: 
Indicative, Subjunctive, and Jussive.

We released the POS-only corpus for ANNAHAR in 2004 under the catalog
number LDC2004T11 (Arabic Treebank: Part 3 v1.0). In addition to the
treebank annotation, this release (i.e. Arabic Treebank: Part 3 v2.0)
also includes the POS annotation from LDC2004T11.

SOURCE DATA

For this "Arabic Treebank: Part 3 v 2.0" corpus (ATB3-v.2), we selected text 
from An Nahar News Agency in the GIGAWORD ARABIC TEXT CORPUS published by 
LDC in 2003. For more details, please see: readme-arabic-gigaword.txt.

There are 600 stories (specified by the DOC Id) in this corpus, dated
the 15th day of each month from January to December 2002.  The
average number of words per story is around 567, and there are a total of
340,281 words/tokens. After clitic separation for treebank annotation, the
number of tokens rose to 400,213.
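
The figures above can be cross-checked with a few lines of Python. This is
only a sanity check on the counts quoted in this section; the variable names
are ours and are not part of the release.

```python
# Counts quoted in this README.
stories = 600
source_tokens = 340_281     # words/tokens before clitic separation
treebank_tokens = 400_213   # tokens after clitic separation

avg_per_story = source_tokens / stories
added_by_separation = treebank_tokens - source_tokens
growth = treebank_tokens / source_tokens

print(f"average tokens per story: {avg_per_story:.1f}")
print(f"tokens added by clitic separation: {added_by_separation}")
print(f"growth factor: {growth:.3f}")
```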

LEXICON

Tim Buckwalter's transliteration system, which we use for this corpus, is
described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.
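
As an illustration, the transliteration can be treated as a one-to-one
character mapping. The sketch below is a minimal version written from the
commonly published Buckwalter table, not taken from the tools used for this
corpus; the function names are hypothetical, and the table should be checked
against the URL above before being relied on.

```python
# Buckwalter symbols for two contiguous Unicode ranges, in code point order.
_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"   # U+0621 .. U+063A
_MARKS = "_fqklmnhwYyFNKaui~o"            # U+0640 .. U+0652

BW_TO_AR = {bw: chr(0x0621 + i) for i, bw in enumerate(_LETTERS)}
BW_TO_AR.update({bw: chr(0x0640 + i) for i, bw in enumerate(_MARKS)})
BW_TO_AR["`"] = "\u0670"  # dagger alif
BW_TO_AR["{"] = "\u0671"  # alif wasla
AR_TO_BW = {ar: bw for bw, ar in BW_TO_AR.items()}


def buckwalter_to_arabic(text: str) -> str:
    """Map each Buckwalter symbol to its Arabic character; pass others through."""
    return "".join(BW_TO_AR.get(ch, ch) for ch in text)


def arabic_to_buckwalter(text: str) -> str:
    """Inverse mapping; characters outside the table are left unchanged."""
    return "".join(AR_TO_BW.get(ch, ch) for ch in text)


print(buckwalter_to_arabic("ktAb"))
```

Because the mapping is character by character, it should only be applied to
strings known to be transliterated Arabic; Latin-script tokens passed through
buckwalter_to_arabic would be converted as well.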

As in the past, we used Tim Buckwalter's morphological analyzer to generate
a candidate list of POS values for each word/token, and our annotators
picked the appropriate one manually. The coverage of Tim Buckwalter's
morphological analyzer on this corpus is in ATB3_Coverage_Statistics.doc.

ANNOTATION PROCEDURE

Annotation is stand-off: the sgm files are read-only after
collection/processing.  POS annotation is done only on the text under
the <P> tag. Unlike the previous Arabic treebank releases "Arabic
Treebank: Part 1 v 3.0" [LDC2005T02] and "Arabic Treebank: Part 2 v 2.0"
[LDC2004T02], headlines are annotated in this corpus (ATB3).

First, Tim Buckwalter's lexicon and morphological analyzer were used to
generate a candidate list of POS tags for each word.  (Please note that
some words do not exist in this lexicon.)  The POS annotation task is then
simply to select the correct POS tag from that list.  Once POS annotation
was done, we automatically separated the clitics based on the POS
selection.  The ANNAHAR corpus
contains 400,213 tokens after the separation of clitics (counting all
tokens, including non-Arabic tokens such as punctuation).  We use the
following tags for non-Arabic data: NUM for numerical data, PUNC for
punctuation, and LATIN for non-Arabic alphabetic data.  Then, the data
(i.e., xml files) went through treebank annotation.  After that was done,
we checked for inconsistencies between the treebank and POS annotation.
Many of the inconsistencies were corrected manually by annotators, or
automatically by script where that could be done reliably.  In most
cases, the syntactic annotation was given precedence over the POS
annotation.  In the final treebank output, which is in Penn Treebank
style, NO_FUNC is added as the POS tag for any Arabic word that has no
selected tag, and NON_ALPHABETIC for any non-Arabic token that was left
untagged.
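
The placeholder rule in the last step can be sketched as follows. This is an
illustration only, not the script used to build the release: the function
name is hypothetical, and we assume here that a token counts as Arabic when
it contains a character from the basic Arabic Unicode block, which may differ
from the test actually applied.

```python
import re
from typing import Optional

# Assumption: "Arabic" means the token contains a character in U+0600..U+06FF.
_ARABIC = re.compile(r"[\u0600-\u06FF]")


def final_tag(token: str, selected: Optional[str]) -> str:
    """Return the selected POS tag, or the placeholder tag for untagged tokens."""
    if selected:
        return selected
    return "NO_FUNC" if _ARABIC.search(token) else "NON_ALPHABETIC"


# Invented tokens for illustration; None means no tag was selected.
print(final_tag("2002", "NUM"))
print(final_tag("\u0643\u062a\u0627\u0628", None))
print(final_tag("%", None))
```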


PREVIOUS MAJOR RELEASES 

(a) The Buckwalter Arabic Morphological Analyzer Version 2.0
	Catalog number: LDC2004L02

(b) Arabic Treebank: Part 1 v2.0
	Catalog number: LDC2003T06

(c) Arabic Treebank: Part 1 v3.0
	Catalog number: LDC2005T02
	(A new release of (b) with complete vocalization of all Imperfect 
         Verb mood endings: Indicative, Subjunctive, and Jussive.)

(d) Arabic Treebank: Part 2 v2.0
	Catalog number: LDC2004T02

(e) Arabic Treebank: Part 3 v1.0
	Catalog number: LDC2004T11

(f) Arabic Treebank: Part 3(a) v1.1
	Catalog number: LDC2004E71

(g) Arabic Treebank: Part 3(b) v1.1
	Catalog number: LDC2005E38

CORRECTIONS TO THE CORPUS

We are aware that there are still many imperfections in this release, in
spite of various systematic and individual corrections made.  It is our
belief that there is nothing serious in the remaining errors which will
hinder the use of this treebank.  Our intention is to continue our
correction process and provide version 3.0 as soon as possible.  We trust
that our users will be understanding, and we would very much appreciate
receiving any form of feedback that will help towards that end.  Please
contact us if you need more specific information.

DIRECTORY STRUCTURE

In the data/ directory, you will find the following:

    sgm - Processed source files in sgml format. Please note that there is a 
          parallel text corpus being developed at LDC for these same 600 
          source files.

    xml - The AG xml files containing the POS annotation. The dtd files for
          the AG format are also included there. The xml files are compressed.

    pos - POS annotation output in plain text

    penntree - POS and treebank files in Penn Treebank bracketed list format
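
For readers who have not worked with bracketed trees before, the sketch
below shows one simple way to read such text into nested lists. It is a
generic parser for parenthesized trees, not the LDC tooling; the function
names and the sample tree are invented for illustration, and we assume that
literal parentheses never occur inside a token.

```python
from typing import List, Tuple, Union

Tree = Union[str, List["Tree"]]


def parse_bracketed(text: str) -> List[Tree]:
    """Parse bracketed text into nested lists of strings, one per top-level tree."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack: List[List[Tree]] = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def leaves(tree: Tree) -> List[Tuple[str, str]]:
    """Collect (tag, word) pairs from the preterminal nodes of a parsed tree."""
    if isinstance(tree, str):
        return []
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        return [(tree[0], tree[1])]
    pairs: List[Tuple[str, str]] = []
    for child in tree:
        pairs.extend(leaves(child))
    return pairs


sample = "(S (NP (NOUN ktAb)) (PUNC .))"  # invented example, not corpus data
for parsed in parse_bracketed(sample):
    print(leaves(parsed))
```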


For each of the files in docs/doclist, there are:

*.sgm file in data/sgm
        Arabic in utf-8

*.xml.gz file in data/xml/pos
        Annotation Graph (AG) based annotation xml file with Tim
        Buckwalter's lexicon. POS annotators worked on the xml files
        using LDC developed tools.

*.xml.gz file in data/xml/treebank
        We also provide the xml files after clitic separation.
        Treebank annotators worked on these files using LDC developed 
        tools.

*.tree file in data/penntree/with-vowel
        Penn Treebanking style output
        (Note: Only the selected words have vowels)

*.tree file in data/penntree/without-vowel
        Penn Treebanking style output

*.txt file in data/pos/before-treebank
        POS output in ASCII, except for the Arabic words, which are in utf-8
        Note: This output is from tokens before clitic separation.

*.txt file in data/pos/after-treebank
        POS output in ASCII, except for the Arabic words, which are in utf-8
        (with clitics separated, automatic tag insertion
        for numbers, punctuation and other non-Arabic tokens, and
        extra human annotation for some of the words that
        have no POS solutions)
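
The layout above can be walked programmatically. The sketch below is a
minimal example, not a tool shipped with the corpus: it assumes that each
file name is the document ID from docs/doclist followed by the extension
shown above, and that the compressed xml files are UTF-8. Both assumptions,
and the function names, are ours and should be checked against the data.

```python
import gzip
from pathlib import Path
from typing import Dict


def files_for(doc_id: str, root: Path = Path("data")) -> Dict[str, Path]:
    """Map each annotation layer to its expected path for one document."""
    return {
        "sgm": root / "sgm" / f"{doc_id}.sgm",
        "xml_pos": root / "xml" / "pos" / f"{doc_id}.xml.gz",
        "xml_treebank": root / "xml" / "treebank" / f"{doc_id}.xml.gz",
        "tree_vowel": root / "penntree" / "with-vowel" / f"{doc_id}.tree",
        "tree_no_vowel": root / "penntree" / "without-vowel" / f"{doc_id}.tree",
        "pos_before": root / "pos" / "before-treebank" / f"{doc_id}.txt",
        "pos_after": root / "pos" / "after-treebank" / f"{doc_id}.txt",
    }


def read_text(path: Path) -> str:
    """Read a UTF-8 file, decompressing it first when the name ends in .gz."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return handle.read()


# "EXAMPLE_DOC_ID" is a placeholder, not a real document ID from doclist.
for layer, path in files_for("EXAMPLE_DOC_ID").items():
    print(layer, path)
```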

Note: This release includes all the POS annotation released in
      "Arabic Treebank: Part 3 v 1.0" [LDC2004T11]

Ann Bies, bies@ldc.upenn.edu
Tim Buckwalter, timbuck2@ldc.upenn.edu
Hubert Jin, hubertj@ldc.upenn.edu
Mohamed Maamouri, maamouri@ldc.upenn.edu
Wigdan Mekki, wmekki@ldc.upenn.edu
May 24, 2005