Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + Syntactic Analysis)

Authors: Mohamed Maamouri (Project head), Ann Bies, Tim Buckwalter, Hubert Jin

Annotators: Wigdan El Mekki (Lead Annotator), Ichraf Amghouz, Zohra 
	    Bentaouit, Fatima Chebchoub, Fatima El Himyani, Rachida 
	    Fathallah, Alexa Firat, Tasneem Ghandour, Niama Laadioui, 
	    Mohamed Mansour, Sarah Tlili, Gordon Witty, Dalel Zakhary


PROJECT GOAL

To support the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research on
Modern Standard Arabic in general, the LDC was sponsored to develop an
Arabic Treebank of 1,000,000 words.  This corpus is a re-release of part 
one of that project, with the addition in this version 3.0 of improved 
morphological/part-of-speech annotation (including full vocalization and 
case endings). 


SOURCE DATA

The project targets the description of a written Modern Standard Arabic
corpus from the Agence France Presse (AFP) newswire archives for
July-November 2000 (files dated 20000715 to 20001115).  This corpus
includes 734 stories representing 145,386 words (166,068 tokens after
clitic segmentation in the Treebank; the number of Arabic tokens is
123,796).  For this work, annotators must be native speakers of Arabic, 
and they must understand enough linguistics to check morphosyntactic 
analysis and build syntactic structures.


LEXICON

Tim Buckwalter's transliteration system, which we use for this corpus, is
described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.

As in the past, we used Tim Buckwalter's morphological analyzer to generate
a candidate list of POS values for each word/token, and our annotators
picked the appropriate one manually. The coverage of Tim Buckwalter's
morphological analyzer on this corpus is in ATB1v3.0_Coverage_Statistics.txt.


CORPUS DESCRIPTION

Treebanks are language resources that provide annotations of natural
languages at various levels of structure: at the word level, the phrase
level, and the sentence level. Treebanks have become crucially important
for the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research in
general. 

This corpus is designed for those who study and use languages either
professionally or academically, and who need text corpora in their
work. The Penn Arabic Treebank is particularly suitable for language
developers, computational linguists and computer scientists who are
interested in various aspects of natural language processing. 

The Penn Arabic Treebank, which is part of the DARPA TIDES project, started
in the Fall of 2001 with the objective of annotating a large
machine-readable Arabic text corpus, both manually and automatically (see
project background at
http://www.ldc.upenn.edu/Projects/TIDES/Arabic/data/POS/POStest.html).  As
in previous Penn Treebanks, two different kinds of information need to be
produced by two different (human and computer) processes. The Arabic
Treebank project consists therefore of two distinct phases: (a)
Part-of-Speech (=POS) tagging which divides the text into lexical tokens,
and gives relevant information about each token such as lexical category,
inflectional features, a gloss, and now in this version 3.0 also full 
vocalization including case endings, and (b) Arabic Treebanking (=ArabicTB)
which characterizes the constituent structures of word sequences, provides
categories for each non-terminal node, and identifies null elements,
co-reference, traces, etc.  Both tasks started in November 2001 with an
initial pilot consisting of 734 files representing roughly 166K words of
written Modern Standard Arabic newswire from the Agence France Presse
corpus.


ANNOTATION PROCEDURE

We did stand-off annotation on the AFP data.  The sgm files are read-only
after the collection/processing described in technical-characteristics.txt. 
POS and treebanking annotation are done only on the text under the <P> tag. 
The headline is not annotated for either part-of-speech or syntactic 
structure.
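
As an illustration of the scope described above, the sketch below pulls out
only the text under the <P> tags and ignores everything else, including the
headline. It is a hedged example, not the processing used for this release:
it assumes that each <P> has a matching </P>, and the sample document layout
(including the HEADLINE and TEXT tag names) is invented for the example; see
technical-characteristics.txt for the actual file format.

```python
import re

# Assumption: paragraphs are delimited by explicit <P> ... </P> pairs.
P_BLOCK = re.compile(r"<P>(.*?)</P>", re.DOTALL | re.IGNORECASE)


def annotated_paragraphs(sgm_text):
    """Return the whitespace-normalized text of each <P> element."""
    return [" ".join(body.split()) for body in P_BLOCK.findall(sgm_text)]


# Invented sample; real .sgm files contain Arabic text in utf-8.
SAMPLE = """<DOC>
<HEADLINE>
headline text
</HEADLINE>
<TEXT>
<P>
first paragraph
</P>
<P>
second
paragraph
</P>
</TEXT>
</DOC>
"""

if __name__ == "__main__":
    for paragraph in annotated_paragraphs(SAMPLE):
        print(paragraph)
```

When reading a real file, open it with encoding="utf-8", since the .sgm files
store the Arabic text in utf-8.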

First, Tim Buckwalter's lexicon and morphological analyzer were used to
generate a candidate list of POS tags for each word. (Please note
that some words do not exist in this lexicon.) The POS task is just to
select the correct POS tag.  There is a NUM tag for numerical data, and 
a PUNC tag for punctuation. For any words that do not have appropriate
tags assigned, we put NONE_OF_THE_ABOVE in the corresponding POS fields 
in the AG XML files.

Once POS annotation was done, we automatically separated the clitics based
on the POS selection. Then, the data (i.e., xml files) went through treebank
annotation. After that was done, we checked for inconsistencies
between the treebank and POS annotation. Many of the inconsistencies
were corrected manually by annotators, or automatically by script where
this could be done reliably. In the Penn Style treebank output,
we assigned a NO_FUNC tag for any token with a NONE_OF_THE_ABOVE 
notation in the POS field.
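
The relationship between the two placeholder values can be sketched as
follows. This is a hypothetical illustration rather than the script used for
this release: it assumes the POS annotation has already been read into
(token, POS) pairs, and the sample tokens are invented. Only the values
NONE_OF_THE_ABOVE, NO_FUNC, NUM and PUNC come from the description above.

```python
UNRESOLVED_POS = "NONE_OF_THE_ABOVE"  # POS field value for unanalyzed words
TREE_PLACEHOLDER = "NO_FUNC"          # tag used in the Penn-style tree output


def tree_tag(pos_field):
    """Return the tag to show in the tree output for a given POS field."""
    return TREE_PLACEHOLDER if pos_field == UNRESOLVED_POS else pos_field


def unresolved_tokens(tagged):
    """List the tokens whose POS field carries the unresolved marker."""
    return [token for token, pos in tagged if pos == UNRESOLVED_POS]


if __name__ == "__main__":
    # Invented (token, POS) pairs for illustration only.
    sample = [("w1", "NONE_OF_THE_ABOVE"), ("12", "NUM"), (".", "PUNC")]
    print([(token, tree_tag(pos)) for token, pos in sample])
    print(unresolved_tokens(sample))
```

A check of this kind is a quick way to see how many tokens in a file carry no
usable POS analysis before training or evaluating on it.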

For this update, the POS annotation was entirely redone, to include
full vocalization, case endings, and the up-to-date lexicon coverage.
The treebank annotation was not redone and has been carried over
automatically from the previous version.


PREVIOUS RELEASES

Arabic Treebank: Part 1 v 2.0,  LDC Catalog No.: LDC2003T06
Arabic Treebank: Part 2 v 2.0,  LDC Catalog No.: LDC2004T02
Arabic Treebank: Part 3 v 1.0,  LDC Catalog No.: LDC2004T11
Arabic Treebank: Part 3(a) v 1.1,  LDC Catalog No.: LDC2004E71


CORRECTIONS TO THE CORPUS

We are aware that there are still many imperfections in this release, in
spite of various systematic and individual corrections made.  We believe
that none of the remaining errors is serious enough to hinder the use of
this treebank.  We trust that our users will be understanding, and we would
very much appreciate any feedback that will help us correct the corpus.
Please contact us if you need more specific information.


DIRECTORY STRUCTURE

In the data/ directory:

For each of the files in docs/doclist, there are:

*.sgm file in data/sgm
        Arabic in utf-8

*.xml file in data/AG_xml/pos 
        Annotation Graph (AG) based annotation xml file with Tim
	Buckwalter's lexicon. POS annotators worked on the xml 
	files using LDC developed tools.

*.xml file in data/AG_xml/treebank
	After the POS annotation, clitics separation is applied 
	on the tokens in the POS AG_xml files for treebank purposes.
	Treebank annotators may also use the treebank tool to correct 
	some POS errors and/or clitics separation errors. Again,
	all annotations (clitics-separated POS and Treebank) are 
	kept in these AG based XML files.

*.tree file in data/treebank/with-vowel
        Penn Treebanking style output
        (Note: Only the selected words have vowels)

*.tree file in data/treebank/without-vowel
        Penn Treebanking style output

*.txt file in data/pos/before-treebank
        POS output in ASCII except the Arabic words in utf-8

*.txt file in data/pos/after-treebank
        POS output in ASCII except the Arabic words in utf-8
        (with clitics separated, automatic tag insertion
         for number, punctuation and non-Arabic material,
         and extra human annotation for some of the words
         that have no POS solutions)

In the appendix/ directory:
	There are some Python scripts that we used to extract POS
	and Treebank information from the AG based XML files. Also
	provided is a Python script used to do clitics separation.

The script we used to generate the Penn English Treebank style output and
the POS output is in appendix/, for users who prefer not to use the
AG-based .xml files.  However, we recommend that people use the AG files,
as they contain other important information in the full annotation such as
the English gloss and the annotators' comments.
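
For users who nevertheless work from the .tree files, the sketch below reads
one bracketed tree in the general Penn Treebank style and lists its leaves
with their tags. It is not the script in appendix/. It assumes that labels and
tokens contain no whitespace and no literal parentheses, and the sample tree
is invented for illustration.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either sub-trees or leaf strings. Unbalanced input is not
    handled gracefully in this sketch.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the end of the tree")
    return tree


def leaves(tree):
    """Return (tag, token) pairs for the leaves, in left-to-right order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((label, child))
        else:
            pairs.extend(leaves(child))
    return pairs


if __name__ == "__main__":
    sample = "(S (NP (NUM 12)) (PUNC .))"  # invented example tree
    print(parse_tree(sample))
    print(leaves(sample and parse_tree(sample)))
```

Nested tuples keep the sketch dependency-free. A file holding several trees
would first need to be split into individual top-level bracketings.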

In the docs/ directory:

More detailed information about the part-of-speech corpus and annotation
process can be found in POS-info.txt, and skeletal annotation guidelines
can be found in guidelines-POS-1-28-03.pdf.  An explanation of how to
convert the Arabic POS tags to the old-style Penn English Treebank POS 
tags is in taglist-conversion-to-PennPOS.lisp.  The coverage of Tim 
Buckwalter's morphological analyzer on this corpus is in 
ATB1v3.0_Coverage_Statistics.txt.  Information on the POS taglist changes, 
selected tags, and their frequency can be found in the taglist_* files.

More detailed information about the treebanked/parsed tree corpus and its
annotation process can be found in TBParsing-info.txt, and draft annotation
guidelines can be found in guidelines-TB-1-28-03.pdf.  Updates will be
available on the LDC website and at www.ircs.upenn.edu/arabic.

The technical characteristics of the AFP corpus are described in
technical-characteristics.txt. 


----------------------------------------
Ann Bies, bies@ldc.upenn.edu
Tim Buckwalter, timbuck2@ldc.upenn.edu
Hubert Jin, hubertj@ldc.upenn.edu
Mohamed Maamouri, maamouri@ldc.upenn.edu
November 30, 2004