Corpus Documentation for Arabic Treebank: Part 1 v 2.0 
1/28/03


PROJECT GOAL

To support the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research on
Modern Standard Arabic in general, the LDC was sponsored to develop an
Arabic Treebank of 1,000,000 words.  This corpus is part one of that
project.


CORPUS DESCRIPTION

Treebanks are language resources that provide annotations of natural
languages at various levels of structure: at the word level, the phrase
level, and the sentence level. Treebanks have become crucially important
for the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research in
general. 

This corpus is designed for those who study and use languages either
professionally or academically, and who need text corpora in their
work. The Penn Arabic Treebank is particularly suitable for language
developers, computational linguists and computer scientists who are
interested in various aspects of natural language processing. 

The Penn Arabic Treebank, which is part of the DARPA TIDES project, started
in the Fall of 2001 with the objective of annotating a large
machine-readable Arabic text corpus through a combination of human and
automatic annotation (see project background at the following URL:
http://www.ldc.upenn.edu/Projects/TIDES/Arabic/data/POS/POStest.html).  As
in previous Penn Treebanks, two different kinds of information need to be
produced by two different (human and computer) processes. The Arabic
Treebank project consists therefore of two distinct phases: (a)
Part-of-Speech (=POS) tagging which divides the text into lexical tokens,
and gives relevant information about each token such as lexical category,
inflectional features, and a gloss, and (b) Arabic Treebanking (=ArabicTB)
which characterizes the constituent structures of word sequences, provides
categories for each non-terminal node, and identifies null elements,
co-reference, traces, etc.  Both tasks started in November 2001 with an
initial pilot consisting of 734 files representing roughly 166K words of
written Modern Standard Arabic newswire from the Agence France Presse
corpus.


SOURCE DATA

The project targets the description of a written Modern Standard Arabic
corpus from the Agence France Presse (AFP) newswire archives for
July-November 2000 (files dated 20000715 to 20001115).  This corpus
includes 734 stories representing 140,265 words (168,123 tokens after
clitic segmentation in the Treebank).  For this work, annotators must be
native speakers of Arabic and they must understand enough linguistics to
check morphosyntactic analysis and build syntactic structures.


ANNOTATION PROCEDURE

We performed stand-off annotation on the AFP data.  The sgm files are
read-only after the collection/processing described in
technical-characteristics.txt. POS and treebank annotation was done only
on the text under the <P> tag. The headline is not annotated for either
part-of-speech or syntactic structure.

First, Tim Buckwalter's lexicon and morphological analyzer were used to
generate a candidate list of POS tags for each word. (Please note that some
words do not exist in this lexicon.) The POS task was simply to select the
correct POS tag.  Once POS tagging was done, we automatically separated the
clitics based on the POS selection. Also, we added a NUM tag for numerical data,
PUNC for punctuation and NON_ARABIC for other tokens that are not Arabic
(which could include English letters, or a combination of digits,
punctuation and letters). At this stage, NO_FUNC was added as a POS tag for
any Arabic word that had no selected tag, and NON_ALPHABETIC for any
untagged non-Arabic word. Then, the data (i.e., xml files) went through
treebank annotation. After that was done, we checked for inconsistencies
between the treebank and POS annotation. Many of the inconsistencies were
corrected manually by annotators, or automatically by script where that
could be done reliably.
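
The automatic tag insertion described above can be sketched as follows.
This is an illustration only, not the script used for the release.  The
function names are hypothetical, and the exact rules that separate NUM,
PUNC, NON_ARABIC and NON_ALPHABETIC are assumptions, since this document
does not spell them out.

```python
import re
import unicodedata

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")  # assumed shape of "numerical data"


def has_arabic_letter(token):
    """True if any character is a letter in the basic Arabic Unicode block."""
    return any(
        "\u0600" <= ch <= "\u06FF" and unicodedata.category(ch).startswith("L")
        for ch in token
    )


def fallback_tag(token, selected_tag=None):
    """Return the annotator's tag if there is one, else an automatic tag."""
    if selected_tag is not None:
        return selected_tag
    if has_arabic_letter(token):
        return "NO_FUNC"          # Arabic word with no selected tag
    if NUMBER.fullmatch(token):
        return "NUM"
    if token and all(unicodedata.category(ch).startswith("P") for ch in token):
        return "PUNC"
    if any(ch.isalpha() for ch in token):
        return "NON_ARABIC"       # assumption: non-Arabic token with letters
    return "NON_ALPHABETIC"       # assumption: everything else


print(fallback_tag("AFP"))    # NON_ARABIC
print(fallback_tag("2000"))   # NUM
```

The order of the checks matters in this sketch: Arabic words are handled
first so that a token mixing Arabic letters with digits is never tagged
NUM.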

Tim Buckwalter's transliteration system, which we use for this corpus, is
described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.
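
For readers who want to move between the utf-8 Arabic and the
transliteration, a minimal sketch of the mapping is given below.  It covers
the standard one-to-one Buckwalter symbols as commonly documented; the page
above remains the authoritative table, and any extra symbols used in this
corpus are not covered.  The function and variable names are hypothetical.

```python
def _span(first_codepoint, symbols):
    """Map consecutive Unicode code points to consecutive symbols."""
    return {chr(first_codepoint + i): s for i, s in enumerate(symbols)}


ARABIC_TO_BW = {
    **_span(0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
    **_span(0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel .. sukun
    **_span(0x0670, "`{"),                          # dagger alif, alif wasla
}
BW_TO_ARABIC = {bw: ar for ar, bw in ARABIC_TO_BW.items()}


def to_buckwalter(text):
    """Transliterate Arabic characters; leave everything else unchanged."""
    return "".join(ARABIC_TO_BW.get(ch, ch) for ch in text)


def from_buckwalter(text):
    """Inverse mapping; only safe on text that is entirely transliteration."""
    return "".join(BW_TO_ARABIC.get(ch, ch) for ch in text)


# kaf, teh, alef, beh
print(to_buckwalter("\u0643\u062a\u0627\u0628"))  # ktAb
```

Note that from_buckwalter() should not be applied to lines that mix
transliteration with ordinary ASCII text (tags, English glosses), because
it would convert those characters as well.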


PREVIOUS RELEASES 

(a) Intermediate provisional releases of Arabic POS, TB, and tools include: 
	- 140K POS data to BBN, IBM and JHU
	- 118K TB1 data to BBN
	- Use of LDC ARB TreeEditor and the 140K TB-tagged corpus at BBN for
	  creation of Arabic FactBrowser 
	- Ported the POS annotation tool to Windows and shared with the
	  Prague Charles University Arabic Treebank Group 

(b) E-release of ATB Part 1 (provisionally annotated POS + TB) under the
following: 
	Title: Arabic Treebank: Part 1 v 1.0
	Catalog number: LDC2002E55
	ftp distribution

(c) The Buckwalter Arabic Morphological Analyzer Version 1.0
Created by Tim Buckwalter at Qamus for POS-tagging Arabic text, the
analyzer consists primarily of three Arabic-English lexicon files:
prefixes, suffixes, and stems.  The lexicons are supplemented by three
morphological compatibility tables used for controlling prefix-stem
combinations, stem-suffix combinations, and prefix-suffix combinations.
The LDC is releasing this software under the GNU General Public License:
http://www.gnu.org/copyleft/gpl.html
For information on commercial use, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49
The Buckwalter Arabic Morphological Analyzer can be downloaded for free
from the above link.  If you would like a copy placed on CD-ROM, please
note that there is a $100 media charge.
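
The way three lexicons and three compatibility tables can work together is
sketched below.  This is a toy illustration of the general idea, not the
analyzer's code: every lexicon entry, category name and table entry here is
invented, and the real files have their own formats and contents.

```python
# Toy lexicons: surface form (in transliteration) -> list of categories.
# All entries and category names are invented for illustration.
PREFIXES = {"": ["P0"], "w": ["Pconj"], "Al": ["Pdet"], "wAl": ["Pconj_det"]}
STEMS = {"ktAb": ["Nstem"], "ktb": ["Vstem"]}
SUFFIXES = {"": ["S0"], "At": ["Splural"], "t": ["Sverb"]}

# Toy compatibility tables: sets of category pairs allowed to combine.
PREFIX_STEM = {("P0", "Nstem"), ("P0", "Vstem"), ("Pconj", "Nstem"),
               ("Pconj", "Vstem"), ("Pdet", "Nstem"), ("Pconj_det", "Nstem")}
STEM_SUFFIX = {("Nstem", "S0"), ("Nstem", "Splural"),
               ("Vstem", "S0"), ("Vstem", "Sverb")}
PREFIX_SUFFIX = {("P0", "S0"), ("P0", "Splural"), ("P0", "Sverb"),
                 ("Pconj", "S0"), ("Pconj", "Splural"), ("Pconj", "Sverb"),
                 ("Pdet", "S0"), ("Pdet", "Splural"),
                 ("Pconj_det", "S0"), ("Pconj_det", "Splural")}


def analyses(word):
    """Return every (prefix, stem, suffix) split licensed by all three tables."""
    results = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            for p_cat in PREFIXES.get(prefix, []):
                for s_cat in STEMS.get(stem, []):
                    for x_cat in SUFFIXES.get(suffix, []):
                        if ((p_cat, s_cat) in PREFIX_STEM
                                and (s_cat, x_cat) in STEM_SUFFIX
                                and (p_cat, x_cat) in PREFIX_SUFFIX):
                            results.append((prefix, stem, suffix))
    return results


print(analyses("wAlktAb"))  # [('wAl', 'ktAb', '')]
print(analyses("Alktbt"))   # []
```

In the second call the split Al + ktb + t is found in the toy lexicons but
rejected, because the pair (Pdet, Vstem) is absent from the prefix-stem
table.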


CORRECTIONS TO THE CORPUS

We are aware that there are still many imperfections in this release, in
spite of the various systematic and individual corrections made.  We
believe that none of the remaining errors is serious enough to hinder the
use of this treebank.  Our intention is to continue our correction process
and provide version 3.0 as soon as possible.  We trust
that our users will be understanding, and we would very much appreciate
receiving any form of feedback that will help towards that end.  Please
contact us if you need more specific information.


DIRECTORY STRUCTURE

In the data/ directory:

For each of the files in docs/doclist, there are:

*.sgm files in data/sgm
        Arabic in utf-8

*.xml files in data/AG_xml
        Annotation Graph (AG)-based annotation xml files with Tim
        Buckwalter's lexicon. 
        POS and treebanking annotators worked on the xml files
        using LDC-developed tools.

*.tree files in data/treebank/with-vowel
        Penn Treebanking style output
        (Note: Only the selected words have vowels)

*.tree files in data/treebank/without-vowel
        Penn Treebanking style output

*.txt files in data/pos/before-treebank
        POS output in ASCII except the Arabic words in utf-8

*.txt files in data/pos/after-treebank
        POS output in ASCII except the Arabic words in utf-8
        (with clitics separated, automatic tag insertion
         for numbers, punctuation and non-Arabic tokens,
         and extra human annotation for some of the words
         that have no POS solutions)
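
A short sketch of how the layout above might be checked for completeness
follows.  It rests on two assumptions that this document does not confirm:
that docs/doclist holds one document ID per line, and that each file is
named after its document ID plus the extension shown.  The function names
and the example ID are hypothetical.

```python
from pathlib import Path

# label -> (subdirectory, extension), taken from the listing above
LAYOUT = {
    "sgm": ("data/sgm", ".sgm"),
    "ag_xml": ("data/AG_xml", ".xml"),
    "tree_with_vowel": ("data/treebank/with-vowel", ".tree"),
    "tree_without_vowel": ("data/treebank/without-vowel", ".tree"),
    "pos_before": ("data/pos/before-treebank", ".txt"),
    "pos_after": ("data/pos/after-treebank", ".txt"),
}


def expected_paths(root, doc_id):
    """Paths we would expect for one document (assumed naming scheme)."""
    return {
        label: Path(root) / subdir / (doc_id + ext)
        for label, (subdir, ext) in LAYOUT.items()
    }


def read_doclist(root):
    """Read document IDs, assuming one ID per non-empty line."""
    with open(Path(root) / "docs" / "doclist", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def missing_files(root):
    """List expected files that are not present under root."""
    return [
        path
        for doc_id in read_doclist(root)
        for path in expected_paths(root, doc_id).values()
        if not path.exists()
    ]


paths = expected_paths("atb1", "EXAMPLE_DOC_ID")  # made-up ID
print(paths["tree_with_vowel"])
```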

In the appendix/ directory:

The script we used to generate the Penn English Treebank style output and
the POS output is in appendix/bin, for users who prefer not to use the
AG-based .xml files.  However, we recommend that people use the AG files,
as they contain other important information in the full annotation such as
the English gloss and the annotators' comments.
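
For users who do work from the .tree files, a minimal reader for bracketed
trees is sketched below.  It assumes the usual Penn Treebank bracketing, in
which parentheses appear only as structure and never inside a token; the
exact layout of the files in this release is not described here, and the
example tree and its labels are invented.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a terminal (word) string
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return (tag, word) pairs in left-to-right order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
        else:
            pairs.append((label, child))
    return pairs


example = "(S (VP (VERB ktb) (NP (NOUN ktAb))))"  # invented example
print(leaves(parse_tree(example)))  # [('VERB', 'ktb'), ('NOUN', 'ktAb')]
```

Unbalanced input raises an IndexError in this sketch; a production reader
would want clearer error handling.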

In the docs/ directory:

More detailed information about the part-of-speech corpus and annotation
process can be found in POS-info.txt, and skeletal annotation guidelines
can be found in guidelines-POS-1-28-03.pdf.  An explanation of how to
convert the Arabic POS tags to the old-style Penn English Treebank POS tags
is in arabic-POStags-collapse-to-PennPOStags.txt.

More detailed information about the treebanked/parsed tree corpus and its
annotation process can be found in TBParsing-info.txt, and draft annotation
guidelines can be found in guidelines-TB-1-28-03.pdf.  Updates will be
available on the LDC website and at www.ircs.upenn.edu/arabic.

The technical characteristics of the AFP corpus are described in
technical-characteristics.txt. 


----------------------------------------
Ann Bies, bies@ldc.upenn.edu
Tim Buckwalter, timbuck2@ldc.upenn.edu
Hubert Jin, hubertj@ldc.upenn.edu
Mohamed Maamouri, maamouri@ldc.upenn.edu
January 28, 2003