Corpus Documentation for Arabic Treebank: Part 2 v 2.0
1/13/04

Authors: Mohamed Maamouri (Project head), Ann Bies, Tim Buckwalter, Hubert
Jin

Annotators: Wigdan Mekki (Lead Annotator), Tasneem Ghandour, Ichraf
Amghouz, Zohra Bentaouit, Nourredine Bessaidi, Rachida Fathallah, Niama
Laadioui, Abid Labidi, Dalal Zakhary


PROJECT GOAL

To support the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research on
Modern Standard Arabic in general, the LDC was sponsored to develop an
Arabic Treebank of 1,000,000 words.  This corpus is part two of that
project.


PROJECT AND CORPUS DESCRIPTION: Penn Arabic Treebank (ATB)

Treebanks are language resources that provide annotations of natural
languages at various levels of structure: at the word level, the phrase
level, and the sentence level. Treebanks have become crucially important
for the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research in
general. 

This corpus is designed for those who study and use languages either
professionally or academically, and who need text corpora in their
work. The Penn Arabic Treebank is particularly suitable for language
developers, computational linguists and computer scientists who are
interested in various aspects of natural language processing. 

The Penn Arabic Treebank, which is part of the DARPA TIDES project, started
in the Fall of 2001 with the objective of annotating a large
machine-readable Arabic text corpus, both manually and automatically (see
project background at
http://www.ldc.upenn.edu/Projects/TIDES/Arabic/data/POS/POStest.html).  As
in previous Penn Treebanks, two different kinds of information need to be
produced by two different (human and computer) processes. The Arabic
Treebank project therefore consists of two distinct phases: (a)
Part-of-Speech (=POS) tagging, which divides the text into lexical tokens
and gives relevant information about each token such as lexical category,
inflectional features, and a gloss (referred to as POS for convenience,
although it includes morphological and gloss information not traditionally
included with part-of-speech annotation), and (b) Arabic Treebanking
(=ArabicTB) which characterizes the constituent structures of word
sequences, provides categories for each non-terminal node, and identifies
null elements, co-reference, traces, etc.  

Both tasks started in November 2001 with an initial pilot consisting of 734
files representing roughly 166K words of written Modern Standard Arabic
newswire from the Agence France Presse corpus, which has since been
released as Arabic Treebank: Part 1 v 2.0, LDC Catalog No. LDC2003T06.

The current Arabic Treebank: Part 2 corpus consists of stories from
Al-Hayat distributed by Ummah.  The Arabic Treebank: Part 2 is referred to
as UMAAH (for UMmAh's Al-Hayat).  New features of annotation in UMAAH
include complete vocalization (including case endings), lemma IDs, and
more specific POS tags for verbs and particles.


SOURCE DATA

This corpus includes 501 stories from the Ummah Arabic News Text. There are
a total of 144,199 words (counting non-Arabic tokens such as numbers and
punctuation) in the 501 files - one story per file.  For this work,
annotators must be native speakers of Arabic and they must understand
enough linguistics to check morphosyntactic analysis and build syntactic
structures.

Tim Buckwalter's Arabic morphological analysis tool is used to generate a
list of candidate analyses for the POS annotation. It now includes full
vowelization and case endings.

The UMAAH corpus contains 125,698 Arabic-only word tokens (prior to the
separation of clitics), of which 124,740 (99.24%) were provided with an
acceptable morphological analysis and POS tag by the morphological parser,
and 958 (0.76%) were items that the morphological parser failed to analyze
correctly.

=====================================
items with solution    124740  99.24%
items with no solution    958   0.76%
-------------------------------------
total                  125698 100.00%
=====================================
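
The percentages in the table above follow directly from the two counts.
As a minimal sketch (the function name coverage_table is ours and is not
part of the release), they can be recomputed as follows:

```python
def coverage_table(solved, unsolved):
    """Return (label, count, percent) rows for analyzer coverage.

    Percentages are relative to the total and rounded to two decimals,
    matching the precision used in the table above.
    """
    total = solved + unsolved
    rows = [
        ("items with solution", solved),
        ("items with no solution", unsolved),
        ("total", total),
    ]
    return [(label, n, round(100.0 * n / total, 2)) for label, n in rows]


if __name__ == "__main__":
    # Counts taken from the table above.
    for label, n, pct in coverage_table(124740, 958):
        print(f"{label:<24}{n:>8}{pct:>8.2f}%")
```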


LEXICON

Tim Buckwalter's transliteration system, which we use for this corpus, is
described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.


ANNOTATION PROCEDURE

The annotation is stand-off: the sgm files are read-only after the
collection/processing described in technical-characteristics.txt. POS
annotation is done only on the text under the <P> tag.  The headline is not
annotated for either part-of-speech or syntactic structure.

First, Tim Buckwalter's lexicon and morphological analyzer were used to
generate a candidate list of POS tags for each word.  (Please note that
some words do not exist in this lexicon.)  The POS annotation task was then
simply to select the correct POS tag.  Once POS annotation was done, we
automatically separated the clitics based on the POS selection.  The UMAAH
corpus contains 168,297 tokens after the separation of clitics (counting
all tokens, including non-Arabic tokens such as punctuation).  We use the
following tags for non-Arabic data: NUM for numerical data, PUNC for
punctuation, and LATIN for non-Arabic alphabetic data.  Then, the data
(i.e., xml files) went through treebank annotation.  After that was done,
we checked for inconsistencies between the treebank and POS annotation.
Many of the inconsistencies were corrected, either manually by annotators
or automatically by script where this could be done reliably.  In most
cases, the syntactic annotation was given precedence over the POS
annotation.  In the final treebank output, which is in Penn Treebank
style, NO_FUNC is added as a POS tag for any Arabic word that has no
selected tag, and NON_ALPHABETIC for any untagged non-Arabic token, should
one occur.
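
The tag assignment for tokens without a human-selected tag can be
illustrated with a short sketch.  This is not the script used to build the
release: the function names, the character tests, and the order of the
checks are our assumptions, and only the five tag names (NUM, PUNC, LATIN,
NO_FUNC, NON_ALPHABETIC) come from the description above.

```python
import unicodedata


def has_arabic_letter(token):
    """True if any character is named 'ARABIC LETTER ...' in Unicode."""
    return any(
        unicodedata.name(ch, "").startswith("ARABIC LETTER") for ch in token
    )


def default_tag(token, selected_tag=None):
    """Return a POS tag for a token (hypothetical helper).

    A tag selected by an annotator always wins.  Otherwise the token is
    classified by its characters; the exact tests are assumptions.
    """
    if selected_tag is not None:
        return selected_tag
    if has_arabic_letter(token):
        # Arabic word with no selected tag.
        return "NO_FUNC"
    # Digits, optionally with common numeric separators (assumed set).
    if any(ch.isdigit() for ch in token) and all(
        ch.isdigit() or ch in ".,-/:%" for ch in token
    ):
        return "NUM"
    # Every character is in a Unicode punctuation category (P*).
    if token and all(unicodedata.category(ch).startswith("P") for ch in token):
        return "PUNC"
    # Alphabetic but not Arabic.
    if token and all(ch.isalpha() for ch in token):
        return "LATIN"
    # Anything left over among the non-Arabic tokens.
    return "NON_ALPHABETIC"
```

For example, under these assumptions default_tag("2004") yields NUM,
default_tag(".") yields PUNC, and default_tag("AFP") yields LATIN.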


PREVIOUS RELEASES 

(a) E-release of ATB Part 2 (provisionally annotated POS only) under the
following: 
	Title: Arabic Treebank: Part 2 v 1.0
	Catalog number: LDC2003E17
	ftp distribution

(b) E-release of ATB Part 2 (provisionally annotated POS + TB) under the
following: 
	Title: Arabic Treebank: Part 2 v 1.1
	Catalog number: LDC2003E24
	ftp distribution

(c) The Buckwalter Arabic Morphological Analyzer Version 1.0
	Catalog number: LDC2002L49
	http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49


CORRECTIONS TO THE CORPUS

We are aware that there are still many imperfections in this release, in
spite of the various systematic and individual corrections made.  We
believe that none of the remaining errors is serious enough to hinder the
use of this treebank.  Our intention is to continue our correction process
and provide version 3.0 as soon as possible.  We trust that our users will
be understanding, and we would very much appreciate receiving any form of
feedback that will help towards that end.  Please contact us if you need
more specific information.


DIRECTORY STRUCTURE

In the data/ directory:

For each of the files in docs/doclist, there are:

*.sgm files in data/sgm
        Arabic in utf-8

*.xml files in data/AG_xml
        Annotation Graph (AG) based annotation xml file with Tim
        Buckwalter's lexicon. 
        POS and treebanking annotators worked on the xml files
        using LDC developed tools.

*.tree files in data/treebank/with-vowel
        Penn Treebanking style output
        (Note: Only the selected words have vowels)

*.tree files in data/treebank/without-vowel
        Penn Treebanking style output

*.txt files in data/pos/before-treebank
        POS output in ASCII except the Arabic words in utf-8

*.txt files in data/pos/after-treebank
        POS output in ASCII except the Arabic words in utf-8
        (with clitics separated, automatic tag insertion
         for numbers, punctuation and other non-Arabic tokens,
         and extra human annotation for some of the words
         that have no POS solutions)
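
As a quick sanity check after unpacking, the layout above can be verified
with a short sketch such as the following.  The subdirectory names and
file extensions are taken from the list above; the function name and the
choice to report None for a missing directory are ours.

```python
from pathlib import Path

# Subdirectory (relative to the corpus root) -> file pattern, as listed
# in the directory description above.
EXPECTED = {
    "data/sgm": "*.sgm",
    "data/AG_xml": "*.xml",
    "data/treebank/with-vowel": "*.tree",
    "data/treebank/without-vowel": "*.tree",
    "data/pos/before-treebank": "*.txt",
    "data/pos/after-treebank": "*.txt",
}


def count_corpus_files(root):
    """Count matching files in each expected subdirectory.

    Returns a dict mapping each subdirectory to its file count, or to
    None if the subdirectory does not exist under root.
    """
    root = Path(root)
    counts = {}
    for subdir, pattern in EXPECTED.items():
        directory = root / subdir
        if directory.is_dir():
            counts[subdir] = len(list(directory.glob(pattern)))
        else:
            counts[subdir] = None
    return counts


if __name__ == "__main__":
    # "." is a placeholder for wherever the corpus was unpacked.
    for subdir, n in count_corpus_files(".").items():
        print(f"{subdir:<32}{'missing' if n is None else n}")
```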


In the bin/ directory:

The script we used to generate the Penn English Treebank style output and
the POS output is in bin, for users who prefer not to use the
AG-based .xml files.  However, we recommend that people use the AG files,
as they contain other important information in the full annotation such as
the English gloss and the annotators' comments.
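
For users who nevertheless work from the bracketed .tree files, the
following is a minimal sketch of a reader for Penn Treebank style
bracketing.  It assumes the usual "(LABEL child ...)" notation, optionally
wrapped in an unlabeled outer bracket; it is not the script in bin, and the
example tree and its tags in the test string are invented for illustration.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of
# characters containing neither whitespace nor brackets.
_TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaf words are kept as plain strings inside the children list.  An
    unlabeled bracket, as in "( (S ...))", gets the empty label "".
    """
    tokens = _TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Yield (tag, word) pairs for every preterminal node, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, children[0])
    else:
        for child in children:
            if not isinstance(child, str):
                yield from leaves(child)
```

Calling list(leaves(parse_tree(text))) on one bracketed tree gives the
sequence of tag and word pairs, which is often all that is needed for
tagging experiments.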

In the docs/ directory:

More detailed information about the part-of-speech corpus and annotation
process can be found in POS-info.txt, and skeletal annotation guidelines
can be found in guidelines-POS-1-28-03.pdf.  A table for converting the
Arabic POS tags for UMAAH to the old-style Penn English Treebank POS tags
is in arabicUMAAH-POStags-collapse-to-PennPOStags.txt.

More detailed information about the treebanked/parsed tree corpus and its
annotation process can be found in TBParsing-info.txt, and draft annotation
guidelines can be found in guidelines-TB-1-28-03.pdf.  Updates will be
available on the LDC website and at www.ircs.upenn.edu/arabic.

The technical characteristics of the UMAAH corpus are described in
technical-characteristics.txt. 


----------------------------------------
Ann Bies, bies@ldc.upenn.edu
Tim Buckwalter, timbuck2@ldc.upenn.edu
Hubert Jin, hubertj@ldc.upenn.edu
Mohamed Maamouri, maamouri@ldc.upenn.edu
January 13, 2004