Arabic Treebank part 3 - v3.2
CatalogID: 
Release date: January 28, 2010
Linguistic Data Consortium
Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna, Fatma Gaddeche, Wajdi Zaghouani

1. Introduction

This version of the Arabic Treebank part 3 - v3.2 is an incremental
update to the January 2009 release of Arabic Treebank part 3 - v3.1 to
the GALE community (LDC2008E22), and a significant revision over the
previous general catalog release of ATB3-v2.0 (LDC2005T20).

This version of Arabic Treebank part 3 - v3.2 represents a revision of
the ATB3 annotation for the full ATB part 3 (ANNAHAR) corpus.  The
full ATB3 corpus has been revised according to the new Arabic Treebank
annotation guidelines, both manually (all of the syntactic tree
annotation) and automatically (the MPG annotation).  The revised and
updated Arabic Treebank ATB part 3 consists of 599 newswire stories
from the An Nahar News Agency (previously released as Arabic Treebank:
Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis), LDC Catalog
No.:LDC2005T20).

This release includes all of the files that were previously released
to the GALE community as ATB3-v3.1, with additional quality control
added.  A number of inconsistencies in the 3.1 release data have been
corrected here.  In particular, additional corrections have been made
to certain POS tags, with resulting changes to the trees.  As a
result, additional clitics have been separated and some tokens that
were previously split incorrectly have now been merged.

This release includes significant new improvements over both the 
ATB3-v2.0 and ATB3-v3.1 releases in both the organization of the 
data and certain aspects of the annotation.  These improvements
are detailed in docs/readme-files.txt.

One file from the original ATB3-v2.0 release has been removed from the
corpus (ANN20020715.0063), as the text is an exact duplicate of
another file in the corpus (ANN20020715.0018), taking the total number
of files down from 600 to 599.

In this full ATB3 corpus, there are a total of 339,710 words/tokens
before clitics are split and 402,291 words/tokens after clitics are
separated for the treebank annotation.
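
As a quick check on these figures, the effect of clitic separation on
corpus size can be worked out directly from the two counts above.  The
following Python sketch uses only the numbers stated in this readme;
the variable names are illustrative.

```python
# Token counts as stated in this readme.
tokens_before_split = 339_710  # words/tokens before clitics are split
tokens_after_split = 402_291   # words/tokens after clitics are separated

added_tokens = tokens_after_split - tokens_before_split
percent_increase = 100 * added_tokens / tokens_before_split

print(f"tokens added by clitic separation: {added_tokens}")
print(f"relative increase: {percent_increase:.1f}%")
```

Clitic separation therefore adds 62,581 tokens, an increase of roughly
18.4% over the unsplit token count.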

This current release contains the part-of-speech/morphology/gloss
annotation and the syntactic treebank annotation of these files.  The
treebank annotation has been revised in accordance with the new Arabic
Treebank Annotation Guidelines.  In addition to a partial manual
revision, certain automatic changes have been made to the
part-of-speech/morphology/gloss tags.  

Two papers written about the revision and enhancement process for ATB3
are available on the LDC website:

Enhancing the Arabic Treebank: A Collaborative Effort toward
New Annotation Guidelines. Mohamed Maamouri, Ann Bies, Seth
Kulick.  In Proceedings of the Sixth International Conference on
Language Resources and Evaluation (LREC 2008), Marrakech, Morocco,
May 28-30, 2008.
Available:
Paper in PDF: http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf
Poster: http://papers.ldc.upenn.edu/LREC2008/Enhancement-poster.ppt

Diacritic Annotation in the Arabic Treebank and its Impact on
Parser Evaluation. Mohamed Maamouri, Seth Kulick, Ann Bies.  In
Proceedings of the Sixth International Conference on Language
Resources and Evaluation (LREC 2008), Marrakech, Morocco, May
28-30, 2008.
Available:
Paper in PDF: http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf
Poster: http://papers.ldc.upenn.edu/LREC2008/Diacritization-poster.ppt

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases: (a)
Part-of-Speech (=POS) tagging which divides the text into lexical
tokens, and gives relevant information about each token such as
lexical category, inflectional features, and a gloss (referred to as
POS for convenience, although it includes morphological and gloss
information not traditionally included with part-of-speech
annotation), and (b) Arabic Treebanking (=ArabicTB) which
characterizes the constituent structures of word sequences, provides
categories for each non-terminal node, and identifies null elements,
co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus, is
described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.
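
For readers who want to view transliterated text in Arabic script, the
following Python sketch converts Buckwalter symbols to Unicode.  It is
an illustration only: the table is the commonly published form of the
Buckwalter scheme, the function name buckwalter_to_arabic is
hypothetical, and the files in this corpus may use additional or
variant symbols.  The page cited above remains the authoritative
description.

```python
# Buckwalter symbols listed in Unicode code point order.
# U+0621..U+063A: hamza through ghain
_LETTERS_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg"
# U+0640..U+0652: tatweel, feh through yeh, then the diacritics
_LETTERS_2 = "_fqklmnhwYyFNKaui~o"

BUCKWALTER_TO_ARABIC = {}
for offset, symbol in enumerate(_LETTERS_1):
    BUCKWALTER_TO_ARABIC[symbol] = chr(0x0621 + offset)
for offset, symbol in enumerate(_LETTERS_2):
    BUCKWALTER_TO_ARABIC[symbol] = chr(0x0640 + offset)
BUCKWALTER_TO_ARABIC["`"] = "\u0670"  # dagger (superscript) alif
BUCKWALTER_TO_ARABIC["{"] = "\u0671"  # alif wasla


def buckwalter_to_arabic(text):
    """Map each Buckwalter symbol to Arabic script; pass other characters through."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


print(buckwalter_to_arabic("kitAb"))
```

Characters outside the table (digits, for example) are passed through
unchanged, so the function should only be applied to fields that are
known to contain transliterated Arabic.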

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic
Annotation Guidelines are available on the LDC website at
http://projects.ldc.upenn.edu/ArabicTreebank/.

2.2 Annotation Process

In the original annotation, Tim Buckwalter's morphological analyzer
(BAMA) was used to generate a candidate list of POS values for each
word/token and our annotators picked the appropriate one manually.
(Please note that some words do not exist in this lexicon.)  The POS
annotation task was simply to select the correct POS tag.  Once POS
annotation was done, the clitics were automatically separated based on
the POS selection.  We use the following tags for non-Arabic data: NOUN_NUM or
ADJ_NUM for numerical data, PUNC for punctuation, and FOREIGN or LATIN
for non-Arabic alphabetic data.

For the current version, we implemented automatic checks on the
part-of-speech tags, followed by further manual revision where
necessary, to ensure that the tags are consistent with the current
guidelines and with the LDC Standard Arabic Morphological Analyzer
currently in use (SAMA 3.1, LDC2009E73).  This is discussed in more
detail below, in Section 7, Data Validation.

The Arabic morphological tagset was then reduced to a smaller POS set,
and the files were automatically parsed.  The parses were then hand
corrected by human annotators.  

For this release, the syntactic treebank annotation from ATB3-v2.0 was
manually revised according to the new Arabic Treebank Annotation
Guidelines.  Significant changes were made to NP structure and to
classification of verbs with clausal arguments, along with
improvements to the annotation in general.  Annotators for this
process were Wigdan Mekki and Fatma Gaddeche.

The QC process consisted of a series of specific searches for several
types of potential inconsistency and annotation error.  Any errors
found in these searches were hand corrected.

3. Source Data Profile

3.1 Data Selection Process

This corpus of Arabic Treebank part 3 - v3.2 consists of 599 newswire
stories from the An Nahar News Agency.  There are a total of 339,710 
words/tokens before clitics are split and 402,291 words/tokens after
clitics are separated for the treebank annotation.

One file from the original ATB3-v2.0 release has been removed from the
corpus (ANN20020715.0063), as the text is an exact duplicate of
another file in the corpus (ANN20020715.0018), taking the total number
of files down from 600 to 599.

3.2 Data Sources and Epochs

Source texts were selected from An Nahar News Agency in the GIGAWORD
ARABIC TEXT CORPUS published by LDC in 2003 (LDC2003T12). For more
details, please see that release.

There are 599 stories (specified by the DOC ID), dated on the 15th day
of each month ranging from January to December in 2002.
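
The date of a story can be read off its DOC ID.  The Python sketch
below assumes that every ID follows the pattern of the two IDs quoted
in this readme (ANN20020715.0063 and ANN20020715.0018), that is, "ANN",
an eight-digit date, a period, and four digits.  Both the pattern and
the reading of the final four digits as a story sequence number are
assumptions, and parse_doc_id is a hypothetical helper name.

```python
import re
from datetime import date

# Assumed pattern, inferred only from IDs such as ANN20020715.0063.
DOC_ID_PATTERN = re.compile(r"^ANN(\d{4})(\d{2})(\d{2})\.(\d{4})$")


def parse_doc_id(doc_id):
    """Return (publication date, story sequence number) for a DOC ID."""
    match = DOC_ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"unexpected DOC ID format: {doc_id!r}")
    year, month, day, sequence = (int(group) for group in match.groups())
    return date(year, month, day), sequence


story_date, sequence = parse_doc_id("ANN20020715.0063")
print(story_date.isoformat(), sequence)
```

An ID that does not match the assumed pattern raises ValueError rather
than being silently skipped, which makes any deviation from the pattern
visible.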

4. Annotated Data Profile

This corpus of Arabic Treebank part 3 - v3.2 consists of 599 newswire
stories from the An Nahar News Agency.  There are a total of 339,710 
words/tokens before clitics are split and 402,291 words/tokens after
clitics are separated for the treebank annotation.

The source file IDs are listed in docs/file.ids.

We have also made some modifications to the format of the data in the
various .txt and .tree files, as extensively documented in
docs/readme-files.txt.

5. Data Directory Structure

The directory structure for this data (and for the /data directory) is
in docs/file.tbl.

In the docs/ directory:

- ag.dtd, metadata.dtd               - These are dtd files for the AG XML.
- file.ids                           - A list of file ids in the corpus.
- file.tbl                           - Directory structure for everything 
                                       in this package.
- readme-files.txt                   - An extensive description of the 
                                       modifications that have been made 
                                       to the format of the data in the 
                                       various .txt and .tree files.
- tags-count.txt                     - A list of the POS/morphological tags 
                                       after the clitics are separated and 
                                       after treebank annotation, along with 
                                       the number of occurrences of each tag.
- atb3-v3.0-taglist-conversion-to-PennPOS-forrelease.lisp 
                                     - Lisp code mapping the full 
                                       morphological tags to a much smaller 
                                       list, similar to the Penn POS tagset, 
                                       strictly for convenience.

6. File Format Description

An extensive description of the file formats (and the types of files
present for each of the IDs in docs/file.ids) is in
docs/readme-files.txt, including a description of the modifications
that have been made to the format of the data in the various .txt and
.tree files compared with previous releases.

7. Data Validation

The original ATB3-v2.0 data went through the following annotation procedure:

POS procedure:

- All words went through Tim Buckwalter's morphological analyzer.
- All words are included in the first pass of POS where annotators select 
  one out of many choices provided by the morphological analyzer.
- All files went through a second pass of POS annotation where annotators 
  review the annotation done in the previous POS pass.

TB procedure:

- Words/tokens from the POS annotation are processed to separate clitics 
  in preparation for TB annotation. After clitic separation, the number 
  of words/tokens increases from 339,710 to 402,291.
- The sentences were pre-parsed to improve productivity.
- Annotators went through at least two passes of annotation with the help 
  of diagnostic QC searches to catch potential patterns of annotation errors.

For this current ATB3-v3.2 release, the following additional steps were taken:

TB procedure:

- The syntactic annotation guidelines were significantly revised, in 
  particular with respect to noun phrase structure (idafa), verb phrase 
  structure for verbs taking clausal complements, verb phrase structure 
  for non-inflectional verbs, and the structure surrounding the function 
  words with new tokenization. The revised Penn Arabic Treebank (PATB) 
  Morphological and Syntactic Annotation Guidelines are available on the 
  LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.
- The treebank annotation from ATB3-v2.0 was manually revised according 
  to the new Arabic Treebank Annotation Guidelines.  Significant changes 
  were made to NP structure and to classification of verbs with clausal 
  arguments, along with improvements to the annotation in general.
- Additional QC searches were run on the full ATB3, including some 
  relating to the relation between POS tags and TB nodes, and the results 
  were hand corrected.

POS procedure:

- The part-of-speech/morphological guidelines were significantly revised, 
  in particular with respect to the classification and tokenization of 
  closed class function words, classes of nouns (the addition of 
  NOUN_QUANT and NOUN_NUM), classes of adjectives (the addition of 
  ADJ_COMP and ADJ_NUM), and classes of non-inflectional verbs (the 
  addition of PSEUDO_VERB and VERB). The revised Penn Arabic Treebank 
  (PATB) Morphological and Syntactic Annotation Guidelines are available 
  on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.
- We have made various further automatic changes to the POS tags, 
  described below. 
- A limited number of manual corrections were made to the POS tags for 
  this version as well.

The annotators on this project were Fatma Gaddeche (Lead Annotator),
Ichraf Amghouz, Luma Ateyah, Basma Bouziri, Fatima Chebchoub, Rachida
Fathallah, Tasneem Ghandour, Badia Laadioui, Niama Laadioui, and
Wigdan Mekki.

Quality assurance & annotation checking for this release:

The current version of SAMA, 3.1, has significant differences from the
version of SAMA/BAMA current at the time this Treebank was originally
annotated.  Therefore, the goal has been to update the morphological
annotations to be consistent both with SAMA 3.1 and with the correct
part-of-speech/tree interaction as discussed in the guidelines.
"Consistency" here means that the morphological solution for a token
in the treebank is also one of the solutions for that token in SAMA
3.1.
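
The notion of consistency defined above amounts to a set-membership
test.  The Python sketch below illustrates it under stated
assumptions: the analyzer is stood in for by a plain dictionary, the
function names are hypothetical, and the tokens and their solutions
are invented placeholders rather than data from the corpus or from
SAMA.

```python
def is_consistent(treebank_solution, analyzer_solutions):
    """True if the treebank's solution is among the analyzer's solutions."""
    return treebank_solution in set(analyzer_solutions)


def find_discrepancies(treebank_tokens, analyzer_lookup):
    """Return the (token, solution) pairs that the analyzer does not license.

    treebank_tokens: iterable of (token, solution) pairs.
    analyzer_lookup: dict mapping token -> list of candidate solutions.
    """
    return [
        (token, solution)
        for token, solution in treebank_tokens
        if not is_consistent(solution, analyzer_lookup.get(token, []))
    ]


# Invented placeholder data, not taken from the corpus or from SAMA.
analyzer_lookup = {"tok1": ["NOUN_NUM", "ADJ_NUM"], "tok2": ["PUNC"]}
treebank_tokens = [("tok1", "ADJ_NUM"), ("tok2", "NOUN_NUM"), ("tok3", "PUNC")]

print(find_discrepancies(treebank_tokens, analyzer_lookup))
```

In this sketch a token with no entry in the lookup is also reported as
a discrepancy, since no analyzer solution can match it.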

For the initial revision of this corpus, each treebank token
mentioned, explicitly or implicitly (e.g., all the ADJ_COMP words,
determined by an examination of vowel patterns and then manual
filtering) in the morphological guidelines was annotated with a list
of its possible tags.  First, for tokens that had only one possible
tag, the current tag was changed to that tag where necessary.
Second, for those words with more than one tag, we focused on some of
the most common and important cases, and did manual annotation of
those tokens.  In some cases, it was also possible to assign the tag
based on the tree context, since the tree annotation had already been
done.
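
The two-step procedure described in this paragraph can be sketched in
Python as follows.  This is an illustration, not the code used for the
release: the function name revise_tags, the data structures, and the
example tokens are all hypothetical, and the use of tree context to
choose a tag is not modeled.

```python
def revise_tags(tokens, possible_tags):
    """Split tokens into automatically retagged ones and ones needing review.

    tokens: list of (token, current_tag) pairs.
    possible_tags: dict mapping token -> list of tags allowed by the guidelines.
    """
    revised, needs_manual_review = [], []
    for token, current_tag in tokens:
        candidates = possible_tags.get(token)
        if candidates is None:
            # Token not mentioned in the guidelines: leave it unchanged.
            revised.append((token, current_tag))
        elif len(candidates) == 1:
            # Only one possible tag: assign it, whatever the current tag is.
            revised.append((token, candidates[0]))
        else:
            # Several possible tags: keep the current tag and flag the token.
            revised.append((token, current_tag))
            needs_manual_review.append(token)
    return revised, needs_manual_review


# Invented placeholder data, not taken from the corpus.
possible_tags = {"tok1": ["ADJ_COMP"], "tok2": ["NOUN_QUANT", "NOUN_NUM"]}
tokens = [("tok1", "NOUN_NUM"), ("tok2", "NOUN_NUM"), ("tok3", "PUNC")]

revised, needs_manual_review = revise_tags(tokens, possible_tags)
print(revised)
print(needs_manual_review)
```

Here tok1 is retagged automatically, tok2 is queued for manual
annotation, and tok3 is left as it was.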

This revision has now been carried further by explicitly testing every
token in the treebank against the possible SAMA solutions for that
token.  Where there have been discrepancies, the results have been
manually inspected and in many cases changed.  See
docs/readme-files.txt for a detailed description of this procedure
and docs/errata.txt for a listing of some of the remaining
discrepancies between the treebank and SAMA.

8. DTDs

There are two DTDs for the AG XML files: docs/ag.dtd and
docs/metadata.dtd.

9. Copyright Information

Portions © 2002 An Nahar, © 2003, 2004, 2005, 2007, 2008,
2009, 2010 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel: 

Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
Ann Bies, bies@ldc.upenn.edu
Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on January 28, 2010 by Ann Bies.