Arabic Treebank part 3 - v3.0

CatalogID: LDC2008E22

Release date: August 20, 2008

Linguistic Data Consortium

Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Fatma Gaddeche, Wigdan Mekki, Sondos Krouna, Basma Bouziri


1. Introduction

This version of Arabic Treebank part 3 - v3.0 represents the first revision of the ATB3 annotation for the full ATB part 3 (ANNAHAR) corpus. The full ATB3 corpus has been revised according to the new Arabic Treebank annotation guidelines, both manually (all of the syntactic tree annotation) and automatically (the MPG annotation). The revised and updated Arabic Treebank ATB part 3 consists of 599 newswire stories from the An Nahar News Agency (previously released as Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis), LDC Catalog No.: LDC2005T20). This release includes all of the files that were previously released to the GALE community as ATB3(a)-v2.6 and ATB3(ab)-v2.7, with additional quality control added, as well as the remaining ATB3 files. One file from the original ATB3-v2.0 release (ANN20020715.0063) has been removed from the corpus because its text is an exact duplicate of another file in the corpus (ANN20020715.0018), taking the total number of files down from 600 to 599.

In this full ATB3 corpus, there are a total of 339,722 words/tokens before clitics are split and 401,122 words/tokens after clitics are separated for the treebank annotation.
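
As a quick check on these figures, the following Python sketch computes how many tokens clitic separation adds. Only the two counts come from this release; the variable names are illustrative.

```python
# Token counts reported for the full ATB3 corpus in this release.
tokens_before_split = 339_722  # words/tokens before clitics are split
tokens_after_split = 401_122   # words/tokens after clitic separation

# Number of extra tokens created by splitting clitics off their hosts.
added_tokens = tokens_after_split - tokens_before_split

# Relative growth of the token count, as a percentage of the original count.
growth_percent = 100 * added_tokens / tokens_before_split

print(added_tokens)              # 61400
print(round(growth_percent, 1))  # 18.1
```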

This current release contains the part-of-speech/morphology/gloss annotation and the syntactic treebank annotation of these files. The treebank annotation has been revised in accordance with the new Arabic Treebank Annotation Guidelines. In addition to a partial manual revision, certain automatic changes have been made to the part-of-speech/morphology/gloss tags. A complete manual revision of the part-of-speech/morphology/gloss phase of annotation is planned for a future release.

Two papers written about the revision and enhancement process for ATB3 are available on the LDC website:

  • Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines. Mohamed Maamouri, Ann Bies, Seth Kulick. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
  • Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation. Mohamed Maamouri, Seth Kulick, Ann Bies. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases: (a) Part-of-Speech (=POS) tagging which divides the text into lexical tokens, and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and (b) Arabic Treebanking (=ArabicTB) which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.
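
For readers who want to view transliterated tokens in Arabic script, here is a minimal Python sketch of a converter. It assumes the commonly published Buckwalter table (one ASCII symbol per Arabic code point); the page cited above remains the authoritative description, and the function names are illustrative rather than part of this release. Characters outside the table are passed through unchanged.

```python
# Buckwalter ASCII symbols listed in Unicode code point order.
# First run covers U+0621..U+063A (hamza through ghain).
_RUN_FROM_0621 = "'|>&<}AbptvjHxd*rzs$SDTZEg"
# Second run covers U+0640..U+0652 (tatweel, remaining letters, diacritics).
_RUN_FROM_0640 = "_fqklmnhwYyFNKaui~o"

BW_TO_ARABIC = {}
for offset, symbol in enumerate(_RUN_FROM_0621):
    BW_TO_ARABIC[symbol] = chr(0x0621 + offset)
for offset, symbol in enumerate(_RUN_FROM_0640):
    BW_TO_ARABIC[symbol] = chr(0x0640 + offset)
BW_TO_ARABIC["`"] = "\u0670"  # superscript (dagger) alif
BW_TO_ARABIC["{"] = "\u0671"  # alif wasla

# The table is one-to-one, so it can simply be inverted.
ARABIC_TO_BW = {arabic: bw for bw, arabic in BW_TO_ARABIC.items()}


def buckwalter_to_arabic(text):
    """Convert Buckwalter transliteration to Arabic script."""
    return "".join(BW_TO_ARABIC.get(ch, ch) for ch in text)


def arabic_to_buckwalter(text):
    """Convert Arabic script back to Buckwalter transliteration."""
    return "".join(ARABIC_TO_BW.get(ch, ch) for ch in text)


# "limA*A" is a token discussed in section 7 of this document.
word = buckwalter_to_arabic("limA*A")
print(word == "\u0644\u0650\u0645\u0627\u0630\u0627")  # True
print(arabic_to_buckwalter(word))                      # limA*A
```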

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.

2.2 Annotation Process

Tim Buckwalter's morphological analyzer (BAMA) was used to generate a candidate list of POS values for each word/token, and our annotators manually picked the appropriate one. (Please note that some words do not exist in this lexicon.) The POS annotation task is just to select the correct POS tag. Once POS annotation was complete, we automatically separated the clitics based on the POS selection. We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for numerical data, PUNC for punctuation, and FOREIGN or LATIN for non-Arabic alphabetic data. For the current version, we implemented automatic checks on the part-of-speech tags, with further manual revision where necessary, to ensure the consistency of the part-of-speech tags with the current guidelines. This is discussed in more detail below, in section 7, Data Validation.

The Arabic morphological tagset was then reduced to a smaller POS set, and the files were automatically parsed. The parses were then hand corrected by human annotators.

For this release, the syntactic treebank annotation from ATB3-v2.0 was manually revised according to the new Arabic Treebank Annotation Guidelines. Significant changes were made to NP structure and to classification of verbs with clausal arguments, along with improvements to the annotation in general. Annotators for this process were Wigdan Mekki and Fatma Gaddeche.

The QC process consists of a series of specific searches for several types of potential inconsistency and annotation error. Any errors found in these searches were hand corrected.

3. Source Data Profile

3.1 Data Selection Process

This corpus of Arabic Treebank part 3 - v3.0 consists of 599 newswire stories from the An Nahar News Agency. There are a total of 339,722 words/tokens before clitics are split and 401,122 words/tokens after clitics are separated for the treebank annotation.

One file from the original ATB3-v2.0 release has been removed from the corpus (ANN20020715.0063), as the text is an exact duplicate of another file in the corpus (ANN20020715.0018), taking the total number of files down from 600 to 599.

3.2 Data Sources and Epochs

Source texts were selected from An Nahar News Agency in the GIGAWORD ARABIC TEXT CORPUS published by LDC in 2003. For more details, please see that release. There are 599 stories (specified by the DOC ID), dated on the 15th day of each month ranging from Jan to Dec in 2002.
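
The DOC IDs quoted in this document (for example ANN20020715.0018) appear to encode the source, the date, and a story number. The Python sketch below parses an ID on that assumption and checks it against the epoch stated above. The pattern is inferred from the two IDs quoted in this README, not from a formal specification, and the function names are illustrative.

```python
import re
from datetime import date

# Pattern inferred from the IDs quoted in this README (an assumption):
# "ANN" + YYYYMMDD + "." + four-digit story number.
DOC_ID_PATTERN = re.compile(r"^ANN(\d{4})(\d{2})(\d{2})\.(\d{4})$")


def parse_doc_id(doc_id):
    """Return (publication date, story number) for an ATB3 document ID."""
    match = DOC_ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"unexpected document ID: {doc_id!r}")
    year, month, day, story = (int(group) for group in match.groups())
    return date(year, month, day), story


def matches_stated_epoch(doc_id):
    """Check the stated epoch: the 15th day of a month in 2002."""
    published, _ = parse_doc_id(doc_id)
    return published.year == 2002 and published.day == 15


print(parse_doc_id("ANN20020715.0018"))          # (datetime.date(2002, 7, 15), 18)
print(matches_stated_epoch("ANN20020715.0018"))  # True
```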

4. Annotated Data Profile

This corpus of Arabic Treebank part 3 - v3.0 consists of 599 newswire stories from the An Nahar News Agency. There are a total of 339,722 words/tokens before clitics are split and 401,122 words/tokens after clitics are separated for the treebank annotation.

The source file IDs are listed in docs/file.ids.

We have also somewhat modified the format of the data in the various .txt and .tree files, as extensively documented in docs/readme-files.txt.

5. Data Directory Structure

The directory structure for this data (and for the /data directory) is in docs/file.tbl.

In the docs/ directory:

  • ag.dtd, metadata.dtd - These are dtd files for the AG XML.
  • file.ids - A list of file ids in the corpus.
  • file.tbl - Directory structure for everything in this package.
  • readme-files.txt - An extensive description of the modifications that have been made to the format of the data in the various .txt and .tree files.
  • tags-count.txt - A list of the POS/morphological tags after the clitics are separated and after treebank annotation, along with the number of occurrences of each tag.
  • atb3-v3.0-taglist-conversion-to-PennPOS-forrelease.lisp - Lisp code mapping the full morphological tags to a much smaller list, similar to the Penn POS tagset, strictly for convenience.

6. File Format Description

An extensive description of the file formats (and the types of files present for each of the IDs in docs/file.ids) is in docs/readme-files.txt, including a description of the modifications that have been made to the format of the data in the various .txt and .tree files compared with previous releases.

7. Data Validation

The original ATB3-v2.0 data went through the following annotation procedure:

POS procedure:

  • All words went through Tim Buckwalter's morphological analyzer.
  • All words were included in the first pass of POS annotation, in which annotators selected one of the choices provided by the morphological analyzer.
  • All files went through a second pass of POS annotation, in which annotators reviewed the annotation done in the first pass.

TB procedure:

  • Words/tokens from the POS annotation are processed to separate clitics in preparation for TB annotation. After clitic separation, the number of words/tokens increases from 339,722 to 401,122.
  • The sentences were pre-parsed to improve productivity.
  • Annotators went through at least two passes of annotation with the help of diagnostic QC searches to catch potential patterns of annotation errors.

For this current ATB3-v3.0 release, the following additional steps were taken:

TB procedure:

  • The syntactic annotation guidelines were significantly revised, in particular with respect to noun phrase structure (idafa), verb phrase structure for verbs taking clausal complements, verb phrase structure for non-inflectional verbs, and the structure surrounding the function words with new tokenization. The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.
  • The treebank annotation from ATB3-v2.0 was manually revised according to the new Arabic Treebank Annotation Guidelines. Significant changes were made to NP structure and to classification of verbs with clausal arguments, along with improvements to the annotation in general.
  • Additional QC searches were run on the full ATB3, including some relating to the relation between POS tags and TB nodes, and the results were hand corrected.

POS procedure:

  • The part-of-speech/morphological guidelines were significantly revised, in particular with respect to the classification and tokenization of closed class function words, classes of nouns (the addition of NOUN_QUANT and NOUN_NUM), classes of adjectives (the addition of ADJ_COMP and ADJ_NUM), and classes of non-inflectional verbs (the addition of PSEUDO_VERB and VERB). The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.
  • We have made various further automatic changes to the POS tags, described below.
  • A limited number of manual corrections were made to the POS tags for this version as well.

The annotators on this project were Fatma Gaddeche (Lead Annotator), Ichraf Amghouz, Luma Ateyah, Basma Bouziri, Fatima Chebchoub, Rachida Fathallah, Tasneem Ghandour, Badia Laadioui, Niama Laadioui, and Wigdan Mekki.

Quality assurance & annotation checking for this release:

The new morphological annotation guidelines describe the possible tokenizations and POS tags for a great number of words in the corpus -- roughly speaking, for all the words except those in the open categories of noun, adjective, proper noun, abbreviation, and so on. For this release of the corpus, our goal was to ensure that all such words (i.e., those not in the open categories) have POS tags that are consistent with the new guidelines.

The notion of ensuring consistency had two phases to it. First, each treebank token mentioned, explicitly or implicitly (e.g., all the ADJ_COMP words, determined by an examination of vowel patterns and then manual filtering) in the morphological guidelines was annotated with a list of its possible tags. For tokens that had only one possible tag, of course the current tag was modified, if necessary, to be that tag. Second, for those words with more than one tag, we focused on some of the most common and important cases, and did manual annotation of those tokens. In some cases, it was also possible to assign the tag based on the tree context, since the tree annotation had already been done.
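
The two phases can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual tooling: the function name, the data layout, the placeholder tag names (TAG_A and so on), and the open-class example word "kitAb" are all hypothetical. Tokens with exactly one permitted tag are retagged automatically; tokens with several permitted tags are queued for manual annotation; tokens not listed in the guidelines are left alone.

```python
def apply_tag_constraints(tokens, allowed_tags):
    """Split tokens into automatic fixes and cases needing manual review.

    tokens: list of (form, current_tag) pairs, in corpus order.
    allowed_tags: dict mapping a form to the set of tags the guidelines permit.
    Returns (retagged tokens, indices of tokens needing manual annotation).
    """
    retagged = []
    needs_review = []
    for index, (form, current_tag) in enumerate(tokens):
        options = allowed_tags.get(form)
        if options is None:
            # Form is not listed in the guidelines (open class): keep its tag.
            retagged.append((form, current_tag))
        elif len(options) == 1:
            # Exactly one tag is possible, so assign it whatever the old tag was.
            (only_tag,) = options
            retagged.append((form, only_tag))
        else:
            # Several tags are possible: keep the tag and queue for manual work.
            retagged.append((form, current_tag))
            needs_review.append(index)
    return retagged, needs_review


# Hypothetical constraint list with placeholder tag names.
allowed = {"limA*A": {"TAG_A"}, "mA": {"TAG_B", "TAG_C"}}
tokens = [("limA*A", "TAG_OLD"), ("mA", "TAG_B"), ("kitAb", "TAG_N")]

retagged, review = apply_tag_constraints(tokens, allowed)
print(retagged)  # [('limA*A', 'TAG_A'), ('mA', 'TAG_B'), ('kitAb', 'TAG_N')]
print(review)    # [1]
```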

We describe here the two main aspects of the first phase of the reannotation of the tokens.

1. In some cases, there was a contradiction between the possible POS values for a word and the range of available annotations in the corpus. That is, the morphological annotation for each word in ATB contains information coming from BAMA, the Buckwalter analyzer, and this includes several pieces of information that together uniquely identify a token -- the vocalized form, the lemma, and the gloss. The addition or subtraction of possible tags for a word raises the question of what lemma and gloss to use for the new tags when there are no appropriate entries in the current version of BAMA. We resolved these questions on a case-by-case basis, often modifying or adding entries to BAMA. (And on a more superficial level, the transliterated spellings of token words in the guidelines do not always correspond to the spellings used in the corpus.)

In short, to a large extent, the process of refining the given morphological guidelines with the precision necessary to allow the automatic (to the extent possible) modification of the corpus was the same problem as updating BAMA to reflect the new linguistic analyses. Aside from the quality assurance in the current release, the full benefits of this work will be realized with the future release of an updated edition of the BAMA analyzer.

2. As described elsewhere in this documentation, in the original ATB3-v2.0 phase of annotation the solutions from BAMA for a token were often split into separate tokens for treebank annotation. Often this concerns clear cases such as the prefix "wa" and the pronoun suffixes and so on.

However, there are also other cases in which the morphological guidelines now take different decisions as to whether the original tokens should have been split. A number of tokens that were previously kept whole are now split into separate tokens, and others that were previously split are now kept as one token. It was not possible to bring the tokenization into line with the current guidelines only by examining individual tokens in the treebank, since such tokens may themselves be part of a larger original token.

For example, while "limA*A" formerly existed in the ATB3-v2.0 corpus both as a single token and also split into two tokens ("li" and "mA*A"), in the revised morphological guidelines it is now treated as one token only. However, the annotation as it existed in the ATB3-v2.0 corpus for the two-token analysis had already split up the word, and the individual treebank tokens "li" and "mA*A" were both acceptable tokens unto themselves. It was only in the context of being part of a larger original word that it could be recognized that they needed to be merged back together for this revised release.

Therefore, we created a version of the corpus which associated each original token from the .sgm file (i.e., corresponding to the IS_TRANS (input string) word in the data/pos/before-treebank files as described in the readme-files.txt documentation) with the one or more treebank tokens that together make up that original token. We then modified the tokenization based on that correlation. This also helped to identify the possible tags in cases where the range of POS tags is restricted because the token is part of a larger token (e.g., "mA" in the context of "bi+mA" has a smaller range of possible tags compared to "mA" occurring by itself).
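
A minimal Python sketch of this retokenization step is given below. It assumes the alignment is available as a list of (original token, treebank tokens) pairs, together with a set of original tokens that the guidelines now treat as a single token. The function name and data layout are hypothetical and do not reflect the format of the released files; the example forms are taken from the discussion above.

```python
def retokenize(alignment, keep_whole):
    """Rebuild the treebank token list from an original-token alignment.

    alignment: list of (original_token, [treebank tokens]) pairs, in text order.
    keep_whole: set of original tokens that must remain a single token.
    """
    result = []
    for original, pieces in alignment:
        if original in keep_whole and len(pieces) > 1:
            # The pieces look valid on their own; only the original token
            # shows that they have to be merged back together.
            result.append(original)
        else:
            # Keep the existing split (or the existing single token).
            result.extend(pieces)
    return result


alignment = [
    ("limA*A", ["li", "mA*A"]),  # previously split, now one token
    ("bimA", ["bi", "mA"]),      # remains split in this sketch
]

print(retokenize(alignment, keep_whole={"limA*A"}))
# ['limA*A', 'bi', 'mA']
```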

8. DTDs

Two DTDs are included for the AG XML files: ag.dtd and metadata.dtd, both in the docs/ directory.

9. Copyright Information

Portions © 2002 An Nahar, © 2003, 2004, 2005, 2007, 2008 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

  • Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
  • Ann Bies, bies@ldc.upenn.edu
  • Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on August 20, 2008 by Ann Bies.