Arabic Treebank - Broadcast News V1.0
CatalogID: LDC2012T07
Release date: April 16, 2012
Linguistic Data Consortium
Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna, Dalila
         Tabassi, Michael Ciul

1. Introduction

This Arabic Treebank corpus consists of
part-of-speech/morphological annotation and syntactic tree annotation
for 432,976 source tokens (517,080 tree tokens, after clitic
splitting) of transcribed Arabic broadcast news speech.  All of this
data was previously released as subcorpora in earlier versions to the
GALE community; this publication consolidates the Arabic Treebank
broadcast news data, and includes final corrections to some tokens and
part-of-speech tags.

This publication contains part-of-speech/morphology/gloss annotation
and syntactic treebank annotation that is in accordance with the Penn
Arabic Treebank (PATB) Morphological and Syntactic Annotation
Guidelines, available on the LDC website at
http://projects.ldc.upenn.edu/ArabicTreebank/.  These are the same
annotation guidelines used for the updated and revised newswire
corpora that have been recently released (Arabic Treebank Part 1 -
V4.1, CatalogID: LDC2010T13; Arabic Treebank Part 2 v 3.1, CatalogID:
LDC2011T09; Arabic Treebank part 3 - v3.2, CatalogID: LDC2010T08).

The transcription of this broadcast news data was produced by LDC.

This release conforms to the format conventions initiated with the
releases of Arabic Treebank part 5 - v1.0, LDC2009E72 (ATB5) and
Arabic Treebank Part 6 V1.0 - GALE Phase 4 dev09, LDC2009E108 (ATB6),
which are detailed in docs/readme-files.txt and in the
docs/KulickBiesMaamouri-LREC2010.pdf paper:

Consistent and Flexible Integration of Morphological Annotation
in the Arabic Treebank. Seth Kulick, Ann Bies and Mohamed
Maamouri. In Proceedings of the Seventh International Conference on
Language Resources and Evaluation (LREC 2010), Malta May 19-21,
2010.  Available: docs/KulickBiesMaamouri-LREC2010.pdf

In addition, two papers written about the revision and enhancement
process of the newswire corpora that resulted in the revised
annotation guidelines are available on the LDC website:

Enhancing the Arabic Treebank: A Collaborative Effort toward
New Annotation Guidelines. Mohamed Maamouri, Ann Bies, Seth
Kulick.  In Proceedings of the Sixth International Conference on
Language Resources and Evaluation (LREC 2008), Marrakech, Morocco,
May 28-30, 2008.  Available:
Paper: http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf
Poster: http://papers.ldc.upenn.edu/LREC2008/Enhancement-poster.ppt

Diacritic Annotation in the Arabic Treebank and its Impact on
Parser Evaluation. Mohamed Maamouri, Seth Kulick, Ann Bies.  In
Proceedings of the Sixth International Conference on Language
Resources and Evaluation (LREC 2008), Marrakech, Morocco, May
28-30, 2008.  Available:
Paper: http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf
Poster: http://papers.ldc.upenn.edu/LREC2008/Diacritization-poster.ppt

This corpus is part of an on-going effort to produce parallel Arabic
and English Treebanks at LDC.  The files in this release are parallel
with the same file IDs in the English Translation Treebank Broadcast
News corpus, which will be published in the near future (the subcorpora
of which are currently available to the GALE community).

This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content
of this publication does not necessarily reflect the position or the
policy of the Government, and no official endorsement should be
inferred.

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases: (a)
Part-of-Speech (=POS) tagging, which divides the text into lexical
tokens and gives relevant information about each token, such as
lexical category, inflectional features, and a gloss (referred to as
POS for convenience, although it includes morphological and gloss
information not traditionally included with part-of-speech
annotation), and (b) Arabic Treebanking (=ArabicTB), which
characterizes the constituent structures of word sequences, provides
categories for each non-terminal node, and identifies null elements,
co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus,
is described at http://www.qamus.org/transliteration.htm.
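
As a rough illustration of the transliteration scheme, the following
Python sketch converts between Buckwalter symbols and Arabic Unicode
characters.  It covers the standard Buckwalter symbol table only.  The
function names are illustrative and are not part of this release, and
some file formats in the corpus may use escaped or variant symbols, so
docs/readme-files.txt remains the authority on what the files contain.

```python
# Standard Buckwalter symbol -> Unicode code point.
# Names in this sketch are illustrative, not part of the corpus tooling.
_BW_CODEPOINTS = {
    "'": 0x0621, "|": 0x0622, ">": 0x0623, "&": 0x0624, "<": 0x0625,
    "}": 0x0626, "A": 0x0627, "b": 0x0628, "p": 0x0629, "t": 0x062A,
    "v": 0x062B, "j": 0x062C, "H": 0x062D, "x": 0x062E, "d": 0x062F,
    "*": 0x0630, "r": 0x0631, "z": 0x0632, "s": 0x0633, "$": 0x0634,
    "S": 0x0635, "D": 0x0636, "T": 0x0637, "Z": 0x0638, "E": 0x0639,
    "g": 0x063A, "_": 0x0640, "f": 0x0641, "q": 0x0642, "k": 0x0643,
    "l": 0x0644, "m": 0x0645, "n": 0x0646, "h": 0x0647, "w": 0x0648,
    "Y": 0x0649, "y": 0x064A, "F": 0x064B, "N": 0x064C, "K": 0x064D,
    "a": 0x064E, "u": 0x064F, "i": 0x0650, "~": 0x0651, "o": 0x0652,
    "`": 0x0670, "{": 0x0671,
}

BW_TO_AR = {bw: chr(cp) for bw, cp in _BW_CODEPOINTS.items()}
AR_TO_BW = {ar: bw for bw, ar in BW_TO_AR.items()}


def buckwalter_to_arabic(text):
    """Map each Buckwalter symbol to Arabic; pass other characters through."""
    return "".join(BW_TO_AR.get(ch, ch) for ch in text)


def arabic_to_buckwalter(text):
    """Inverse mapping; characters outside the table pass through unchanged."""
    return "".join(AR_TO_BW.get(ch, ch) for ch in text)


# Round trip on a transliterated word.
example = buckwalter_to_arabic("kitAb")
assert arabic_to_buckwalter(example) == "kitAb"
```

Because characters outside the table pass through unchanged, the
functions should be applied only to strings known to be transliterated
Arabic; applying them to Latin-script material such as tag names would
corrupt it.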

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic
Annotation Guidelines are available on the LDC website at
http://projects.ldc.upenn.edu/ArabicTreebank/.

Guidelines for the annotation of speech data (based on the English
Treebank Switchboard Guidelines for speech effects) are also
available at http://projects.ldc.upenn.edu/ArabicTreebank/.

2.2 Annotation Process

The LDC Standard Arabic Morphological Analyzer LDC2009E73 (SAMA 3.1)
was used to generate a candidate list of POS values for each
word/token and our annotators picked the appropriate one manually.
(Please note that some words do not exist in this lexicon.)  The POS
annotation task is to select the correct POS tag.  Once POS is done,
we automatically separate the clitics based on the POS selection.  We
use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for
numerical data, PUNC for punctuation, and FOREIGN or LATIN for
non-Arabic alphabetic data.  Dialect forms are given the POS tag
DIALECT.
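
The special-purpose tags listed above can be summarized as a small
lookup table.  The Python sketch below is only an illustration: the
category labels and the function name are assumptions made for this
example, and it matches tags exactly as they are named in this
paragraph, without handling any longer or compound tag strings that may
appear in the data.

```python
# Tags named in the paragraph above, grouped by the kind of material
# they mark. Category labels are illustrative, not corpus terminology.
SPECIAL_TAG_CATEGORIES = {
    "NOUN_NUM": "numerical",
    "ADJ_NUM": "numerical",
    "PUNC": "punctuation",
    "FOREIGN": "non-Arabic alphabetic",
    "LATIN": "non-Arabic alphabetic",
    "DIALECT": "dialect",
}


def special_tag_category(pos_tag):
    """Return the category for a special-purpose tag, or None otherwise."""
    return SPECIAL_TAG_CATEGORIES.get(pos_tag)


# Example: keep only the tags that mark special material.
tags = ["PUNC", "LATIN", "DIALECT", "SOME_OTHER_TAG"]
special = [t for t in tags if special_tag_category(t) is not None]
```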

We then implemented automatic checks on the part-of-speech tags with
consequent further manual revision when necessary to ensure the
consistency of the part-of-speech tags with the current guidelines.

The Arabic morphological tagset was then reduced to a smaller POS set,
and the files were automatically parsed.  The parses were then hand
corrected by human annotators according to the Arabic Treebank
Annotation Guidelines.

The QC process consists of a series of specific searches for several
types of potential inconsistency and annotation error.  Any errors
found in these searches were hand corrected.

The annotators for this project were Luma Ateyah, Olfa Bayouth, Maha
Ben Hadj Aleya, Sameh Benna, Asma Berrima, Basma Bouziri, Faiez Dhieb,
Rachida Fathallah, Fatma Gaddeche, Esma Maamouri Ghrib, Aicha Graja,
Sondos Krouna, Badia Laadioui, Leila Laghrissi, Soumeya Mekki, Wigdan
Mekki, Mouna Rezig, and Dalila Tabassi.

3. Source Data Profile

3.1 Data Selection Process

This corpus of Arabic Treebank - Broadcast News consists of 120
broadcast news (BN) stories.  There are a total of 432,976 source
tokens before clitics are split and 517,080 tree tokens after clitics
are separated for the treebank annotation.

3.2 Data Sources and Epochs

The data consists of Arabic broadcast news stories dating from
2005-2008 (from Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al
Baghdadya News, Al Fayha, Al Hurra, Al Iraqiyah, Aljazeera, Al
Ordiniyah, Al Sharqiya, Dubai News, Iraqiyah, Kuwait TV, Oman TV, PAC,
Ltd., Saudi TV, Sawa News, and Syria TV).

4. Annotated Data Profile

This data consists of 120 broadcast news (BN) stories.  There are a
total of 432,976 source tokens before clitics are split and 517,080
tree tokens after clitics are separated for the treebank annotation,
all of which have been annotated for morphology/part-of-speech and
syntactic structure.
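
The relationship between the two token counts can be checked with a
short calculation.  The Python sketch below is illustrative only; the
variable names are not part of the release, and the per-story figures
are simple means over the 120 stories.

```python
SOURCE_TOKENS = 432_976   # tokens before clitics are split
TREE_TOKENS = 517_080     # tokens after clitics are separated
STORIES = 120             # broadcast news stories in the corpus

# Extra tokens created by separating clitics from their host words.
added_by_splitting = TREE_TOKENS - SOURCE_TOKENS

# Average number of tree tokens produced per source token.
expansion_ratio = TREE_TOKENS / SOURCE_TOKENS

# Mean story length under each tokenization.
mean_source_per_story = SOURCE_TOKENS / STORIES
mean_tree_per_story = TREE_TOKENS / STORIES

print(f"tokens added by clitic splitting: {added_by_splitting:,}")
print(f"tree tokens per source token:     {expansion_ratio:.3f}")
print(f"mean source tokens per story:     {mean_source_per_story:.1f}")
print(f"mean tree tokens per story:       {mean_tree_per_story:.1f}")
```

Clitic splitting adds 84,104 tokens, an increase of about 19.4% over
the source token count.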

The source file IDs are listed in docs/file.ids.  A listing of all of
the files in this release can be found in docs/file.tbl.

The data formats, including the integrated format, are documented in
docs/readme-files.txt and in the paper at
docs/KulickBiesMaamouri-LREC2010.pdf.

5. Data Directory Structure

The directory structure for this data (and for the /data directory) is
in docs/file.tbl.

In the docs/ directory:

- KulickBiesMaamouri-LREC2010.pdf    - Paper describing data formats and the 
     integration of Treebank and SAMA tokens.
- ag-1.1.dtd                         - This is the dtd file for the AG XML.
- errata.txt                         - Lists errata and the URL where any 
     additional errata from this corpus will be listed.
- file.ids                           - A list of file ids in the corpus.
- file.tbl                           - Directory structure for everything in 
     this package.
- readme-files.txt                   - An extensive description of the new 
     integrated format, also describes the various formats of the data.
- tags-count.txt                     - A list of the POS/morphological tags 
     after the clitics are separated and after treebank annotation, along 
     with the number of occurrences of each tag.
- atb-bn-taglist-conversion-to-PennPOS-forrelease.lisp - Lisp code 
     mapping the full morphological tags to a much smaller list, similar to 
     the Penn POS tagset, strictly for convenience.

6. File Format Description

A description of the file formats (and the types of files present for
each of the IDs in docs/file.ids) is in docs/readme-files.txt and in
docs/KulickBiesMaamouri-LREC2010.pdf, including a description of the
modifications that have been made to the format of the data in the
various .txt and .tree files compared with ATB releases prior to ATB5
and detailed information on the integrated format.

7. Data Validation

The data went through the following annotation procedure:

POS procedure:

- All words were run through the morphological analyzer.
- All words were included in the first pass of POS annotation, in which
  annotators selected one of the choices provided by the morphological
  analyzer.
- All files went through a second pass of POS annotation, in which
  annotators reviewed the annotation done in the first pass.

TB procedure:

- Words/tokens from the POS annotation were processed to separate clitics
  in preparation for TB annotation. After clitic separation, the number
  of words/tokens increased from 432,976 to 517,080.
- The sentences were pre-parsed to improve productivity, and the parses
  were then hand corrected.
- All files went through a second pass of TB annotation, in which
  annotators reviewed the annotation done in the previous TB pass.
- Annotators went through a final pass of annotation with the help of
  diagnostic QC searches to catch potential patterns of annotation errors.

Quality assurance & annotation checking for this release:

Every token in the treebank has been explicitly tested against the
possible SAMA 3.1 solutions for that token.  Where there have been
discrepancies, the results have been manually inspected and in many
cases changed.  See docs/readme-files.txt for a detailed description
of this procedure and docs/errata.txt for a listing of some of the
remaining discrepancies between the treebank and SAMA.
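
The general idea of such a check can be sketched in Python as follows.
This is not the procedure LDC used, which is described in
docs/readme-files.txt; it is a minimal illustration under stated
assumptions.  The data structures, the function name, and the
placeholder analysis strings are all hypothetical, and the sketch does
not call SAMA itself: it assumes the candidate solutions for each token
have already been collected into a dictionary.

```python
def find_discrepancies(annotated_tokens, candidate_solutions):
    """Return tokens whose chosen analysis is not among the candidates.

    annotated_tokens: list of (form, chosen_analysis) pairs (hypothetical).
    candidate_solutions: dict mapping a form to the set of analyses an
        analyzer proposed for it (hypothetical stand-in for SAMA output).
    """
    discrepancies = []
    for index, (form, chosen) in enumerate(annotated_tokens):
        # A form missing from the dictionary has no candidates at all,
        # so any analysis chosen for it is reported as a discrepancy.
        allowed = candidate_solutions.get(form, set())
        if chosen not in allowed:
            discrepancies.append((index, form, chosen))
    return discrepancies


# Toy data with placeholder strings, not real SAMA solutions.
tokens = [
    ("form1", "analysis_A"),
    ("form2", "analysis_X"),
    ("form3", "analysis_C"),
]
candidates = {
    "form1": {"analysis_A", "analysis_B"},
    "form2": {"analysis_C"},
}

flagged = find_discrepancies(tokens, candidates)
for index, form, chosen in flagged:
    print(f"token {index}: {form!r} annotated as {chosen!r}, not a candidate")
```

In this toy run the second and third tokens are flagged for manual
inspection: the second because its chosen analysis is not among the
candidates, the third because the form has no candidates at all.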

The Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation
Guidelines are available on the LDC website at
http://projects.ldc.upenn.edu/ArabicTreebank/.

8. DTDs

This release includes one DTD, ag-1.1.dtd (in the docs/ directory), for
the AG XML files.

9. Copyright Information

Portions (c) 2009-2012 Trustees of the University of Pennsylvania

10. Contact Information

Contact information for key project personnel:

Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
Ann Bies, bies@ldc.upenn.edu
Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on April 16, 2012 by Ann Bies.