BOLT Egyptian Arabic Treebank - Discussion Forum V1.0
CatalogID: 
Release date: January 9, 2017
Linguistic Data Consortium
Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna, Dalila
         Tabassi, Michael Ciul

1. Introduction

This Egyptian Arabic Treebank corpus consists of
part-of-speech/morphological annotation and syntactic tree annotation
for 400,448 source tokens (508,548 tree tokens after clitic
splitting) of Egyptian Arabic (ARZ) in 730 files of discussion forum
text from various sources.

Most of this data was previously released to the BOLT community as
subcorpora in earlier versions; this publication consolidates the
Egyptian Arabic Treebank discussion forum (DF) data.  The corpora
previously released to the BOLT community had the catalog numbers
LDC2012E93 (ARZ-DF Part 1), LDC2012E98 (ARZ-DF Part 2), LDC2012E89
(ARZ-DF Part 3), LDC2012E99 (ARZ-DF Part 4), LDC2012E107 (ARZ-DF Part
5), LDC2012E125 (ARZ-DF Part 6), LDC2013E12 (ARZ-DF Part 7), and
LDC2013E21 (ARZ-DF Part 8).

This combined release also updates the synchronization of the tokens
in the corpus with the morphological analyzers: SAMA 3.1 Morphological
Analyzer (LDC2010L01), for the Modern Standard Arabic (MSA) tokens,
and the CALIMA v0.5 Morphological Analyzer, for the Egyptian Arabic
(ARZ) tokens.  85.3% of the ARZ source tokens in this combined corpus
(294641/345273) are a complete match with CALIMA v0.5.  Details can be
found in docs/readme-files.txt in this release.

This publication contains part-of-speech/morphology/gloss annotation
and syntactic treebank annotation that is in accordance with the Penn
Arabic Treebank (PATB) annotation guidelines.  The Penn Arabic
Treebank MSA Morphological and Syntactic Annotation Guidelines are
both available in the docs directory of this release
(docs/ATB-POSGuidelines-v3.8.pdf and
docs/ATB-SyntacticGuidelines-v4.95-20110630.pdf).  These are the same
annotation guidelines used for the recently updated and revised newswire
and broadcast news corpora that have been released (Arabic Treebank
Part 1 - V4.1, CatalogID: LDC2010T13; Arabic Treebank Part 2 v 3.1,
CatalogID: LDC2011T09; Arabic Treebank part 3 - v3.2, CatalogID:
LDC2010T08; Arabic Treebank - Broadcast News v1.0, CatalogID:
LDC2012T07).  The LDC Egyptian Arabic Treebank Morphological and
Syntactic Annotation Guidelines are also available in the docs
directory of this release (docs/ARZ-POSGuidelines-v1.2.pdf and
docs/ARZ-SyntacticGuidelines-v0.2.pdf).  The Egyptian Arabic
annotation guidelines target informal data, especially SMS/Chat data,
but also apply to other informal data such as discussion forum data.

Due to the nature of this Egyptian Arabic corpus, the relationship
between the source tokens and the morphological analyzer is now more
complicated, containing references to both the SAMA 3.1 Morphological
Analyzer (LDC2010L01), for the MSA tokens, and the CALIMA v0.5
Morphological Analyzer, for the ARZ tokens.  Because the POS annotation
was done simultaneously with development of the morphological
analyzer, some inconsistencies inevitably arose in the previously
released BOLT e-corpora data between the
part-of-speech/vocalization/lemma solutions and the morphological
analyzer solutions.  These are now reconciled in this combined release.
Detailed information about the correspondence can be found in
docs/readme-files.txt.

This release conforms to the format conventions initiated with the
releases of Arabic Treebank part 5 - v1.0, LDC2009E72 (ATB5) and
Arabic Treebank Part 6 V1.0 - GALE Phase 4 dev09, LDC2009E108 (ATB6),
which are detailed in docs/readme-files.txt and in the
docs/KulickBiesMaamouri-LREC2010.pdf paper:

Consistent and Flexible Integration of Morphological Annotation in the
Arabic Treebank. Seth Kulick, Ann Bies and Mohamed Maamouri. In
Proceedings of the Seventh International Conference on Language
Resources and Evaluation (LREC 2010), Malta, May 19-21, 2010.
Available: docs/KulickBiesMaamouri-LREC2010.pdf
(and also available on the LDC website at
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2010-consistent-flexible-integration.pdf)

Two papers written about the revision and enhancement process of the
newswire corpora that resulted in the revised ATB annotation
guidelines are available on the LDC website:

Enhancing the Arabic Treebank: A Collaborative Effort toward
New Annotation Guidelines. Mohamed Maamouri, Ann Bies, Seth
Kulick.  In Proceedings of the Sixth International Conference on
Language Resources and Evaluation (LREC 2008), Marrakech, Morocco,
May 28-30, 2008.  Available:
Paper: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2008-enhancing-arabic-treebank.pdf

Diacritic Annotation in the Arabic Treebank and its Impact on
Parser Evaluation. Mohamed Maamouri, Seth Kulick, Ann Bies.  In
Proceedings of the Sixth International Conference on Language
Resources and Evaluation (LREC 2008), Marrakech, Morocco, May
28-30, 2008.  Available:
Paper: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2008-diacritic-annotation-atb.pdf

In addition, a paper written about the development of the Egyptian
Arabic Treebank is also available on the LDC website:

Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology
on Annotation and Tool Development. Mohamed Maamouri, Ann Bies, Seth
Kulick, Michael Ciul, Nizar Habash and Ramy Eskander. In Proceedings
of LREC 2014: 9th Edition of the Language Resources and Evaluation
Conference, Reykjavik, May 26-31. Available:
Paper: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2014-developing-egyptian-arabic-treebank.pdf

This material is based upon work supported by the Defense Advanced 
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. 
The content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases: (a)
Part-of-Speech (=POS) tagging, which divides the text into lexical
tokens and gives relevant information about each token, such as
lexical category, inflectional features, and a gloss (referred to as
POS for convenience, although it includes morphological and gloss
information not traditionally included with part-of-speech
annotation), and (b) Arabic Treebanking (=ArabicTB), which
characterizes the constituent structures of word sequences, provides
categories for each non-terminal node, and identifies null elements,
co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus,
is described at http://www.qamus.org/transliteration.htm.
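
As a rough illustration of how the transliteration works, the sketch
below maps a small subset of Buckwalter symbols to Arabic letters.
The table is deliberately partial (the page cited above has the full
table), the function name is hypothetical, and nothing here accounts
for any additional symbols that the analyzers used for this corpus may
employ.

```python
# Partial Buckwalter-to-Arabic table: a subset of letters and short vowels.
# Characters that are not in the table are passed through unchanged.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621",  # hamza
    "A": "\u0627",  # alif
    "b": "\u0628",  # ba
    "p": "\u0629",  # ta marbuta
    "t": "\u062A",  # ta
    "d": "\u062F",  # dal
    "r": "\u0631",  # ra
    "s": "\u0633",  # sin
    "$": "\u0634",  # shin
    "E": "\u0639",  # ayn
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # mim
    "n": "\u0646",  # nun
    "h": "\u0647",  # ha
    "w": "\u0648",  # waw
    "y": "\u064A",  # ya
    "a": "\u064E",  # fatha
    "u": "\u064F",  # damma
    "i": "\u0650",  # kasra
    "~": "\u0651",  # shadda
    "o": "\u0652",  # sukun
}


def buckwalter_to_arabic(text):
    """Convert Buckwalter-transliterated text to Arabic script (partial table)."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


# "ktAb" (book) becomes kaf + ta + alif + ba.
print(buckwalter_to_arabic("ktAb") == "\u0643\u062A\u0627\u0628")  # True
```

Because each Buckwalter symbol stands for exactly one Arabic character,
a plain per-character lookup is enough for the direction shown here.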

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic
Annotation Guidelines are available in the docs directory of this
release.  The LDC Egyptian Arabic Treebank Morphological and Syntactic
Annotation Guidelines are also available in the docs directory of this
release.

2.2 Annotation Process

The original morphological/POS annotation for the BOLT e-corpora was
done using SAMA 3.1 as the morphological analyzer for the MSA tokens
for all of the subcorpora.  The morphological analyzer for the ARZ
tokens was CALIMA v0.3 for the original BOLT subcorpora parts 1-6, and
CALIMA v0.4.2 for the original BOLT subcorpora parts 7-8.  The ARZ
analyzer changed because the annotation and the development of the
Egyptian Arabic morphological analyzer were taking place in parallel,
and the later subcorpora took advantage of further development in the
analyzer.

Once the treebank annotation and CALIMA development were both
complete, it was possible to reconcile the annotation to be as
consistent as possible with the latest version of the analyzer,
CALIMA v0.5, which has been done for the data in this release.

Both the LDC Standard Arabic Morphological Analyzer (LDC2010L01, SAMA
3.1), for the MSA tokens, and the CALIMA Morphological Analyzer (v0.3
or v0.4.2, depending on the subcorpus), for the ARZ tokens, were used
to generate a candidate list of POS values for each word/token.  Our
annotators picked the appropriate one manually, or else manually
supplied segmentation and POS information if neither analyzer
contained the solution for the token.

Due to the nature of this Egyptian Arabic corpus (primarily ARZ, but
unavoidably including also some MSA), the relationship between the
source tokens and the morphological analyzer is more complicated than
for the entirely MSA ATB corpora, since this Egyptian Arabic Treebank
corpus contains references to both the SAMA 3.1 Morphological Analyzer
(LDC2010L01), for the MSA tokens, and the CALIMA v0.5 Morphological
Analyzer, for the ARZ tokens.  Because the POS annotation was done
simultaneously with development of the morphological analyzer, some
inconsistencies inevitably arose in the previously released BOLT
e-corpora data between the part-of-speech/vocalization/lemma solutions
and the morphological analyzer solutions.

These are now reconciled in this combined release, and this release
also updates the synchronization of the tokens in the corpus with the
morphological analyzers: SAMA 3.1 Morphological Analyzer (LDC2010L01),
for the Modern Standard Arabic (MSA) tokens, and the CALIMA v0.5
Morphological Analyzer, for the Egyptian Arabic (ARZ) tokens.  85.3%
of the ARZ source tokens in this combined corpus (294641/345273) are a
complete match with CALIMA v0.5.  Details can be found in
docs/readme-files.txt in this release.
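
The match rate quoted above can be checked directly from the two
counts given in this readme.  The variable names in this small sketch
are illustrative only.

```python
# Counts quoted in this readme for the ARZ source tokens.
arz_source_tokens = 345_273    # ARZ source tokens in the combined corpus
calima_full_matches = 294_641  # tokens that are a complete match with CALIMA v0.5

match_rate = calima_full_matches / arz_source_tokens
not_full_match = arz_source_tokens - calima_full_matches

print(f"complete match: {match_rate:.1%}")        # complete match: 85.3%
print(f"not a complete match: {not_full_match}")  # not a complete match: 50632
```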

We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for
numerical data, PUNC for punctuation, and FOREIGN or LATIN for
non-Arabic alphabetic data.

We then ran automatic checks on the part-of-speech tags, followed by
further manual revision where necessary, to ensure that the tags are
consistent with the current guidelines.
The morphological tagset was then reduced to a smaller POS set, to
facilitate future automatic dialectal parsing.

Once POS annotation was complete, we automatically separated the
clitics based on the POS selection.  Human annotators provided full
syntactic trees manually, according to the MSA and ARZ Treebank
Annotation Guidelines.

The QC process consisted of a series of specific searches for several
hundred types of potential inconsistency and annotation error.  Any
errors found in these searches were hand corrected in two passes.

The annotators for this project were Nancy Abdelhalim, Olfa Bayouth,
Maha Ben Hadj Aleya, Sameh Benna, Asma Berrima, Faiez Dhieb, Seham El
Kareh, Soha Sobhy Ali Abd El-Raheem, Radwa Essam Abd Elmonaem Elsawy,
Omnia Abdelmonem Elsayed, Rachida Fathallah, Fatma Gaddeche, Esma
Maamouri Ghrib, Aicha Graja, Nadia Hamrouni, Nermine Khalil, Nawred
Khazri, Sondos Krouna, Badia Laadioui, Leila Laghrissi, Omnia Taha
Mahfouz, Reham Mohamed Marzouk, Soumeya Mekki, Fatma Elaaty Mohamed,
Reem Nabil Mohammed, Sherine Hassan Mustapha, Mouna Rezig, Mahytab
Mohammed Abbas Shouman, and Dalila Tabassi.

3. Source Data Profile

3.1 Data Selection Process

This Egyptian Arabic Treebank Discussion Forum corpus consists of
730 files of discussion forum text from various sources.  There are a
total of 400,448 source tokens before clitics are split and 508,548
tree tokens after clitics are separated for the treebank annotation.
All of this data has been annotated for morphology/part-of-speech and
syntactic structure.

The files selected for this treebank corpus were chosen from the
Egyptian Arabic (ARZ) discussion forum files that had already
undergone sentence unit (SU) annotation at LDC.

3.2 Data Sources and Epochs

The data consists of Arabic discussion forum text from various sources
collected by LDC.

4. Annotated Data Profile

This data consists of 730 files of discussion forum text from various
sources.  There are a total of 400,448 source tokens before clitics
are split and 508,548 tree tokens, after clitics are separated for the
treebank annotation.  All of this data has been annotated for
morphology/part-of-speech and syntactic structure.
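
For a sense of scale, the figures above imply the following derived
numbers.  This is only arithmetic on the counts stated in this readme.
The per-file average is a simple mean and says nothing about how file
sizes are actually distributed.

```python
source_tokens = 400_448  # source tokens, before clitic splitting
tree_tokens = 508_548    # tree tokens, after clitic splitting
files = 730

added_by_splitting = tree_tokens - source_tokens  # net tokens added by splitting
expansion = tree_tokens / source_tokens           # tree tokens per source token
mean_source_per_file = source_tokens / files

print(added_by_splitting)               # 108100
print(f"{expansion:.2f}")               # 1.27
print(f"{mean_source_per_file:.0f}")    # 549
```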

5. Data Directory Structure

The source file IDs are listed in docs/file.ids.  A listing of all of
the files in this release can be found in docs/file.tbl. The data
formats, including the integrated format, are documented in
docs/readme-files.txt.

In the data/ directory:

- integrated/ - The goal of this format is to bring together in one
     place: 1) the information about the source tokens from the
     pos/before files, including the explicit mapping between the
     source and tree tokens, 2) the information about the tree tokens
     from the pos/after files, 3) the tree structure. For details about
     this content, see docs/readme-files.txt.
- penntree/ - the annotation files in Penn Treebank bracketed list
     style.
- pos/      - the POS annotation for this corpus.  For details about
     this content, see docs/readme-files.txt.
- su_xml/   - the SU annotated files used to supply the source data
     and tokens as input for the POS annotation.
- tdf/      - the SU files converted into the .tdf format necessary
     for the operation of the POS annotation tool.
- xml/      - the annotation graph files, in the format used by our
     syntactic annotation tool.
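
A quick way to confirm that an unpacked copy of the release has the
data/ layout listed above is sketched below.  The function name and
the release_root argument are hypothetical; only the six subdirectory
names come from this readme.

```python
from pathlib import Path

# Subdirectories of data/ as listed in this readme.
EXPECTED_DATA_SUBDIRS = ("integrated", "penntree", "pos", "su_xml", "tdf", "xml")


def missing_data_subdirs(release_root):
    """Return the expected data/ subdirectories that are absent under release_root."""
    data_dir = Path(release_root) / "data"
    return [name for name in EXPECTED_DATA_SUBDIRS
            if not (data_dir / name).is_dir()]
```

An empty list means that all six directories are present; for example,
missing_data_subdirs(".") checks the current directory.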

In the docs/ directory:

- ag-1.1.dtd                         - The DTD file for the AG XML files.
- ATB-POSGuidelines-v3.8.pdf         - Morphological and part-of-speech
     annotation guidelines.
- ATB-SyntacticGuidelines-v4.95-20110630.pdf - Syntactic annotation
     guidelines.
- atb-arz-df-taglist-conversion-to-PennPOS-forrelease.lisp - Lisp code 
     mapping the full morphological tags to a much smaller list, similar to 
     the Penn POS tagset, strictly for convenience.
- file.ids                           - A list of file ids in the corpus.
- file.tbl                           - Directory structure for everything in 
     this package.
- KulickBiesMaamouri-LREC2010.pdf    - Paper describing data formats and the 
     integration of Treebank and SAMA tokens.
- not-included.txt - A listing of character sequences from the source
     files that are not included as tokens.  See readme-files.txt for
     further explanation.
- readme-files.txt - Additional details about the data and data formats,
     including information about the data/pos/before content,
     data/integrated files and data/tdf files, along with information about
     the relationship with SAMA.
- tags-count.txt   - A list of the POS/morphological tags 
     after the clitics are separated and after treebank annotation, along 
     with the number of occurrences of each tag.
- token-mapping.txt - A mapping making explicit the linkage between the
     annotation files, the .tdf files, and the .su.xml file.  See
     readme-files.txt for details.

6. File Format Description

A description of the file formats (and the types of files present for
each of the IDs in docs/file.ids) is in docs/readme-files.txt and in
docs/KulickBiesMaamouri-LREC2010.pdf, including a description of the
modifications that have been made to the format of the data in the
various .txt and .tree files compared with ATB releases prior to ATB5
and detailed information on the integrated format.
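
As a generic illustration of Penn Treebank bracketed style, the sketch
below reads one bracketed tree into nested (label, children) tuples
and lists its leaves.  It assumes whitespace-separated tokens and
parentheses used only as brackets; the example tree and its labels are
made up for illustration and are not taken from the corpus.  The
authoritative description of the files in this release is
docs/readme-files.txt.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        # An outer wrapper such as "( (S ...))" has no label of its own.
        if tokens[pos] in ("(", ")"):
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a terminal (word) token
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(tree):
    """Return the terminal tokens of a parsed tree, left to right."""
    _, children = tree
    result = []
    for child in children:
        if isinstance(child, tuple):
            result.extend(leaves(child))
        else:
            result.append(child)
    return result


# Made-up example tree; the labels are illustrative only.
example = "(S (VP (PV qAl) (NP (NOUN Alwld))))"
print(leaves(parse_tree(example)))  # ['qAl', 'Alwld']
```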

7. Data Validation

The data went through the following annotation procedure:

POS procedure:

- All words were submitted to the morphological analyzers.  (Note that
  for some tokens, there was no solution in either analyzer; most are
  addressed by the last step below.)
- All words were then included for POS annotation, where annotators either
  selected one out of many choices provided by the morphological
  analyzers, or reviewed the annotation done in a previous POS pass.
- Tokens with no solution in either SAMA or CALIMA were annotated using
  a new "wildcard" feature in the annotation tool that allows annotators
  to supply annotation for a stem that is not in the analyzer in
  accordance with the CALIMA/SAMA scheme.

TB procedure:

- Words/tokens from the POS annotation were processed to separate clitics
  in preparation for TB annotation. After clitic separation, the number
  of words/tokens increased from 400,448 to 508,548.
- All sentences were manually annotated for syntactic structure.
- Annotators went through a stage of annotation with the help of
  diagnostic QC searches to catch potential patterns of annotation errors.

Quality assurance & annotation checking for this release:

Every token in the treebank has been explicitly tested against the
possible SAMA 3.1 and CALIMA v0.5 solutions for that token.  See
docs/readme-files.txt for a detailed analysis of the relationship with
the CALIMA and SAMA analyzers.

The Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation
Guidelines are available in the docs/ directory of this release.  The
LDC Egyptian Arabic Morphological and Syntactic Annotation Guidelines
are also available in the docs/ directory of this release.

8. DTDs

There is one DTD, ag-1.1.dtd, for the AG XML files.  It is located
both in docs/ and with the .xml files in data/xml/treebank/.

9. Copyright Information

Portions (c) 2011-2017 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel: 

Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
Ann Bies, bies@ldc.upenn.edu
Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on January 9, 2017 by Ann Bies.