This release follows earlier treebank releases in that we include
information indicating the relationship with the morphological analyzer.
However, due to the nature of this corpus, this relationship is now more
complicated, containing references to both the SAMA 3.1 Morphological
Analyzer (LDC2010L01) for the MSA tokens and the CALIMA v0.5
Morphological Analyzer for the ARZ tokens.  This is discussed in more
detail below.  For more information on CALIMA, see:

  Nizar Habash, Ramy Eskander, and Abdelati Hawwari
  "A Morphological Analyzer for Egyptian Arabic"
  Special Interest Group on Computational Morphology and Phonology, 2012

Briefly, "source tokens" are the whitespace/punctuation-delimited tokens
(offset annotation) on the source text that receive a morphological
analysis through the SAMA analyzer.  The "tree tokens" result from
splitting these source tokens into subsequences as appropriate for the
annotation of syntactic structure.  In this release, there are 153,171
source tokens and 182,965 tree tokens.  This terminology, along with
information about the UNVOCALIZED and INPUT STRING forms, is discussed
further in this paper, included in this release:

  "Consistent and Flexible Integration of Morphological Annotation in
   the Arabic Treebank"
  Seth Kulick, Ann Bies, Mohamed Maamouri
  LREC 2010
  http://papers.ldc.upenn.edu/LREC2010/KulickBiesMaamouri-LREC2010.pdf

The annotation in this release was done simultaneously with development
of the morphological analyzer.  Therefore, some inconsistencies
inevitably exist in the data between the part-of-speech/vocalization/
lemma solutions and the morphological analyzer solutions.

==================================================================
Contents:

1. su.xml files converted to tdf files for annotation
2. Solution status field
3. File extensions and directory structure
4. Description of the "Integrated" format
5. Additional Information
   5a. data/xml/pos/FILE.xml not included
   5b. A note about multiple trees on one line
   5c.
Non-ASCII punctuation characters
==================================================================

==================================================================
1. su.xml files converted to tdf files for annotation
==================================================================

The source files used for this corpus are the xml files included in the
data/su_xml directory.  For purposes of POS annotation, these were
converted to the .tdf files included in the data/tdf directory.  The
conversion used a simple script that searched for tags and captured two
pieces of data: the "id" attribute and the text content of the element.
The "id" value was placed in the "file" column of the TDF file, and the
text content was placed in the "transcript" column.  The other columns
of the TDF were filled with predictable data.  Leading whitespace in the
text from su.xml was stripped for inclusion in the tdf file.

==================================================================
2. Solution status field
==================================================================

Each file in pos/before has a set of entries for each source token.  One
entry is the STATUS field.  This consists of 6 fields of information:

has_sol:    T if the token has a solution, F if it doesn't (i.e., if the
            POS tag is NO_FUNC).  If has_sol=F, then all the following
            fields are set to . (period), except for orig, which remains
            ARZ, although it can be ignored in this case.

excluded:   T if the token is not checked for a matching solution in the
            tables for either analyzer; F if it is checked.  If
            excluded=T, then all the following fields are set to .
            (period), except for orig, which remains ARZ, although it
            can be ignored in this case.

orig:       ARZ for Egyptian Arabic, or MSA for Modern Standard Arabic.

sama:       T if the token's annotation exactly matches a solution in
            the SAMA analyzer, and F otherwise.
            Here "exactly matches" means that there is a SAMA solution
            with matching POS, VOC, LEMMA, and UNSPLIT_VOC (GLOSS is not
            checked).

calima_all: T if the token's annotation exactly matches a solution in
            the CALIMA analyzer, and F otherwise.  Here "exactly
            matches" means that there is a CALIMA solution with matching
            POS, VOC, LEMMA, and UNSPLIT_VOC (GLOSS is not checked).

calima_pv:  Three possible values.  If calima_all=T, then calima_pv=.
            (period).  Otherwise, calima_pv=T if the token's annotation
            matches a solution in the CALIMA analyzer for POS and VOC
            (i.e., LEMMA, UNSPLIT_VOC, and GLOSS are not checked), and
            calima_pv=F if not.

The tokens overall are characterized as follows:

no solution       582  (has_sol=F)
excluded        29638  (has_sol=T excluded=T)
MSA               424  (has_sol=T excluded=F orig=MSA)
ARZ            122527  (has_sol=T excluded=F orig=ARZ)
               ------
total # tokens 153171

The excluded tokens are those with POS tags such as PUNC and TYPO, as
well as NOUN_NUM and ADJ_NUM if the latter two consist of all digits,
etc.

The MSA tokens are categorized as follows:

sama=T            424  (has_sol=T excluded=F orig=MSA sama=T)
sama=F              0  (has_sol=T excluded=F orig=MSA sama=F)
                -----
                  424

The ARZ tokens are categorized as follows:

calima_all=T   100895  (has_sol=T excluded=F orig=ARZ calima_all=T)
calima_pv=T      7649  (has_sol=T excluded=F orig=ARZ calima_all=F calima_pv=T)
calima_pv=F      13983 (has_sol=T excluded=F orig=ARZ calima_all=F calima_pv=F)
               ------
               122527

============================================================================
3. File extensions and directory structure
============================================================================

Each FILE in docs/file.ids has a corresponding file in the following
directories.

data/su_xml/FILE.su.xml
    (utf-8) su.xml file (see Section 1)

data/tdf/FILE.tdf
    (utf-8) tdf file (see Section 1)

data/pos/before/FILE.txt
    (utf-8) Information about the "source tokens" used for analysis with
    SAMA.  So this is a listing of each token before clitic-separation.
    Each token contains the following information:

    -----------------------------------------------------------
    INPUT STRING: (utf-8 characters from .tdf file)
    IS_TRANS:     (Buckwalter transliteration of previous, used for
                   input to CALIMA and SAMA)
    INDEX:        (automatically assigned index, based on
                   paragraph&word#)
    OFFSETS:      (start-end - pair of integers - offset into tdf file,
                   corresponding to the INPUT STRING)
    TOKENS:       (start-end - two indices indicating the tree tokens in
                   the corresponding pos/after/FILE.txt file that
                   correspond to this source token)
    STATUS:       (the status of this solution with respect to the
                   analyzers, as discussed in Section 2)
    COMMENT:      (a comment associated with this source token)
    LEMMA:        (the lemma associated with this source token and
                   solution in CALIMA or SAMA)
    UNSPLITVOC:   (the vocalized form (not separated into segments) of
                   the source token solution, from CALIMA or SAMA)
    POS:          (POS for this source token)
    VOC:          (vocalization for this source token)
    GLOSS:        (gloss for this source token)
    -----------------------------------------------------------

    The POS, VOC, and GLOSS fields are redundant with the respective
    values of the corresponding tree tokens.

data/xml/treebank/FILE.xml
    The Annotation Graph .xml file for the tree tokens.

data/pos/after/FILE.txt
    Information about each tree token in the corresponding xml/treebank
    FILE.xml file.
    Each token contains the following information:

    -----------------------------------------------------------
    INPUT STRING: (utf-8 characters from .tdf file)
    IS_TRANS:     (Buckwalter transliteration of previous)
    COMMENT:      (annotator comment about word)
    INDEX:        (automatically assigned index, based on
                   paragraph&word#)
    OFFSETS:      (start,end - pair of integers - Annotation Graph
                   offset into tdf file, corresponding to the INPUT
                   STRING)
    UNVOCALIZED:  (the unvocalized form of the word)
    VOCALIZED:    (the vocalized form of the word, taken from the
                   solution)
    POS:          (the pos tag, taken from the solution)
    GLOSS:        (the gloss, taken from the solution)
    -----------------------------------------------------------

    For more information about the derivation of the INPUT STRING and
    UNVOCALIZED fields for clitic-separated tree tokens, see the paper
    "Consistent and Flexible Integration of Morphological Annotation in
    the Arabic Treebank", mentioned at the beginning of this readme.

data/penntree/without-vowel/FILE.tree
    Penn Treebanking style output, generated from the integrated file.
    Each terminal is of the form (pos word), where pos and word
    correspond to the POS and UNVOCALIZED values for the corresponding
    token in pos/after/FILE.txt, respectively.

data/penntree/with-vowel/FILE.tree
    Penn Treebanking style output, generated from the integrated file.
    Each terminal is of the form (pos word), where pos and word
    correspond to the POS and VOCALIZED values for the corresponding
    token in pos/after/FILE.txt, respectively.

data/integrated/FILE.txt
    See Section 4 for a description of the integrated format.

============================================================================
4. Description of the "Integrated" format
============================================================================

The goal of this format is to bring together in one place:

1) the information about the source tokens from the pos/before files,
   including the explicit mapping between the source and tree tokens
2) the information about the tree tokens from the pos/after files
3) the tree structure

The basic format of each file is:

    FILEPREFIX:
    file metadata (beginning with ;;)
    CHUNK: filename:chunk#
    chunk metadata (beginning with ;;)
    #source tokens: #
    #tree tokens: #
    #trees: 1
    listing of source tokens
    listing of tree tokens
    TREE: filename:chunk#:tree#:#_tokens_in_tree
    tree with W# instead of the tree tokens

with the CHUNK,TREE sections repeated.  (The file and chunk metadata is
taken directly from the Annotation Graph file and can be ignored or used
as the user likes.)

Each CHUNK corresponds to one "Paragraph" in the usual release
terminology.  (It is possible that in some versions of the treebank more
than one tree may be associated with a paragraph, which is why there is
a slot for "# trees" after the # of source tokens, and in the TREE:
line.  In this release it is always 1 tree per chunk, however, with TOP
wrapped around it.)

Each source token row consists of the following 17 items, separated by
the character U+00B7:

1)  s:# - the source token #
2)  the source token text in utf8, corresponding to the source text
3)  the source token text, in Buckwalter transliteration
4)  the starting offset
5)  the ending offset
6)  the start index of the corresponding tree token(s)
7)  the end index of the corresponding tree token(s)
8)  a field reserved for future use, which should be ignored
9)  a field reserved for future use, which should be ignored
10) a field reserved for future use, which should be ignored
11) a field reserved for future use, which should be ignored
12) a field reserved for future use, which should be ignored
13) an annotator comment for this source token
14) the status of this token with respect to the analyzers, as
    discussed in Section 2 above.
    This is encoded as a 6 character string:
        character 1 is "has_sol"
        character 2 is "excluded"
        character 3 is "orig" (A for ARZ, M for MSA)
        character 4 is "sama"
        character 5 is "calima_all"
        character 6 is "calima_pv"
15) the lemma for this source token
16) the unsplit vocalization for this source token
17) a status indicating whether this source token is mapped to
    corresponding tree tokens (all "OK")

These fields correspond to the information in the pos/before files as
follows:

1)   <-> INDEX (here counting from 0, in the pos/before file counting
         from 1)
2)   <-> INPUT STRING
3)   <-> IS_TRANS
4,5) <-> OFFSETS
6,7) <-> TOKENS (here counting from 0, in the pos/before file counting
         from 1)
14)  <-> STATUS
15)  <-> LEMMA
16)  <-> UNSPLITVOC

Each tree token row consists of the following 11 items:

1)  t:# - the tree token #
2)  POS tag
3)  "f" or "t" - a boolean indicating whether this token was split from
    the previous tree token
4)  "f" or "t" - a boolean indicating whether this token was split from
    the following tree token
5)  vocalized form
6)  gloss
7)  offset start
8)  offset end
9)  text in utf8, corresponding to the source .tdf text
10) unvocalized form
11) comment

These fields correspond to the information in the pos/after files as
follows:

1)     <-> INDEX (here counting from 0, in the pos/after file counting
           from 1)
2)     <-> POS
3,4,5) <-> VOCALIZED (with the separate boolean split information as
           hyphens on the VOCALIZED form)
6)     <-> GLOSS
7,8)   <-> OFFSETS
9)     <-> INPUT STRING
10)    <-> UNVOCALIZED
11)    <-> COMMENT

============================================================================
5. Additional Information
============================================================================

============================================================================
5a. data/xml/pos/FILE.xml not included
============================================================================

As in recent releases, we do not include the Annotation Graph .xml file
for the source token file.
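The integrated-format token rows described in Section 4 are
straightforward to pull apart programmatically.  The following is an
illustrative sketch only, not a tool included in this release; the
function names are our own, and it assumes exactly the field layout
documented in Section 4.

```python
# Illustrative sketch (not part of this release): split an
# integrated-format token row on the U+00B7 separator and decode the
# 6-character status string described in Sections 2 and 4.
SEP = "\u00b7"  # fields are separated by U+00B7

STATUS_KEYS = ["has_sol", "excluded", "orig", "sama",
               "calima_all", "calima_pv"]

def parse_status(status):
    # One character per field; "." means "not applicable".
    return dict(zip(STATUS_KEYS, status))

def parse_row(line):
    fields = line.rstrip("\n").split(SEP)
    # Source token rows start with "s:#", tree token rows with "t:#".
    row = {"kind": "source" if fields[0].startswith("s:") else "tree",
           "fields": fields}
    if row["kind"] == "source":
        # Item 14 (0-based index 13) is the status string.
        row["status"] = parse_status(fields[13])
    return row
```

For example, a status string of "TFA.T." would decode as has_sol=T,
excluded=F, orig=ARZ, sama not applicable, calima_all=T.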
============================================================================
5b. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than one
complete tree.  The reason for this is that the annotators work on one
"Paragraph" (CHUNK) at a time - e.g., a tree with the root "Paragraph"
node, as can be seen by looking at the "treebanking" feature in the xml
files.  When the trees are generated, the "Paragraph" node is dropped.
If the form of the annotation was (Paragraph S1 S2), where S1 and S2 are
both complete trees, then they will appear on one line.  This is also
true for the integrated format, except that in that format "TOP" appears
instead of "Paragraph".

============================================================================
5c. Non-ASCII punctuation characters
============================================================================

It is possible that the source text may contain non-ASCII punctuation
characters.  This can be somewhat problematic for the Buckwalter
transliteration, since there is no mapping for these characters.  In
these cases, we have indicated the Unicode value in the IS_TRANS field
in the POS file.  For example, in ar_4510_0107-0242.su.txt:

INPUT STRING: ÷
IS_TRANS:     ÷
INDEX:        P86W2
OFFSETS:      1-2
TOKENS:       P86W2-P86W2
STATUS:       has_sol=F excluded=. orig=ARZ sama=. calima_all=. calima_pv=.
COMMENT:      []
LEMMA:        None
UNSPLITVOC:   (÷)
POS:          NO_FUNC
VOC:          ?
GLOSS:        nogloss
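Such cases can be located by scanning IS_TRANS values for characters
outside the ASCII range, since the Buckwalter transliteration proper
uses only ASCII characters.  The following is a minimal sketch of our
own (not a release tool) built on that one assumption.

```python
# Sketch (our own, not a release tool): report characters in an
# IS_TRANS value that have no Buckwalter mapping.  We assume only that
# Buckwalter transliteration is ASCII, so any non-ASCII character is a
# passed-through source character, as in the Section 5c example.
def unmapped_chars(is_trans):
    """Return (char, codepoint) pairs for non-ASCII characters."""
    return [(ch, f"U+{ord(ch):04X}") for ch in is_trans if ord(ch) > 127]
```

For the example above, unmapped_chars("÷") reports the division sign as
U+00F7, while a fully transliterated token such as "ktb" reports
nothing.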