This release follows earlier treebank releases in that we include information indicating the relationship with the morphological analyzer. However, due to the nature of this corpus, this relationship is now more complicated, containing references to both the SAMA 3.1 Morphological Analyzer (LDC2010L01) for the MSA tokens and the CALIMA v0.5 Morphological Analyzer for the ARZ tokens. This is discussed in more detail below.

Briefly, "source tokens" are the whitespace/punctuation-delimited tokens (offset annotation) on the source text that receive a morphological analysis through the SAMA analyzer. The "tree tokens" result from splitting these source tokens into subsequences as appropriate for the annotation of syntactic structure. In this release, there are 400,448 source tokens and 508,548 tree tokens.

This terminology, along with information about the UNVOCALIZED and INPUT STRING forms, is discussed further in this paper, included in this release:

    Consistent and Flexible Integration of Morphological Annotation
    in the Arabic Treebank
    Seth Kulick, Ann Bies, Mohamed Maamouri
    LREC 2010
    http://papers.ldc.upenn.edu/LREC2010/KulickBiesMaamouri-LREC2010.pdf

The annotation in this release was done simultaneously with development of the morphological analyzer. Therefore some inconsistencies inevitably exist in the data between the part-of-speech/vocalization/lemma solutions and the morphological analyzer solutions. These will be reconciled in a future release. This also affects the generation of the INPUT STRING forms for the tree tokens, which will be upgraded in future releases.

==================================================================
Contents:

1.  .su.xml files converted to tdf files for annotation
2.  Solution status field
3.  Synchronization with CALIMA
4.  File extensions and directory structure
5.  Description of the "Integrated" format
6.  Additional Information
6a. data/xml/pos/FILE.xml not included
6b. A note about multiple trees on one line
6c. Non-ASCII punctuation characters
6d. Characters not included in tokens
==================================================================

==================================================================
1. .su.xml files converted to tdf files for annotation
==================================================================

The source files used for this corpus are the .su.xml files included in the data/su_xml directory. For purposes of POS annotation, these were converted to the .tdf files included in the data/tdf directory. The conversion used a simple script that searched for the relevant tags and captured two pieces of data: the "id" attribute and the text content of the element. The "id" attribute was placed in the "file" column of the TDF file, and the text content was placed in the "transcript" column. Other columns of the TDF were filled with predictable data. Leading whitespace in the text from su_xml was stripped for inclusion in the tdf file.

For convenience, we include a file token-mapping.txt that makes explicit the linkage between the annotation files, the .tdf files, and the .su.xml files. There are 8 columns:

1) The location of the token, of the form filename:paragraph#:token#.
   This corresponds to the pos/before file.
2) Token offsets, again as in the pos/before file.
3) The source text as included in the INPUT STRING field in the
   pos/before file.
4) The text taken directly from the .tdf file.
5) The id of this paragraph in the su_xml file.
6) The corresponding offsets of this token in the su_xml file.
7) The text taken directly from the su_xml file.
8) A potentially empty column, which will contain the text "file/tdf
   difference" if the text in the INPUT STRING field of the
   pos/before file differs from the text in the tdf file.

In most cases the offsets are the same in columns 2 and 6, but they may differ if leading whitespace was trimmed before being used in the .tdf file.
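As an illustration, the mapping file can be read with a few lines of Python. This is only a sketch, and it assumes the 8 columns are tab-separated; the dictionary keys are informal labels, not names used by the release:

```python
# Sketch: read token-mapping.txt, assuming one token per line with the
# 8 columns described above separated by tabs. The keys below are
# informal labels for the columns, not names defined by the release.
def read_token_mapping(path="token-mapping.txt"):
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 8:
                continue  # skip anything that is not a mapping row
            rows.append({
                "location":      cols[0],  # filename:paragraph#:token#
                "pos_offsets":   cols[1],  # offsets as in the pos/before file
                "pos_text":      cols[2],  # INPUT STRING from the pos/before file
                "tdf_text":      cols[3],  # text from the .tdf file
                "suxml_id":      cols[4],  # paragraph id in the su_xml file
                "suxml_offsets": cols[5],  # offsets in the su_xml file
                "suxml_text":    cols[6],  # text from the su_xml file
                "difference":    cols[7],  # "file/tdf difference" or empty
            })
    return rows
```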
There are a few cases in which column 8 does indicate a file/tdf difference. These concern the Arabic question mark, Arabic comma, and tatweel characters, which have traditionally been mapped to the corresponding ASCII characters in ATB releases.

==================================================================
2. Solution status field
==================================================================

Each file in pos/before has a set of entries for each source token. One entry is the STATUS field. This consists of 6 fields of information:

has_sol:    T if the token has a solution, F if it doesn't (i.e., if
            the POS tag is NO_FUNC). If has_sol=F, then all the
            following fields are set to . (period), except for orig,
            which remains ARZ, although it can be ignored in this
            case.

excluded:   T if the token is not checked for a matching solution in
            the tables of either analyzer; F if it is checked. If
            excluded=T, then all the following fields are set to .
            (period), except for orig, which remains ARZ, although it
            can be ignored in this case.

orig:       ARZ for Egyptian Arabic, or MSA for Modern Standard
            Arabic.

sama:       T if the token's annotation exactly matches a solution in
            the SAMA analyzer, and F otherwise. Here "exactly
            matches" means that there is a SAMA solution with
            matching POS, VOC, LEMMA, and UNSPLIT_VOC (GLOSS is not
            checked).

calima_all: T if the token's annotation exactly matches a solution in
            the CALIMA analyzer, and F otherwise. Here "exactly
            matches" means that there is a CALIMA solution with
            matching POS, VOC, LEMMA, and UNSPLIT_VOC (GLOSS is not
            checked).

calima_pv:  Three possible values. If calima_all=T, then calima_pv=.
            (period). Otherwise, calima_pv=T if the token's
            annotation matches a solution in the CALIMA analyzer for
            POS and VOC (i.e., LEMMA, UNSPLIT_VOC, and GLOSS are not
            checked), and calima_pv=F if not.
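A STATUS line such as "STATUS: has_sol=T excluded=F orig=ARZ sama=F calima_all=T calima_pv=." can be turned into a dictionary and classified with a short sketch like the following (the classification labels are informal, chosen here for illustration):

```python
# Sketch: parse the STATUS line of a pos/before entry into a dict of
# the 6 fields described above, and classify the token coarsely.
def parse_status(line):
    line = line.replace("STATUS:", "", 1).strip()
    return dict(field.split("=", 1) for field in line.split())

def match_level(status):
    """Informal classification of a token's match status."""
    if status["has_sol"] == "F":
        return "no solution"
    if status["excluded"] == "T":
        return "excluded"
    if status["orig"] == "MSA":
        return "sama match" if status["sama"] == "T" else "sama mismatch"
    if status["calima_all"] == "T":
        return "calima full match"
    return ("calima pos+voc match" if status["calima_pv"] == "T"
            else "calima mismatch")
```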
The tokens overall are characterized as follows:

no solution      5199   (has_sol=F)
excluded        40294   (has_sol=T excluded=T)
MSA              9682   (has_sol=T excluded=F orig=MSA)
ARZ            345273   (has_sol=T excluded=F orig=ARZ)
               ------
total          400448

The excluded tokens have POS tags such as PUNC, TYPO, and NOUN_NUM or ADJ_NUM when the latter two consist entirely of digits, etc.

The MSA tokens are categorized as follows:

sama=T           9164   (has_sol=T excluded=F orig=MSA sama=T)
sama=F            518   (has_sol=T excluded=F orig=MSA sama=F)
                 ----
                 9682

The ARZ tokens are categorized as follows:

calima_all=T   294641   (has_sol=T excluded=F orig=ARZ calima_all=T)
calima_pv=T      4663   (has_sol=T excluded=F orig=ARZ calima_all=F calima_pv=T)
calima_pv=F     45969   (has_sol=T excluded=F orig=ARZ calima_all=F calima_pv=F)
               ------
               345273

Therefore 85.3% of the ARZ source tokens (294641/345273) are a complete match with CALIMA.

============================================================================
3. Synchronization with CALIMA
============================================================================

The treebank was developed together with the CALIMA analyzer. Therefore some annotations used for the treebank differed from later CALIMA representations of the same solution. We have implemented a procedure to automatically correct such cases, to maximize the synchronization of the treebank annotations and the CALIMA solutions.

For each source token in the treebank that is classified as ARZ, and for which the solution was not a complete match with CALIMA (calima_all=F), we did the following:

1. If the POS, VOC, and LEMMA fields all match a solution in CALIMA
   for that word, and there is only one such matching solution, then
   use the UNSPLIT_VOC from that solution. That is, the annotation
   matched a CALIMA solution except for the UNSPLIT_VOC, and the POS,
   VOC, and LEMMA unambiguously determined what the UNSPLIT_VOC
   should be to make it a complete match with CALIMA.

2. If the POS and VOC fields match a solution in CALIMA for that
   word, and there is only one such matching solution, then use the
   LEMMA (and UNSPLIT_VOC if different) from that solution.

3. If the POS and LEMMA fields match a solution in CALIMA for that
   word, and there is only one such matching solution, then use the
   VOC (and UNSPLIT_VOC if different) from that solution.

4. If the POS field matches a solution in CALIMA for that word, and
   there is only one such matching solution, then use the VOC and
   LEMMA (and UNSPLIT_VOC if different) from that solution.

This resulted in modifications to 73,752 tokens, increasing the synchronization from 64.0% to 85.3%. We cannot guarantee that every such change was appropriate, but on balance it seemed far more desirable to increase the synchronization, given the issues caused by the overlap of analyzer development and treebank annotation.

============================================================================
4. File extensions and directory structure
============================================================================

Each FILE in docs/file.ids has a corresponding file in the following directories.

data/su_xml/FILE.su.xml
    (utf-8) su.xml file (see Section 1)

data/tdf/FILE.tdf
    (utf-8) tdf file (see Section 1)

data/pos/before/FILE.txt
    (utf-8) Information about the "source tokens" used for analysis
    with SAMA, i.e., a listing of each token before clitic
    separation.
Each token contains the following information:

-----------------------------------------------------------
INPUT STRING: (utf-8 characters from the .tdf file)
IS_TRANS:     (Buckwalter transliteration of the previous field,
               used as input to CALIMA and SAMA)
INDEX:        (automatically assigned index, based on paragraph# and
               word#)
OFFSETS:      (start-end - pair of integers - offset into the tdf
               file, corresponding to the INPUT STRING)
TOKENS:       (start-end - two indices indicating the tree tokens in
               the corresponding pos/after/FILE.txt file that
               correspond to this source token)
STATUS:       (the status of this solution with respect to the
               analyzers, as discussed in Section 2)
COMMENT:      (a comment associated with this source token)
LEMMA:        (the lemma associated with this source token and
               solution, from CALIMA or SAMA)
UNSPLITVOC:   (the vocalized form (not separated into segments) of
               the source token solution, from CALIMA or SAMA)
POS:          (POS for this source token)
VOC:          (vocalization for this source token)
GLOSS:        (gloss for this source token)
-----------------------------------------------------------

The POS, VOC, and GLOSS fields are redundant with the respective values of the corresponding tree tokens.

data/xml/treebank/FILE.xml
    The Annotation Graph .xml file for the tree tokens.

data/pos/after/FILE.txt
    Information about each tree token in the corresponding
    xml/treebank FILE.xml file.
Each token contains the following information:

-----------------------------------------------------------
INPUT STRING: (utf-8 characters from the .tdf file)
IS_TRANS:     (Buckwalter transliteration of the previous field)
COMMENT:      (annotator comment about the word)
INDEX:        (automatically assigned index, based on paragraph# and
               word#)
OFFSETS:      (start,end - pair of integers - Annotation Graph offset
               into the tdf file, corresponding to the INPUT STRING)
UNVOCALIZED:  (the unvocalized form of the word)
VOCALIZED:    (the vocalized form of the word, taken from the
               solution)
POS:          (the pos tag, taken from the solution)
GLOSS:        (the gloss, taken from the solution)
-----------------------------------------------------------

For more information about the derivation of the INPUT STRING and UNVOCALIZED fields for clitic-separated tree tokens, see the paper "Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank", mentioned at the beginning of this readme.

data/penntree/without-vowel/FILE.tree
    Penn Treebanking style output, generated from the integrated
    file. Each terminal is of the form (pos word), where pos and word
    correspond to the POS and UNVOCALIZED values for the
    corresponding token in pos/after/FILE.txt.

data/penntree/with-vowel/FILE.tree
    Penn Treebanking style output, generated from the integrated
    file. Each terminal is of the form (pos word), where pos and word
    correspond to the POS and VOCALIZED values for the corresponding
    token in pos/after/FILE.txt.

data/integrated/FILE.txt
    See Section 5 for a description of the integrated format.

============================================================================
5. Description of the "Integrated" format
============================================================================

The goal of this format is to bring together in one place:

1) the information about the source tokens from the pos/before
   files, including the explicit mapping between the source and tree
   tokens,
2) the information about the tree tokens from the pos/after files,
   and

3) the tree structure.

The basic format of each file is:

    FILEPREFIX:
    file metadata (beginning with ;;)
    CHUNK: filename:chunk#
    chunk metadata (beginning with ;;)
    #source tokens: #
    #tree tokens: #
    #trees: 1
    listing of source tokens
    listing of tree tokens
    TREE: filename:chunk#:tree#:#_tokens_in_tree
    tree with W# instead of the tree tokens

with the CHUNK and TREE sections repeated. (The file and chunk metadata is taken directly from the Annotation Graph file and can be ignored or used as the user likes.)

Each CHUNK corresponds to one "Paragraph" in the usual release terminology. (It is possible that in some versions of the treebank more than one tree may be associated with a paragraph, which is why there is a slot for the number of trees after the number of source tokens, and in the TREE: line. In this release, however, there is always 1 tree per chunk, with TOP wrapped around it.)

Each source token row consists of the following 17 items, separated by the character U+00B7:

1)  s:# - the source token #
2)  the source token text in utf8, corresponding to the source text
3)  the source token text, in Buckwalter transliteration
4)  the starting offset
5)  the ending offset
6)  the start index of the corresponding tree token(s)
7)  the end index of the corresponding tree token(s)
8)  a field reserved for future use, which should be ignored
9)  a field reserved for future use, which should be ignored
10) a field reserved for future use, which should be ignored
11) a field reserved for future use, which should be ignored
12) a field reserved for future use, which should be ignored
13) an annotator comment for this source token
14) the status of this token with respect to the analyzers, as
    discussed in Section 2 above.
    This is encoded as a 6-character string:
      character 1 is "has_sol"
      character 2 is "excluded"
      character 3 is "orig" (A for ARZ, M for MSA)
      character 4 is "sama"
      character 5 is "calima_all"
      character 6 is "calima_pv"
15) the lemma for this source token
16) the unsplit vocalization for this source token
17) a status indicating whether this source token is mapped to
    corresponding tree tokens (all "OK")

These fields correspond to the information in the pos/before files as follows:

1)    <-> INDEX (here counting from 0, in the pos/before file
          counting from 1)
2)    <-> INPUT STRING
3)    <-> IS_TRANS
4,5)  <-> OFFSETS
6,7)  <-> TOKENS (here counting from 0, in the pos/before file
          counting from 1)
14)   <-> STATUS
15)   <-> LEMMA
16)   <-> UNSPLITVOC

Each tree token row consists of the following 11 items:

1)  t:# - the tree token #
2)  POS tag
3)  "f" or "t" - a boolean indicating whether this token was split
    from the previous tree token
4)  "f" or "t" - a boolean indicating whether this token was split
    from the following tree token
5)  vocalized form
6)  gloss
7)  offset start
8)  offset end
9)  text in utf8, corresponding to the source .tdf text
10) unvocalized form
11) comment

These fields correspond to the information in the pos/after files as follows:

1)      <-> INDEX (here counting from 0, in the pos/after file
            counting from 1)
2)      <-> POS
3,4,5)  <-> VOCALIZED (with the separate boolean split information
            represented as hyphens on the VOCALIZED form)
6)      <-> GLOSS
7,8)    <-> OFFSETS
9)      <-> INPUT STRING
10)     <-> UNVOCALIZED
11)     <-> COMMENT

============================================================================
6. Additional Information
============================================================================

============================================================================
6a. data/xml/pos/FILE.xml not included
============================================================================

As in recent releases, we do not include the Annotation Graph .xml file for the source token file.
============================================================================
6b. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than one complete tree. The reason is that the annotators work on one "Paragraph" (CHUNK) at a time - i.e., a tree with the root "Paragraph" node, as can be seen by looking at the "treebanking" feature in the xml files. When the trees are generated, the "Paragraph" node is dropped. If the form of the annotation was (Paragraph S1 S2), where S1 and S2 are both complete trees, then they will appear on one line. This is also true for the integrated format, except that in that format "TOP" appears instead of "Paragraph".

============================================================================
6c. Non-ASCII punctuation characters
============================================================================

It is possible for the source text to contain non-ASCII characters outside the range of the Buckwalter transliteration. In these cases, we have indicated the Unicode value in the IS_TRANS field in the POS file. For example, in bolt-arz-DF-175-182185-10963606.arz.su.txt:

INPUT STRING: ♦
IS_TRANS: ♦
INDEX: P104W5
OFFSETS: 22-23
TOKENS: P104W6-P104W6
STATUS: has_sol=T excluded=T orig=ARZ sama=. calima_all=. calima_pv=.
COMMENT: []
LEMMA: [DEFAULT]
UNSPLITVOC: (♦)
POS: PUNC
VOC: ?
GLOSS: ?

============================================================================
6d. Characters not included in tokens
============================================================================

There are 3102 cases of sequences of characters in the source text that have not been included as tokens. The full listing of these 3102 cases is in the file not-included.txt.
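As an illustration of the integrated format (data/integrated/FILE.txt), the following sketch splits one source-token row on the U+00B7 separator. The dictionary keys are informal labels for the 17 positions listed in the format description, not names used by the release itself:

```python
# Sketch: split one source-token row of the integrated format on the
# U+00B7 separator. The keys below are informal labels for the 17
# positions described in the integrated-format section.
SEP = "\u00b7"

def parse_source_token_row(row):
    fields = row.split(SEP)
    if not fields[0].startswith("s:") or len(fields) != 17:
        raise ValueError("not a source-token row")
    # item 14: 6-char string - has_sol, excluded, orig, sama,
    # calima_all, calima_pv
    status = fields[13]
    return {
        "index": int(fields[0][2:]),                      # item 1, counting from 0
        "utf8_text": fields[1],                           # item 2
        "buckwalter": fields[2],                          # item 3
        "offsets": (int(fields[3]), int(fields[4])),      # items 4-5
        "tree_tokens": (int(fields[5]), int(fields[6])),  # items 6-7
        "comment": fields[12],                            # item 13
        "status": status,                                 # item 14
        "orig": {"A": "ARZ", "M": "MSA"}.get(status[2], status[2]),
        "lemma": fields[14],                              # item 15
        "unsplit_voc": fields[15],                        # item 16
        "mapped": fields[16],                             # item 17
    }
```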