This release continues to include the improvements in the organization
of the data and certain aspects of the annotation that have been
present in part or whole since ATB5.  For details on these changes,
particularly with regard to the INPUT STRING and UNVOCALIZED fields,
and the STATUS field indicating the relationship with the SAMA
3.1 Morphological Analyzer (LDC2010L01), please see the following paper,
also included in this release for your reading convenience:

Consistent and Flexible Integration of Morphological Annotation in 
the Arabic Treebank 
Seth Kulick, Ann Bies, Mohamed Maamouri
LREC 2010

The documentation below refers to "source tokens" and "tree tokens",
as explained in detail in this paper. Briefly, "source tokens" are the
whitespace/punctuation-delimited tokens (offset annotation) 
on the source text that receive a morphological analysis through the
SAMA analyzer.  The "tree tokens" result from splitting up these
source tokens into subsequences as appropriate for the annotation of
syntactic structure.


Contents:
1. Integration with SAMA
2. File Extensions and Directory Structure
3. Additional Information
3a. data/xml/pos/FILE.xml not included
3b. Description of the "Integrated" format
3c. A note about multiple trees on one line
3d. Metadata characters left out

=======================================================
1. Integration with SAMA
=======================================================

A significant change from the previous release of this data is that
information is now included making explicit the relation between each
source token and the SAMA 3.1 Morphological analyzer, as detailed in
the paper referenced above. We include here the totals for the
treebank-SAMA consistency for this release.

In this release, there are 432,976 source tokens, resulting in
517,080 tree tokens (after clitic splitting).  These 432,976
source tokens have the following categorizations:

STATUS 1: 415924  Included in SAMA
STATUS 2:    735  Limited Solution
STATUS 3:   3474  Pending SAMA Solution
STATUS 4:  12843  Excluded from check with SAMA
================
          432976

Note: We have refined the definition of status 4 since the 
paper, so that all tokens with PARTIAL, DIALECT, TRANSERR, and FOREIGN
are status 4.
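
The totals above can be re-checked with a few lines of Python. This is
only a sketch: the counts are copied by hand from the table in this
section, and the variable names are illustrative rather than part of
the release.

```python
# Status counts copied from the table in Section 1 of this file.
status_counts = {
    1: 415924,  # Included in SAMA
    2: 735,     # Limited Solution
    3: 3474,    # Pending SAMA Solution
    4: 12843,   # Excluded from check with SAMA
}

# The four categories should add up to the number of source tokens.
total = sum(status_counts.values())

# Share of each status as a percentage of all source tokens.
shares = {s: 100.0 * n / total for s, n in status_counts.items()}

for status, n in status_counts.items():
    print(f"STATUS {status}: {n:>6}  {shares[status]:6.2f}%")
print(f"total: {total}")
```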

============================================================================
2. File Extensions and Directory Structure
============================================================================

Each FILE in docs/file.ids has a corresponding file in the 
following directories.   

data/sgm/FILE.sgm  (utf-8)
    Source files.

data/pos/before/FILE.txt (utf-8)
   Information about the tokens used for analysis with SAMA
   (the "source tokens").
   This is a listing of each token before clitic-separation.
   Each token contains the following information:
-----------------------------------------------------------
INPUT STRING: (utf-8 characters from .sgm file)
    IS_TRANS: (Buckwalter transliteration of previous, used for input to SAMA)
       INDEX: (automatically assigned index, based on paragraph&word#)
     OFFSETS: (start,end - pair of integers - Annotation Graph offset into 
                            sgm file, corresponding to the INPUT STRING)
      TOKENS: (start-end - two indices indicating the tree tokens in the 
                            corresponding pos/after/FILE.txt file
                            that correspond to this source token)
      STATUS: (the status of this solution, with respect to SAMA)
       LEMMA: (the lemma associated with this source token and solution in 
              SAMA)
  UNSPLITVOC: (the vocalized form (not separated into segments) of the source
              token solution, from SAMA)
         POS: (pos for this source token)
         VOC: (vocalization for this source token)
       GLOSS: (gloss for this source token)
-----------------------------------------------------------
   The POS, VOC, and GLOSS fields are redundant with the respective values of
   the corresponding tree tokens.

data/xml/treebank/FILE.xml
   This Annotation Graph file consists of the "tree token" annotations.

data/pos/after/FILE.txt
   Information about each tree token in the corresponding
   xml/treebank FILE.xml file. 
   Each token contains the following information:
-----------------------------------------------------------
INPUT STRING: (utf-8 characters from .sgm file)
    IS_TRANS: (Buckwalter transliteration of previous)
     COMMENT: (annotator comment about word)
       INDEX: (automatically assigned index, based on paragraph&word#)
     OFFSETS: (start,end - pair of integers - Annotation Graph offset into 
                            sgm file, corresponding to the INPUT STRING)
 UNVOCALIZED: (the unvocalized form of the word)
   VOCALIZED: (the vocalized form of the word, taken from the solution)
         POS: (the pos tag, taken from the solution)
       GLOSS: (the gloss, taken from the solution)
-----------------------------------------------------------
   For more information about the derivation of the INPUT STRING and
   UNVOCALIZED fields for clitic separated tree tokens, see the paper
   "Consistent and Flexible Integration of Morphological Annotation in
   the Arabic Treebank," mentioned at the beginning of this file.

data/penntree/without-vowel/FILE.tree
   Penn Treebanking style output, generated from the
   xml/treebank .xml file.  Each terminal is of the form 
   (pos word), where pos and word correspond to the POS and UNVOCALIZED
   values for the corresponding token in pos/after/FILE.txt,
   respectively.

data/penntree/with-vowel/FILE.tree
   Penn Treebanking style output, generated from the
   xml/treebank .xml file. Each terminal is of the form 
   (pos word), where pos and word correspond to the POS and VOCALIZED
   values for the corresponding token in pos/after/FILE.txt,
   respectively.
   
data/integrated/FILE.txt
   See section 3b for a description of the integrated format.

(Note: The file formats in prior releases were somewhat different. 
Due to the nature of the ongoing improvements, the notes here refer only
to this release.  For similar information for previous releases, please see
the corresponding documentation in those releases.)

============================================================================
3. Additional Information
============================================================================

============================================================================
3a. data/xml/pos/FILE.xml not included
============================================================================

The data/xml/pos/FILE.xml file, if included, would contain the alternatives 
from the morphological analyzer at the time the analysis was originally done,
which are sometimes then modified later in the annotation process.  As a
result, the POS information in the pos-level .xml files is not necessarily the
same as in the treebank-level .xml files.  To avoid confusion we therefore do
not release the pos-level .xml files. Instead, the data/pos/before
.txt files and the integrated files contain the information about each
source token and, as of this release, further information on the token's
analysis in the treebank and its relation to SAMA.


============================================================================
3b. Description of the "Integrated" format
============================================================================

The goal of this format is to bring together in one place:
1) the information about the source tokens from the pos/before files,
including the explicit mapping between the source and tree tokens
2) the information about the tree tokens from the pos/after files
3) the tree structure

The basic format of each file is:

FILEPREFIX:
file metadata (beginning with ;;)

CHUNK: filename:chunk#
chunk metadata (beginning with ;;)
#source tokens: #
#tree tokens: #
#trees: 1
listing of source tokens
listing of tree tokens

TREE: filename:chunk#:tree#:#_tokens_in_tree
tree with W# instead of the tree tokens

with the CHUNK,TREE sections repeated.

(The file and chunk metadata is taken directly from the Annotation Graph file
and can be ignored or used as the user likes.) 

Each CHUNK corresponds to one "Paragraph" in the usual release terminology.
(It is possible that in some versions of the treebank more than one tree may
be associated with a paragraph, which is why there is a slot for
"#trees" after the # of tree tokens, and in the TREE: line. In this release
it is always 1 tree per chunk, however, with TOP wrapped around it.)
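
As an illustration of the TREE: line layout described above, the
following Python sketch splits such a line into its four parts. It
assumes the last three colon-separated items are always integers and
that optional whitespace may follow "TREE:"; the function name
parse_tree_header is hypothetical and not part of any released tool.

```python
def parse_tree_header(line):
    """Split a 'TREE: filename:chunk#:tree#:#_tokens_in_tree' line."""
    if not line.startswith("TREE:"):
        raise ValueError("not a TREE: line: %r" % line)
    rest = line[len("TREE:"):].strip()
    # Split from the right so that a filename containing ':' stays intact.
    filename, chunk, tree, n_tokens = rest.rsplit(":", 3)
    return {
        "filename": filename,
        "chunk": int(chunk),
        "tree": int(tree),
        "n_tokens": int(n_tokens),
    }


header = parse_tree_header("TREE:20000715_AFP_ARB.0009:2:1:35")
print(header)
```

Comparing n_tokens with the "#tree tokens" count of the same chunk is a
cheap sanity check when reading a file.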

Each source token row consists of the following 11 items, separated by the 
character U+00B7:

1) s:# - the source token #
2) the source token text in utf8, corresponding to the source text
3) the source token text, in Buckwalter transliteration 
4) the starting offset
5) the ending offset
6) the start index of the corresponding tree token(s)
7) the end index of the corresponding tree token(s)
8) the status of this token with respect to SAMA, as discussed in Section 1
above
9) the lemma for this source token
10) the unsplit vocalization for this source token
11) a status indicating whether this source token is mapped to corresponding
   tree tokens (All "OK")

These fields correspond to the information in the pos/before files
as follows:

1)   <-> INDEX 
     (here counting from 0, in the pos/before file counting from 1)
2)   <-> INPUT STRING
3)   <-> IS_TRANS
4,5) <-> OFFSETS
6,7) <-> TOKENS  
     (here counting from 0, in the pos/before file counting from 1)
8)   <-> STATUS
9)   <-> LEMMA
10)  <-> UNSPLITVOC
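
A source token row can be read by splitting on U+00B7. The Python
sketch below is a minimal illustration under some assumptions: that no
field itself contains U+00B7, that the offsets, tree token indices and
status are always integers, and that the row label may be padded with
spaces. The names SourceToken and parse_source_row are hypothetical.

```python
from typing import NamedTuple

SEP = "\u00b7"  # MIDDLE DOT, the field separator in the integrated files


class SourceToken(NamedTuple):
    index: int        # 1) source token #, counting from 0
    text: str         # 2) utf8 text
    translit: str     # 3) Buckwalter transliteration
    start: int        # 4) starting offset
    end: int          # 5) ending offset
    tree_start: int   # 6) first corresponding tree token
    tree_end: int     # 7) last corresponding tree token
    status: int       # 8) SAMA status
    lemma: str        # 9) lemma
    unsplit_voc: str  # 10) unsplit vocalization
    mapping: str      # 11) mapping status


def parse_source_row(row: str) -> SourceToken:
    fields = row.rstrip("\n").split(SEP)
    if len(fields) != 11:
        raise ValueError("expected 11 fields, got %d" % len(fields))
    label = fields[0].strip()
    if not label.startswith("s:"):
        raise ValueError("not a source token row: %r" % row)
    return SourceToken(
        int(label[2:]), fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        int(fields[5]), int(fields[6]),
        int(fields[7]), fields[8], fields[9], fields[10],
    )


row = "s:0  ·واوضح·wAwDH·1·7·0·1·1·[>awoDaH_1]·(wa>awoDaHa)·OK"
tok = parse_source_row(row)
print(tok.translit, tok.tree_start, tok.tree_end, tok.status)
```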

Each tree token row consists of the following 11 items:

1) t:# - the tree token #
2) POS tag
3) "f" or "t" - a boolean indicating whether this token was split
    from the previous tree token
4) "f" or "t" - a boolean indicating whether this token was split 
    from the following tree token
5) vocalized form
6) gloss
7) offset start
8) offset end
9) text in utf8, corresponding to the source .sgm text
10) unvocalized form
11) comment

These fields correspond to the information in the pos/after files
as follows:

1)     <-> INDEX
       (here counting from 0, in the pos/after file counting from 1)
2)     <-> POS
3,4,5) <-> VOCALIZED
       (with the separate boolean split information as hyphens on the
       VOCALIZED form)
6)     <-> GLOSS
7,8)   <-> OFFSETS
9)     <-> INPUT STRING
10)    <-> UNVOCALIZED
11)    <-> COMMENT
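
The relation between items 3, 4 and 5 and the hyphenated VOCALIZED
form of the pos/after files can be sketched in Python as follows. The
sketch assumes that no field contains U+00B7 and that the two boolean
fields are exactly "t" or "f"; the function names are hypothetical.

```python
SEP = "\u00b7"  # MIDDLE DOT, the field separator in the integrated files


def parse_tree_row(row):
    """Turn one tree token row into a dict keyed by field name."""
    fields = row.rstrip("\n").split(SEP)
    if len(fields) != 11:
        raise ValueError("expected 11 fields, got %d" % len(fields))
    label = fields[0].strip()
    if not label.startswith("t:"):
        raise ValueError("not a tree token row: %r" % row)
    return {
        "index": int(label[2:]),
        "pos": fields[1],
        "split_from_previous": fields[2] == "t",
        "split_from_following": fields[3] == "t",
        "vocalized": fields[4],
        "gloss": fields[5],
        "start": int(fields[6]),
        "end": int(fields[7]),
        "text": fields[8],
        "unvocalized": fields[9],
        "comment": fields[10],
    }


def hyphenated_vocalized(tok):
    """Rebuild the pos/after style VOCALIZED form with clitic hyphens."""
    prefix = "-" if tok["split_from_previous"] else ""
    suffix = "-" if tok["split_from_following"] else ""
    return prefix + tok["vocalized"] + suffix


row = "t:0  ·CONJ·f·t·wa·and·1·2·و·w·[Separated]"
print(hyphenated_vocalized(parse_tree_row(row)))
```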


For example, if a file has a chunk with:
#source tokens:32
#tree tokens:35
#trees:1
s:0  ·واوضح·wAwDH·1·7·0·1·1·[>awoDaH_1]·(wa>awoDaHa)·OK
[...]
t:0  ·CONJ·f·t·wa·and·1·2·و·w·[Separated]
t:1  ·PV+PVSUFF_SUBJ:3MS·t·f·>awoDaH+a·clarify/explain/indicate + he/it [verb]·2·7·اوضح·>wDH·[]
[...]


TREE:20000715_AFP_ARB.0009:2:1:35
(TOP (S W0 (VP W1 (NP-SBJ W2) (SBAR W3 (S (NP-TPC-2 (NP W4) (SBAR (WHNP-1 W5) (S (VP W6 (NP-SBJ-1 (-NONE- *T*)) (PP-DIR W7 (NP W8)) (NP-TMP (NP W9 (NP W10)) (NP W11)) (S-ADV (VP W12 (NP-SBJ (-NONE- *)) (PP-DIR W13 (NP W14)))))))) (VP (PRT W15) W16 (NP-SBJ-2 (-NONE- *T*)) (NP-ADV W17 (NP (NP (NP W18 (NP W19)) (ADJP W20)) (NP W21 W22 W23 W24))) (PP W25 (SBAR W26 (S (VP W27 (NP-OBJ W28) (NP-SBJ W29 W30 W31) (NP-TMP W32 (NP W33)))))))))) W34))

This indicates that:
1) The source token#0 maps to tree tokens #s 0 and 1.
The source token text is wAwDH, and the two corresponding tokens 
are wa/CONJ and >awoDaH+a/PV+PVSUFF_SUBJ:3MS. They are represented 
in the VOCALIZED field in the pos/after file with
wa- and ->awoDaH+a, whereas here the hyphen is indicated by the
f/t and t/f values in t:0 and t:1. (Of course this information is also
redundant with the mapping from the source token.)
2) The source token offset <1,7> for the source token wAwDH
has been partitioned into <1,2> for wa and <2,7> for >awoDaH+a.
3) The leaves W0 and W1 in the tree correspond to the tree tokens 
wa- and ->awoDaH+a.
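
To illustrate point 3, the W# placeholders in a TREE can be replaced by
terminals built from the tree token rows. The Python sketch below
assumes placeholders are always of the form W followed by digits and
that the replacement strings are supplied by the caller; the shortened
tree and the (pos word) rendering are illustrative only, and the
function name fill_leaves is hypothetical.

```python
import re


def fill_leaves(tree, terminals):
    """Replace each W# leaf in tree with terminals[#]."""
    def repl(match):
        return terminals[int(match.group(1))]
    # \b keeps node labels such as WHNP-1 from matching.
    return re.sub(r"\bW(\d+)\b", repl, tree)


# A shortened, hypothetical tree using only the first two tree tokens.
tree = "(TOP (S W0 (VP W1)))"
terminals = ["(CONJ wa-)", "(PV+PVSUFF_SUBJ:3MS ->awoDaH+a)"]
print(fill_leaves(tree, terminals))
```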

============================================================================
3c. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than
one complete tree.  The reason for this is that the annotators work on
one "Paragraph" (CHUNK) at a time - i.e., a tree with the root "Paragraph"
node, as can be seen by looking at the "treebanking" feature in the
xml files.  When the trees are generated, the "Paragraph" node is
dropped.  If the form of the annotation was (Paragraph S1 S2), where
S1 and S2 are both complete trees, then they will appear on one line.

This is also true for the integrated format, except that in that
format "TOP" appears instead of "Paragraph."  
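
When reading the .tree files, a line holding several trees can be
separated again by tracking bracket depth. The Python sketch below
assumes that literal parentheses never occur inside a terminal, so that
every "(" and ")" is a structural bracket; the function name
split_trees is hypothetical.

```python
def split_trees(line):
    """Return the complete top-level bracketed trees found on one line."""
    trees = []
    depth = 0
    start = None
    for i, ch in enumerate(line):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree begins here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' at position %d" % i)
            if depth == 0:
                trees.append(line[start:i + 1])
    if depth != 0:
        raise ValueError("unbalanced '(' in line")
    return trees


print(split_trees("(S (NP a)) (S (VP b))"))
```

In the integrated format this step is not needed, since the trees of a
chunk are wrapped in a single TOP bracket.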

============================================================================
3d. Metadata characters left out
============================================================================

There are 8028 cases of sequences of characters in the source text
that have not been included as tokens. These are generally metadata
included in the source files.  For example, in

ABUDHABI_ABUDHNEWS_ARB_20070111_115801.qrtr:76, paragraph #76, 
the text <non-MSA> at (53,64) is not included in the annotated tokens.