This release of this segment of the Arabic Treebank contains several
improvements in the organization of the data and certain aspects of the 
annotation since the previous release of this segment.
These changes are primarily:

1. Improvements have been made to the creation of the INPUT STRING tokens

2. Improvements have been made to the creation of the UNVOCALIZED tokens

3. An "integrated" file format has been included, bringing together in one 
place all of the information formerly spread out among different file formats 
and directories.  This includes the tree structure information, the different
forms of the tree tokens, and the relation between the tree tokens and source
tokens. 

4. A "Solution status" field is now included with each source token, making
explicit the relation between each source token and the SAMA 3.1 Morphological
Analyzer (LDC 2009E73).

5. The relation between a source token and its corresponding tree tokens is now
made explicit, rather than being only implicit in the offset information.

Sections 1 and 2 describe these new changes, and the integrated format
is described in detail in Section 4b.

==================================================================
Contents:
1. Improvements to INPUT STRING and UNVOCALIZED tokens
1a. Background for changes
1b. Improvements to problematic INPUT STRING tokens
1c. Improvements to problematic UNVOCALIZED tokens
2. Additional information for morphological annotations
2a. "Solution status" field is now included
2b. Explicit mapping between source and tree tokens, and
other changes in "before" pos file.
3. File Extensions and Directory Structure
4. Additional Information
4a. data/xml/pos/FILE.xml not included
4b. Description of the "Integrated" format
4c. A note about multiple trees on one line

=======================================================
1. Improvements to INPUT STRING and UNVOCALIZED tokens
=======================================================

The improvements in this section were first initiated for 
ATB5, an additional segment of the Arabic Treebank that is 
being prepared for a public release in 2011.  
The specific file and token references in this section refer
to data in ATB5, since this description of the improvements
was originally written to refer to that data.  However, the changes
described here for that data also apply to the current release.  
Note that the work on ATB5, and hence this improved creation
of the INPUT STRING and UNVOCALIZED values, was done between
the previous release of ATB3 and this current release.

-----------------------------------------------------------------
1a. Background for changes.
-----------------------------------------------------------------

There are two main parts to the treebank word-level tokenization.

1. The source text is broken up into roughly whitespace delimited tokens, 
henceforth called the "source tokens."  These are the tokens that are run 
through the SAMA morphological analyzer, resulting in a vocalized form, and 
information on these tokens has traditionally been included (and still is)
in the /data/pos/before-treebank directory.

2. These source tokens are split apart if appropriate during annotation 
(preposition prefixes, direct object suffixes, etc.). These tokens will 
henceforth be referred to as the "tree tokens," since these are the tokens
actually used for treebanking.  These tokens are traditionally included (and 
still are) in various formats in the data/xml/treebank, 
data/pos/after-treebank, and data/penntree/(with,without)-vowel directories.

For all of the source tokens that receive solutions from SAMA, the treebank 
annotation takes place on the *vocalized* tree tokens, since those are the 
output of SAMA, sometimes split into separate tokens. The solution from SAMA 
is a sequence of segments, each including vocalization/POS/gloss information, 
and these segments are partitioned into one or more tree tokens that together
correspond to the original source token.

For example:
-------------------------------------------------------------
One source token:
unvocalized - original text
        e.g., yktbh
vocalized - solution from SAMA
        e.g., [ya+kotub+u+hu, IV3MS+IV+IVSUFF_MOOD:I+IVSUFF_DO:3MS]
               he/it + write + [ind.] + it/him

Two corresponding tree token(s):
vocalized - vocalized source token potentially split
        e.g., [ya+kotub+u,IV3MS+IV+IVSUFF_MOOD:I] and [hu,IVSUFF_DO:3MS]
-------------------------------------------------------------
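
As a minimal sketch of the relation shown in the example above (not the
procedure used during annotation), the following Python groups the
"+"-separated segments of a solution into tree tokens. The function name and
the "boundaries" argument are hypothetical; where a source token is split is
decided in annotation, and the sketch assumes that the vocalization and POS
strings have the same number of segments.

```python
def partition_solution(voc, pos, boundaries):
    """Group the '+'-separated segments of a solution into tree tokens.

    boundaries lists the segment indices at which a new tree token starts
    (a hypothetical input; the real split is decided during annotation).
    """
    voc_segs, pos_segs = voc.split("+"), pos.split("+")
    if len(voc_segs) != len(pos_segs):
        raise ValueError("vocalization and POS have different segment counts")
    segments = list(zip(voc_segs, pos_segs))
    cuts = [0, *boundaries, len(segments)]
    tokens = []
    for start, end in zip(cuts, cuts[1:]):
        group = segments[start:end]
        tokens.append(("+".join(v for v, _ in group),
                       "+".join(p for _, p in group)))
    return tokens


# The yktbh example: the direct object suffix becomes its own tree token.
print(partition_solution("ya+kotub+u+hu",
                         "IV3MS+IV+IVSUFF_MOOD:I+IVSUFF_DO:3MS", [3]))
```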

Note that there is no other level of annotation of the tree token involved
in this process -- the annotated tree tokens are the vocalized tokens.  
Therefore, any type of unvocalized tree token that is released is derived from
this annotation in some way.  (The situation is different for the 
relatively infrequent tokens with solution status 2, 
as described in Section 2a.) 
In principle,  it could be left to users to experiment with the 
relation between the source token (what is actually present in the source file)
and the vocalized tree tokens (the end result of the annotation).

However, in all previous releases of the Arabic Treebank corpora, two
other forms of the tree tokens were released as well:

1. One, which we will call here the INPUT STRING, was an attempt to split
up the source token into substrings such that each substring corresponded to 
one of the vocalized tree tokens.  For example, in the above example, the two
tree tokens might have the INPUT STRINGs "yktb" and "h".

2. Also, an UNVOCALIZED form was included, which was a sort of a hybrid in 
earlier releases.  For source tokens that were not split, the 
UNVOCALIZED form was identical to the INPUT STRING.  For source tokens that 
were split, each UNVOCALIZED form was set to be simply the VOCALIZED form with
diacritics removed.  This hybrid nature of the UNVOCALIZED form is discussed 
further in Section 1c below and in this paper:

Mohamed Maamouri, Seth Kulick, Ann Bies 
Diacritic Annotation in the Arabic Treebank and Its Impact on Parser
Evaluation; LREC 2008, Marrakech, Morocco, May 28-30, 2008
http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf

While these derivative forms were supplied primarily for convenience, not as 
part of the annotation, we have endeavored in this release to fix all problems 
associated with the creation of these two forms. 

-----------------------------------------------------------------
1b. Improvements to problematic INPUT STRING tokens
-----------------------------------------------------------------

The algorithm used, prior to ATB5, to create the INPUT STRING tokens for the
tree tokens sometimes created incorrect INPUT STRING tokens.  (We use the
phrase "INPUT STRING token" to mean the INPUT STRING value associated with
some tree token; see Section 3 for definitions of all the values associated
with each token.)

For example, the source token "y>hlhA" might be given a solution
resulting in the two vocalized tree tokens "yu+>ah~il+u" and "hA". 
Using the old algorithm, the INPUT STRING tokens would have been
"y>hlh" and "A", clearly incorrect. With the new algorithm,
they are instead "y>hl" and "hA".

Another example: the source token "EmA" might be given a solution
resulting in the two vocalized tree tokens "Ean" and "mA".
Using the old algorithm, the INPUT STRING tokens would have been
"Em" and "A".  With the new algorithm, they are instead "E" and "mA".

The algorithm used since ATB5, and for this release, corrects such cases.
However, there is no general solution to the problem of using 
the source token and vocalized tree tokens in order to split up the source 
token accordingly.  The specific solution essentially requires accounting 
for all of the various sorts of normalization that might occur in SAMA as 
part of producing the vocalized tree tokens for each future corpus.  We plan
for future releases to continue utilizing the present improved creation of the
INPUT STRING tree tokens, as is done in this release.  However, this is not 
part of the annotation process itself, as explained above, and it is possible 
that future releases either will not include extensive checking on the
creation of these INPUT STRING tree tokens, or will leave out completely
such tokens.

(There are some remaining cases that are somewhat trickier to
categorize. For example, the source token INPUT STRING 
"mnA" has the solution "min+nA", separated into two different 
vocalized tree tokens, with the corresponding UNVOCALIZED tokens
"mn" and "nA" (see following section). However, there is only one
"n" to distribute among the two tokens for the tree token INPUT
STRING.  We have chosen to partition this as "m"+"nA", to keep the
INPUT STRING representation of the suffix consistent.)

-----------------------------------------------------------------
1c. Improvements to problematic UNVOCALIZED tokens
-----------------------------------------------------------------

As noted above in Section 1a, UNVOCALIZED tokens had an odd sort of hybrid 
definition. This led to inconsistencies in the treebank.

While the vocalized tree tokens have a clear definition as part of the 
annotation process, and the INPUT STRING tree tokens also have a reasonably 
clear meaning (even if nontrivial to obtain), this is not true of the 
UNVOCALIZED tokens.  In this release we have simplified the definition to make
the UNVOCALIZED tree tokens be the VOCALIZED tree tokens with diacritics 
stripped out (i.e., treating all tokens in the same way as split tokens were 
treated in earlier releases of this segment.)
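
The simplified definition above can be sketched in Python as follows. This
is an illustration only, not the code used to build the release: the set of
Buckwalter characters treated as diacritics is an assumption, as are the
names DIACRITICS, SEPARATORS, and unvocalize. It is meant for vocalized
Arabic word tokens; a punctuation token consisting of "-" would be emptied
by it.

```python
# Buckwalter characters treated as diacritics here (an assumption):
# short vowels a/i/u, sukun o, shadda ~, tanween F/K/N, dagger alif `.
DIACRITICS = set("aiuo~FKN`")
# '+' separates morphemes and '-' marks a clitic split in VOCALIZED forms.
SEPARATORS = set("+-")


def unvocalize(vocalized):
    """Return the VOCALIZED form with diacritics and separators removed."""
    return "".join(ch for ch in vocalized
                   if ch not in DIACRITICS and ch not in SEPARATORS)


# Forms taken from the examples in this section.
for form in ("{inoti$Ar+u-", "-hu", "Al+{iqotiSAd+i", "bi-"):
    print(form, "->", unvocalize(form))
```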

We illustrate this change with two examples showing what the UNVOCALIZED forms 
would have been without the current corrections, and how the current 
definition resolves previous inconsistencies.

--------------------------------------------
EXAMPLE 1:
--------------------------------------------
1) ALJZ_NEWS15_ARB_20060111_085801, P51
   source token=Ant$Arh yields two tree tokens:

   tree token P51W14
      VOCALIZED: {inoti$Ar+u-
       IS_TRANS: Ant$Ar
    UNVOCALIZED: {nt$Ar

   tree token P51W15
      VOCALIZED: -hu
       IS_TRANS: h
    UNVOCALIZED: h

   Since the source token was split, the UNVOCALIZED string for P51W14
   was set to the VOCALIZED token with diacritics removed under the
   old algorithm.

2) ALJZ_NEWS15_ARB_20050104_090001 P62
   source token=Ant$Ar yields one tree token:

   tree token P62W9
      VOCALIZED: {inoti$Ar+u
       IS_TRANS: Ant$Ar
    UNVOCALIZED: Ant$Ar

   Since the source token is not split, the UNVOCALIZED string for P62W9
   was set to IS_TRANS, under the old algorithm.

Therefore, using the old algorithm the two tokens appeared with the
same input string (Ant$Ar) and the same vocalized token ({inoti$Ar+u),
but different unvocalized tokens ({nt$Ar and Ant$Ar).

Using the new algorithm in this release with the current fix, the
UNVOCALIZED string for both is {nt$Ar.

--------------------------------------------
EXAMPLE 2:
--------------------------------------------
1) ALHURRA_NEWS13_ARB_20050412_130100, P141
   source token =  AlAqtSAd yields one tree token:

   tree token P141W21
      VOCALIZED: Al+{iqotiSAd+i
       IS_TRANS: AlAqtSAd
    UNVOCALIZED: AlAqtSAd

   Since the source token is unsplit, the UNVOCALIZED string for
   P141W21 was set to the IS_TRANS string, using the old algorithm.

2) ALHURRA_NEWS13_ARB_20051124_130100, P225
   source token =  bAlAqtSAd yields two tree tokens

   tree token P225W1
      VOCALIZED: bi-
       IS_TRANS: b
    UNVOCALIZED: b

   tree token P225W2
      VOCALIZED: -Al+{iqotiSAd+i
       IS_TRANS: AlAqtSAd
    UNVOCALIZED: Al{qtSAd

   Since the source token is split, the UNVOCALIZED string for P225W2
   was set to the VOCALIZED token with diacritics removed, under the
   old algorithm.

3) ALHURRA_NEWS13_ARB_20051124_130100, P8
   source token =  Al<qtSAd yields one tree token

   tree token P8W33
      VOCALIZED: Al+{iqotiSAd+i
       IS_TRANS: Al<qtSAd
    UNVOCALIZED: Al<qtSAd

   Since the source token is unsplit, the UNVOCALIZED string for P8W33
   was set to the IS_TRANS string, under the old algorithm.

Therefore, under the old algorithm, the vocalized token Al+{iqotiSAd+i
appeared with two different INPUT STRING tokens (AlAqtSAd and Al<qtSAd)
and three different UNVOCALIZED forms (AlAqtSAd, Al{qtSAd, and
Al<qtSAd).

Using the new algorithm in this release with the current fix, all three cases 
have the UNVOCALIZED form Al{qtSAd.  The INPUT STRING is a faithful recording 
of what is in the original source text, and so continues to appear as both 
AlAqtSAd and Al<qtSAd, since it appears both ways in the text.

--------------------------------------------
Even with the current corrections, it is sometimes the case that the UNVOCALIZED 
and INPUT STRING version of a tree token are different, as just noted for the 
previous example.  This was also true of the "EmA" example in Section 1b, in which 
the source token EmA corresponds to the two vocalized tree tokens Ean and mA.
The first token has the INPUT STRING token "E" and the UNVOCALIZED token "En".
The former arises from the split of the source token text, while
the latter arises from the removal of diacritics from the vocalized token.
--------------------------------------------

It is also sometimes the case that the INPUT STRING is empty while the
UNVOCALIZED form is not. For example, in
ALHURRA_NEWS13_ARB_20051123_130100.qrtr the source token <ly
corresponds to the two tree tokens P69W27 and P69W28, with vocalized
forms <ilay and ya, respectively.  The UNVOCALIZED forms for the two
tokens are <ly and y.  The INPUT STRING for the first token is <ly,
while for the second token it is empty.  This is because there is only
one "y" in the source token to partition amongst the two tokens.


=======================================================
2. Additional information for morphological annotations
=======================================================

=======================================================
2a. "Solution status" field is now included
=======================================================

A significant change from the previous release of this data is that
information is now included making explicit the relation between each source
token and the SAMA 3.1 Morphological Analyzer.  

Each token in the "before" pos file, which contains the information for the 
source tokens, includes a line for "STATUS", which has one of the values 1-4.
These values have the following meanings:

1. The source token and associated solution exactly match one of the possible 
solutions for this source token in SAMA 3.1.  That is, the
(POS,VOC,GLOSS,LEMMA,UNSPLITVOC) fields in the pos/before file for a given 
INPUT STRING (source token) exactly match one of the solutions in SAMA for 
that INPUT STRING.

For example, in file ANN20020115.0001.txt:

 INPUT STRING: جندياً
     IS_TRANS: jndyAF
        INDEX: P1W2
      OFFSETS: 4-11
       TOKENS: P1W2-P1W2
       STATUS: 1
        LEMMA: [junodiy~_1]
   UNSPLITVOC: (junodiy~AF)
          POS: NOUN+CASE_INDEF_ACC
          VOC: junodiy~+AF
        GLOSS: soldier + [acc.indef.]

indicates that the given solution exactly matches one of the SAMA solutions 
for the input word jndyAF.

2. The source token and associated solution are not a possible SAMA solution. 
It has instead been entered as a "manual" solution, which is a solution of
a very limited form, in which the "vocalization" is just the input string,
perhaps partitioned into separate segments.

For example, in file ANN20020215.0091.txt:
 INPUT STRING: سنترال
     IS_TRANS: sntrAl
        INDEX: P4W42
      OFFSETS: 229-236
       TOKENS: P4W49-P4W49
       STATUS: 2
        LEMMA: [TBupdate]
   UNSPLITVOC: None
          POS: FOREIGN
          VOC: sntrAl
        GLOSS: nogloss

This word cannot be status 1 because the given solution is not in SAMA.
It is status 2, rather than status 3, because the VOC value is the same
as the IS_TRANS value.

3. The source token and associated solution are not included in SAMA, 
but there has been some vocalization given as well. These solutions
should be considered as "pending" SAMA solutions, awaiting further arbitration
to be included in SAMA proper.

For example, in ANN20020115.0003.txt:

 INPUT STRING: بانه
     IS_TRANS: bAnh
        INDEX: P6W15
      OFFSETS: 68-73
       TOKENS: P6W18-P6W20
       STATUS: 3
        LEMMA: [bi>an~a_1]
   UNSPLITVOC: (bi>an~ahu)
          POS: PREP+SUB_CONJ+PRON_3MS
          VOC: bi+>an~a+hu
        GLOSS: by/with+that+it/he

This word cannot be status 1 because the given solution is not in SAMA.
It is status 3, rather than status 2, because the VOC value is not the
same as the IS_TRANS value.

4. The source token is a case of punctuation or a foreign 
word that is not included in the check for consistency with SAMA.  

For example, in ANN20020115.0001.txt:
 INPUT STRING: 650
     IS_TRANS: 650
        INDEX: P1W1
      OFFSETS: 0-4
       TOKENS: P1W1-P1W1
       STATUS: 4
        LEMMA: [DEFAULT]
   UNSPLITVOC: (650)
          POS: NOUN_NUM
          VOC: 650
        GLOSS: nogloss
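
The distinction drawn above between status 2 and status 3 can be sketched
in Python as follows. This is only an illustration under stated assumptions:
the function name is hypothetical, and it applies only to solutions already
known to be absent from SAMA and subject to the consistency check. Deciding
between status 1 and status 4 requires SAMA itself and is not attempted.

```python
def non_sama_status(is_trans, voc):
    """Return 2 for a "manual" solution and 3 for a "pending" one.

    Only meaningful for solutions that are known not to be in SAMA.
    """
    # A manual solution is just the input string, perhaps partitioned
    # into segments with '+', so the separators are removed first.
    return 2 if voc.replace("+", "") == is_trans else 3


print(non_sama_status("sntrAl", "sntrAl"))     # the FOREIGN example
print(non_sama_status("bAnh", "bi+>an~a+hu"))  # the bAnh example
```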


This status field is also included now as field 8 of the source token
in the integrated format. (See Section 4b.)

In this release, there are 339710 source tokens, categorized with the
following statuses:

STATUS 1: 287282
STATUS 2:    949
STATUS 3:   4323
STATUS 4:  47156
         =======
          339710

In current annotation and future releases of this segment, the
intent is that STATUS 2 will be reserved for those words
that are Arabic but are not expected to have a solution in 
SAMA (DIALECT, TYPO, FOREIGN, etc.), while STATUS 3 will be
reserved for those words that are Arabic and would ideally have
a solution in SAMA (such as the bAnh example above).  STATUS 4
will continue to be used for source tokens that are non-Arabic
and so "outside" of SAMA.

Please see the file errata.txt for more discussion of the tokens
that have status 3.

=======================================================
2b. Explicit mapping between source and tree tokens, and
other changes in "before" pos file.
=======================================================

The "before" pos file contains not only the STATUS information as 
described in 2a above, but also the full information for each solution
as it exists in SAMA (including the LEMMA and UNSPLITVOC fields), and an
explicit mapping to the tree tokens.

The new field TOKENS: indicates the mapping between the source token and
the corresponding tree tokens, which may be a 1-many relationship.  For
example, the token at index P2W11 in the "before" pos file for
ANN20020115.0001.txt is:

 INPUT STRING: سيشاركون
     IS_TRANS: sy$Arkwn
        INDEX: P2W11
      OFFSETS: 69-78
       TOKENS: P2W12-P2W13
       STATUS: 1
        LEMMA: [$Arak_1]
   UNSPLITVOC: (sayu$Arikuwna)
          POS: FUT_PART+IV3MP+IV+IVSUFF_SUBJ:MP_MOOD:I
          VOC: sa+yu+$Arik+uwna
        GLOSS: will+they (people) + participate with/share with + [masc.pl.]

which indicates that the two tree tokens, P2W12 and P2W13, arise
from this source token, as shown in the "after" pos file:

 INPUT STRING: س
     IS_TRANS: s
      COMMENT: []
        INDEX: P2W12
      OFFSETS: 69,70
  UNVOCALIZED: s
    VOCALIZED: sa-
          POS: FUT_PART
        GLOSS: will

 INPUT STRING: يشاركون
     IS_TRANS: y$Arkwn
      COMMENT: []
        INDEX: P2W13
      OFFSETS: 70,78
  UNVOCALIZED: y$Arkwn
    VOCALIZED: -yu+$Arik+uwna
          POS: IV3MP+IV+IVSUFF_SUBJ:MP_MOOD:I
        GLOSS: they (people) + participate with/share with + [masc.pl.]
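
A TOKENS value such as the one above can be expanded into the individual
tree token indices with a few lines of Python. This is a sketch that assumes
both ends of the range lie in the same paragraph, as in all of the examples
in this file; the names INDEX_RE and expand_tokens are hypothetical.

```python
import re

INDEX_RE = re.compile(r"P(\d+)W(\d+)")


def expand_tokens(tokens_field):
    """Expand a TOKENS value such as 'P2W12-P2W13' into a list of indices."""
    start, end = tokens_field.split("-")
    p1, w1 = map(int, INDEX_RE.fullmatch(start).groups())
    p2, w2 = map(int, INDEX_RE.fullmatch(end).groups())
    if p1 != p2:
        raise ValueError("start and end lie in different paragraphs")
    return [f"P{p1}W{w}" for w in range(w1, w2 + 1)]


print(expand_tokens("P2W12-P2W13"))
```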


The POS, VOC, and GLOSS fields in the "before" pos file are simply a
concatenation of their respective values in the corresponding tree tokens 
(with + as a separator and hyphens removed).  The LEMMA is now included 
in the "before" pos file, rather than the "after". That is because
the lemma is associated with a SAMA solution, and therefore associated with a 
source token. Earlier releases took the unnecessary step of assigning the 
lemma to one particular tree token, sometimes in an arbitrary way, with the 
other tokens for that lemma assigned the dummy lemma "[clitics]".  The 
UNSPLITVOC is the SAMA vocalization for the source token as a single word,
which can in some cases be distinct from the VOC formed from the vocalizations
of the separate morphemes.
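
The concatenation of POS, VOC, and GLOSS described at the start of the
paragraph above can be sketched as follows. The function name and the use
of dicts keyed by the "after" pos field names are assumptions made for the
illustration.

```python
def join_tree_tokens(tree_tokens):
    """Rebuild source-level POS, VOC and GLOSS from a list of tree tokens.

    Each tree token is a dict with the keys POS, VOCALIZED and GLOSS.
    """
    return {
        "POS": "+".join(t["POS"] for t in tree_tokens),
        # Leading and trailing hyphens mark clitic splits and are dropped.
        "VOC": "+".join(t["VOCALIZED"].strip("-") for t in tree_tokens),
        "GLOSS": "+".join(t["GLOSS"] for t in tree_tokens),
    }


tokens = [
    {"POS": "FUT_PART", "VOCALIZED": "sa-", "GLOSS": "will"},
    {"POS": "IV3MP+IV+IVSUFF_SUBJ:MP_MOOD:I",
     "VOCALIZED": "-yu+$Arik+uwna",
     "GLOSS": "they (people) + participate with/share with + [masc.pl.]"},
]
print(join_tree_tokens(tokens))
```

Applied to the two tree tokens P2W12 and P2W13 above, this yields the POS,
VOC, and GLOSS values shown for source token P2W11.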


============================================================================
3. File Extensions and Directory Structure
============================================================================

Each FILE in docs/file.ids has a corresponding file in the 
following directories.   

data/tdf/FILE.tdf  (utf-8)
    Source files.

data/pos/before-treebank/FILE.txt (utf-8)
   Information about the tokens used for analysis with SAMA
   (the "source tokens," in the terminology used in Section 1).
   So this is a listing of each token before clitic-separation.
   Each token contains the following information:
-----------------------------------------------------------
INPUT STRING: (utf-8 characters from .tdf file)
    IS_TRANS: (Buckwalter transliteration of previous, used for input to SAMA.)
       INDEX: (automatically assigned index, based on paragraph&word#)
     OFFSETS: (start,end - pair of integers - Annotation Graph offset into 
                            tdf file, corresponding to the INPUT STRING)
      TOKENS: (start-end - two indices indicating the tree tokens in the 
                            corresponding pos/after-treebank/FILE.txt file
                            that correspond to this source token)
      STATUS: (the status of this solution, with respect to SAMA.)
       LEMMA: (the lemma associated with this source token and solution in SAMA.)
  UNSPLITVOC: (the vocalized form (not separated into segments) of the source
              token solution, from SAMA)
         POS: (pos for this source token)
         VOC: (vocalization for this source token)
       GLOSS: (gloss for this source token)
-----------------------------------------------------------
   The POS, VOC, and GLOSS fields are redundant with the respective values of the
   corresponding tree tokens.  See Section 2  above for more detail on these
   fields, along with STATUS, LEMMA, and UNSPLITVOC.   
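
   A minimal Python sketch for reading these records is given here. It
   assumes that every record starts with an INPUT STRING line and that each
   field occupies a single "NAME: value" line, as in the examples in this
   file; the function name is hypothetical and the real files may contain
   cases this does not handle.

```python
from collections import Counter


def parse_pos_records(text):
    """Parse 'FIELD: value' lines into a list of dicts, one per token."""
    records = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # blank lines and separator lines
        # Split on the first colon only; values such as POS tags may
        # themselves contain colons.
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name == "INPUT STRING":
            records.append({})  # assumed to open a new record
        if records:
            records[-1][name] = value
    return records


sample = """\
 INPUT STRING: 650
     IS_TRANS: 650
        INDEX: P1W1
      OFFSETS: 0-4
       TOKENS: P1W1-P1W1
       STATUS: 4
        LEMMA: [DEFAULT]
   UNSPLITVOC: (650)
          POS: NOUN_NUM
          VOC: 650
        GLOSS: nogloss
"""
records = parse_pos_records(sample)
print(Counter(r["STATUS"] for r in records))
```

   The same function can be used on the pos/after-treebank files, since
   their records also begin with an INPUT STRING line.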

data/xml/treebank/FILE.xml
   As discussed in Section 1a, this consists of the result of splitting
   the tokens used for POS Annotation for the purposes of treebank
   annotation, and then modified during Treebank Annotation with
   tree information and further POS changes.  These are referred to 
   as "tree tokens" in Section 1a.

data/pos/after-treebank/FILE.txt
   Information about each tree token in the corresponding
   xml/treebank FILE.xml file. 
   Each token contains the following information:
-----------------------------------------------------------
INPUT STRING: (utf-8 characters from .tdf file)
    IS_TRANS: (Buckwalter transliteration of previous)
     COMMENT: (annotator comment about word)
       INDEX: (automatically assigned index, based on paragraph&word#)
     OFFSETS: (start,end - pair of integers - Annotation Graph offset into 
                            tdf file, corresponding to the INPUT STRING)
 UNVOCALIZED: (the unvocalized form of the word)
   VOCALIZED: (the vocalized form of the word, taken from the solution)
         POS: (the pos tag, taken from the solution)
       GLOSS: (the gloss, taken from the solution)
-----------------------------------------------------------
   See Sections 1b and 1c above for information about the derivation
   of INPUT STRING tokens and UNVOCALIZED tokens for clitic separated
   tree tokens.

data/penntree/without-vowel/FILE.tree
   Penn Treebanking style output, generated from the
   xml/treebank FILE.xml file.  Each terminal is of the form 
   (pos word), where pos and word correspond to the POS and UNVOCALIZED
   values for the corresponding token in pos/after-treebank/FILE.txt,
   respectively.

data/penntree/with-vowel/FILE.tree
   Penn Treebanking style output, generated from the
   xml/treebank FILE.xml file. Each terminal is of the form 
   (pos word), where pos and word correspond to the POS and VOCALIZED
   values for the corresponding token in pos/after-treebank/FILE.txt,
   respectively.
   
data/integrated/FILE.txt
   See section 4b for a description of the integrated format.

(Note: The file formats in prior releases were somewhat different. 
Due to the nature of the ongoing improvements, the notes here refer only
to this release.  For similar information for previous releases, please see
the corresponding documentation in those releases.)

============================================================================
4. Additional Information
============================================================================

============================================================================
4a. data/xml/pos/FILE.xml not included
============================================================================

The data/xml/pos/FILE.xml file, if included, would contain the alternatives 
from the morphological analyzer at the time the analysis was originally done,
which are sometimes then modified later in the annotation process.  As a
result, the POS information in the pos-level .xml files is not necessarily the
same as in the treebank-level .xml files.  To avoid confusion we therefore do
not release the pos-level .xml files. Instead, the data/pos/before-treebank
.txt files and the integrated files contain the information regarding the 
source tokens and, new with this release, further information on each token's
analysis in the treebank and its relation to SAMA. (See Sections 2a and 2b
above.)

============================================================================
4b. Description of the "Integrated" format
============================================================================

The goal of this format is to bring together in one place:
1) the information about the source tokens from the pos/before-treebank files,
including the explicit mapping between the source and tree tokens.
2) the information about the tree tokens from the pos/after-treebank files
3) the tree structure

The basic format of each file is:

FILEPREFIX:
file metadata (beginning with ;;)

CHUNK: filename:chunk#
chunk metadata (beginning with ;;)
#source tokens: #
#tree tokens: #
#trees: 1
listing of source tokens
listing of tree tokens

TREE: filename:chunk#:tree#:#_tokens_in_tree
tree with W# instead of the tree tokens

with the CHUNK,TREE sections repeated.

(The file and chunk metadata is taken directly from the Annotation Graph file
and can be ignored or used as the user likes.) 

Each CHUNK corresponds to one "Paragraph" in the usual release terminology.
(It is possible that in some versions of the treebank more than one tree may
be associated with a paragraph, which is why there is a slot for
"# trees" after the # of source tokens, and in the TREE: line. In this release
it is always 1 tree per chunk, however, with TOP wrapped around it.)

Each source token row consists of the following 11 items, separated by the 
character U+00B7:

1) s:# - the source token #
2) the source token text in utf8, corresponding to the source text.
3) the source token text, in Buckwalter transliteration 
4) the starting offset
5) the ending offset
6) the start index of the corresponding tree token(s)
7) the end index of the corresponding tree token(s)
8) the status of this token with respect to SAMA, as discussed in Section 2a
above.
9) The lemma for this source token
10) The unsplit vocalization for this source token
11) A status indicating whether this source token is mapped to corresponding
   tree tokens.  (All "OK")

These fields correspond to the information in the pos/before-treebank files
as follows:

1)   <-> INDEX 
     (here counting from 0, in the pos/before-treebank file counting from 1)
2)   <-> INPUT STRING
3)   <-> IS_TRANS
4,5) <-> OFFSETS
6,7) <-> TOKENS  
     (here counting from 0, in the pos/before-treebank file counting from 1)
8)   <-> STATUS
9)   <-> LEMMA
10)  <-> UNSPLITVOC
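
A source token row can be split into its 11 fields along these lines. This
is a sketch, not a reference reader: the field names in SOURCE_FIELDS are
hypothetical labels for the items listed above, and the code assumes that
U+00B7 never occurs inside a field value.

```python
SEP = "\u00b7"  # field separator in the integrated format
SOURCE_FIELDS = ("id", "utf8", "is_trans", "offset_start", "offset_end",
                 "tree_start", "tree_end", "status", "lemma", "unsplitvoc",
                 "mapping")  # hypothetical names for items 1-11
INT_FIELDS = ("offset_start", "offset_end", "tree_start", "tree_end", "status")


def parse_source_row(row):
    """Split one source token row into a dict keyed by SOURCE_FIELDS."""
    values = row.rstrip("\n").split(SEP)
    if len(values) != len(SOURCE_FIELDS):
        raise ValueError(
            f"expected {len(SOURCE_FIELDS)} fields, got {len(values)}")
    values[0] = values[0].strip()  # "s:0" may be followed by padding
    rec = dict(zip(SOURCE_FIELDS, values))
    for key in INT_FIELDS:
        rec[key] = int(rec[key])
    return rec


row = SEP.join(["s:0  ", "واوضح", "wAwDH", "0", "5", "0", "1", "1",
                "[>awoDaH_1]", "(wa>awoDaHa)", "OK"])
print(parse_source_row(row))
```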

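Each tree token row consists of the following 11 items (the rows in the
example below also carry a "nolemma" placeholder between the gloss and the
offset start, which is not part of this list):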

1) t:# - the tree token #
2) POS tag
3) "f" or "t" - a boolean indicating whether this token was split
    from the previous tree token
4) "f" or "t" - a boolean indicating whether this token was split 
    from the following tree token
5) vocalized form
6) gloss
7) offset start
8) offset end
9) text in utf8, corresponding to the source .tdf text.
10) unvocalized form
11) comment

These fields correspond to the information in the pos/after-treebank files
as follows:

1)     <-> INDEX
       (here counting from 0, in the pos/after-treebank file counting from 1)
2)     <-> POS
3,4,5) <-> VOCALIZED
       (with the separate boolean split information as hyphens on the
       VOCALIZED form.)
6)     <-> GLOSS
7,8)   <-> OFFSETS
9)     <-> INPUT STRING
10)    <-> UNVOCALIZED
11)    <-> COMMENT
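
The correspondence between fields 3, 4, 5 and the VOCALIZED value can be
sketched in Python as follows (the function name is hypothetical):

```python
def vocalized_with_hyphens(split_prev, split_next, vocalized):
    """Rebuild the pos/after-treebank VOCALIZED value from fields 3, 4, 5.

    split_prev and split_next are the "f"/"t" flags of the tree token row.
    """
    prefix = "-" if split_prev == "t" else ""
    suffix = "-" if split_next == "t" else ""
    return f"{prefix}{vocalized}{suffix}"


print(vocalized_with_hyphens("f", "t", "wa"))
print(vocalized_with_hyphens("t", "f", ">awoDaH+a"))
```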


For example if a file has a chunk with:
#source tokens:27
#tree tokens:32
#trees:1
s:0  ·واوضح·wAwDH·0·5·0·1·1·[>awoDaH_1]·(wa>awoDaHa)·OK
[...]
t:0  ·CONJ·f·t·wa·and·nolemma·0·1·و·w·[]
t:1  ·PV+PVSUFF_SUBJ:3MS·t·f·>awoDaH+a·clarify/explain/indicate + he/it [verb]·nolemma·1·5·اوضح·>wDH·[]
[...]

TREE:AAW_ARB_20080502.0027-S1:2:1:32
(TOP (S W0 (VP W1 (NP-SBJ (-NONE- *)) (SBAR W2 (S (NP-TPC-1 W3 (NP W4)) (VP (PRT W5) W6 (NP-SBJ-1 (-NONE- *T*)) (PP-CLR W7 (NP (NP (ADJP W8 (NP W9))) (NP-ADV W10 (NP (NP W11) (SBAR (SBAR (WHNP-2 W12) (S (VP W13 (NP-SBJ-2 (-NONE- *T*)) (NP-OBJ-2 (-NONE- *)) (PP-CLR W14 (NP W15)) (PP W16 (NP (NP W17 (NP W18)) (PP W19 (NP W20 (NP (NP W21 W22) (ADJP W23))))))))) W24 W25 (SBAR (WHNP-3 (-NONE- *0*)) (S (NP-SBJ W26 (NP (NP W27) (NP-3 (-NONE- *T*)))) (NP-PRD W28 (NP W29 (NP W30)))))))))))))) W31))


This indicates that:
1) The source token #0 maps to tree tokens #0 and #1.
The source token text is wAwDH, and the two corresponding tokens 
are wa/CONJ and >awoDaH+a/PV+PVSUFF_SUBJ:3MS. They are represented 
in the VOCALIZED field in the pos/after-treebank file with
wa- and ->awoDaH+a, whereas here the hyphen is indicated by the
f/t and t/f values in t:0 and t:1. (Of course this information is also
redundant with the mapping from the source token.)
2) The source token offset <0,5> for the source token wAwDH
has been partitioned into <0,1> for wa and <1,5> for >awoDaH+a.
3) The leaves W0 and W1 in the tree correspond to the tree tokens 
wa- and ->awoDaH+a.
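
To turn a TREE entry back into a tree with (pos word) terminals, each W#
leaf can be replaced by the corresponding tree token. The following is a
sketch under the assumption that a leaf is always written as W followed by
digits, preceded by whitespace and followed by whitespace or ")"; the names
LEAF_RE and fill_leaves are hypothetical.

```python
import re

# Matches leaves such as W0 or W31, but not labels such as WHNP-2.
LEAF_RE = re.compile(r"(?<=\s)W(\d+)(?=[\s)])")


def fill_leaves(tree, tree_tokens):
    """Replace each W# leaf with '(POS word)'.

    tree_tokens[i] is a (pos, word) pair for tree token i, counting from 0.
    """
    def repl(match):
        pos, word = tree_tokens[int(match.group(1))]
        return f"({pos} {word})"
    return LEAF_RE.sub(repl, tree)


# A shortened tree, built by hand for illustration only.
tree = "(TOP (S W0 (VP W1 (NP-SBJ (-NONE- *)))))"
tokens = [("CONJ", "wa-"), ("PV+PVSUFF_SUBJ:3MS", "->awoDaH+a")]
print(fill_leaves(tree, tokens))
```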

============================================================================
4c. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than
one complete tree.  The reason for this is that the annotators work on
one "Paragraph" (CHUNK) at a time - i.e., a tree with the root "Paragraph"
node, as can be seen by looking at the "treebanking" feature in the
xml files.  When the trees are generated, the "Paragraph" node is
dropped.  If the form of the annotation was (Paragraph S1 S2), where
S1 and S2 are both complete trees, then they will appear on one line.

This is also true for the integrated format, except that in that
format "TOP" appears instead of "Paragraph."
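
A line of a .tree file that holds more than one tree can be separated by
tracking parenthesis depth, as in the sketch below. It assumes that
parentheses occur only as tree brackets (that is, that any literal
parentheses in the text are escaped in the .tree files); the function name
is hypothetical.

```python
def split_trees(line):
    """Split a line into its top-level bracketed trees."""
    trees, depth, start = [], 0, None
    for i, ch in enumerate(line):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree begins here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')'")
            if depth == 0:
                trees.append(line[start:i + 1])
    if depth != 0:
        raise ValueError("unbalanced '('")
    return trees


print(split_trees("(S (NP a)) (S (VP b))"))
```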