(An updated version of this page, listing known issues,
will be kept at http://projects.ldc.upenn.edu/ArabicTreebank/)


As noted in readme-files.txt, we now categorize all source tokens
with a status value of 1, 2, 3, or 4, depending on their relation
to SAMA.  To repeat the information from readme-files.txt, the
tokens are distributed across these categories as follows:

STATUS 1: 122328
STATUS 2:    178
STATUS 3:   1289
STATUS 4:  21591
         =======
          145386

Thus, excluding the punctuation and numeric tokens that receive
status 4, 122328/123795(=122328+178+1289) = 98.8% of the tokens
have status 1.  We have closely examined the most frequent instances of
tokens with status 3, and have corrected all the ones that had mild
differences from the proper solution in SAMA.  A residue is left that
we comment on here.  Each word is listed as it appears in the source
text, preceded by the number of times it occurs.  We mainly include here
tokens that occur 10 times or more (a few closely related forms with
lower counts are also listed), although other less frequently occurring
tokens fall into some of the following groups.
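
The percentage quoted above can be rechecked with a few lines of Python.
This is only a sketch: the counts are copied from the table above, and
the variable names are illustrative, not part of the release.

```python
# Token counts per STATUS value, copied from the table above.
status_counts = {1: 122328, 2: 178, 3: 1289, 4: 21591}

total = sum(status_counts.values())

# STATUS 4 covers punctuation and numeric tokens, so it is left out
# of the denominator when measuring agreement with SAMA.
non_status4 = total - status_counts[4]
status1_share = 100.0 * status_counts[1] / non_status4

print(f"all tokens:         {total}")
print(f"excluding status 4: {non_status4}")
print(f"status 1 share:     {status1_share:.1f}%")
```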


----------------------
Non-initial "A" cases:
----------------------
These are cases of the "missing hamza" problem with
a non-initial "A".  The desired solutions are currently
missing from SAMA 3.1.

  62 lAn
  55 bAn
  23 bAnh
  21 lAnh
  20 bAnhA
  16 bdAt
  15 b$An

----------------------------
Correct solution not in SAMA
----------------------------
There are other cases for which the correct solution is missing
from SAMA 3.1 and needs a new entry:

  13 ldynA
   9 ldyhA
   8 ldyh

These appear as NOUN+PRON in SAMA 3.1, when they should be
NOUN+POSS_PRON.


---------------------------
Change required to treebank
---------------------------

  30 wAHdp
  30 wAHd
  28 AlvlAvp
  18 AlvlAv
  13 AlArbEp
  10 AlmtwsT

These are cases in which some instances of these tokens in the current
segment have a morphological/POS solution that should be changed to be
consistent with a solution in SAMA.  (Other instances of these tokens are
already consistent with SAMA, and so have status 1.)  In general, these
are changes relating to NOUN and ADJ, and correcting them would require
changes to both the tree and the tokens.  We have decided to leave
this set of cases for the next revision of this segment.


=======================================
A note about NOUN_PROP characterization
=======================================

The STATUS categories are based on running SAMA in the "extended" mode,
in which it can return a default solution.  If the solution in the 
treebank is that default solution, then this counts as "STATUS 1".  

For example, the solution in 20000715_AFP_ARB.0012.txt
     IS_TRANS: ErfAt
        INDEX: P9W26
      OFFSETS: 157-163
       TOKENS: P9W31-P9W31
       STATUS: 1
        LEMMA: [EarafAt_1]
   UNSPLITVOC: (EarafAt)
          POS: NOUN_PROP
          VOC: EarafAt
        GLOSS: Arafat

and the solution in 20000815_AFP_ARB.0017.txt
     IS_TRANS: EryqAt
        INDEX: P8W29
      OFFSETS: 167-173
       TOKENS: P8W33-P8W33
       STATUS: 1
        LEMMA: [DEFAULT]
   UNSPLITVOC: (EryqAt)
          POS: NOUN_PROP
          VOC: EryqAt
        GLOSS: NOT_IN_LEXICON

are both "STATUS: 1", although the former comes from a solution in the
SAMA tables, while the latter is just the default solution.  
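
A script can tell the two cases apart by looking at the LEMMA field.
The Python sketch below assumes that each solution is a block of
"KEY: value" lines laid out as in the two examples above (only some of
the fields are repeated here).  The names parse_record and
is_default_solution are hypothetical helpers written for this note; they
are not part of SAMA or of any treebank tool.

```python
def parse_record(text):
    """Turn a block of 'KEY: value' lines into a dict.

    Assumes one field per line, with the field name before the
    first colon, as in the examples above.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unrelated lines
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def is_default_solution(fields):
    """True if the LEMMA field marks the default solution."""
    return fields.get("LEMMA") == "[DEFAULT]"


rec_lexicon = parse_record("""
     IS_TRANS: ErfAt
       STATUS: 1
        LEMMA: [EarafAt_1]
          POS: NOUN_PROP
""")
rec_default = parse_record("""
     IS_TRANS: EryqAt
       STATUS: 1
        LEMMA: [DEFAULT]
          POS: NOUN_PROP
""")

for rec in (rec_lexicon, rec_default):
    print(rec["IS_TRANS"], rec["STATUS"], is_default_solution(rec))
```

Both records report STATUS 1, so only the LEMMA check separates the
solution found in the SAMA tables from the default one.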

There are 17168 NOUN_PROPs with STATUS 1, of which 1458 (8.5%) have the 
DEFAULT solution.  In future we may modify the definition of the STATUS
categories to better reflect this difference.
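
As a quick check of the figures above, the following Python sketch
recomputes the share of DEFAULT solutions.  The two counts are taken
from the preceding paragraph; the variable names are illustrative only.

```python
noun_prop_status1 = 17168  # NOUN_PROP tokens with STATUS 1
noun_prop_default = 1458   # of those, tokens with the DEFAULT solution

default_share = 100.0 * noun_prop_default / noun_prop_status1

# The remainder are STATUS 1 NOUN_PROPs whose solution is not DEFAULT.
from_sama_tables = noun_prop_status1 - noun_prop_default

print(f"DEFAULT share:    {default_share:.1f}%")
print(f"from SAMA tables: {from_sama_tables}")
```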