(An updated version of this page, listing known issues, will be kept
at http://projects.ldc.upenn.edu/ArabicTreebank/)

As noted in readme-files.txt, we now categorize all source tokens
with a status value of 1, 2, 3, or 4, depending on their relation to
SAMA. To repeat the information from readme-files.txt, the tokens are
categorized as:

STATUS 1:  122328
STATUS 2:     178
STATUS 3:    1289
STATUS 4:   21591
          =======
           145386

Thus, excluding the punctuation and numeric tokens that receive
status 4, 122328/123795 (123795 = 122328 + 178 + 1289) = 98.8% of the
tokens have status 1.

We have closely examined the most frequent instances of tokens with
status 3, and have corrected all the ones that differed only mildly
from the proper solution in SAMA. A residue is left, which we comment
on here. Each word is listed as it appears in the source text,
preceded by the number of times it occurs. We only include here
tokens that occur 10 times or more, although other, less frequently
occurring tokens fall into some of the following groups.

----------------------
Non-initial "A" cases:
----------------------

These are cases of the "missing hamza" problem with a non-initial
"A". The desired solutions are currently missing in SAMA 3.1.

62 lAn
55 bAn
23 bAnh
21 lAnh
20 bAnhA
16 bdAt
15 b$An

---------------------------
Correct solution not in SAMA
---------------------------

There are other cases for which the correct solution is missing from
SAMA 3.1 and needs a new entry:

13 ldynA
9 ldyhA
8 ldyh

These appear as NOUN+PRON in SAMA 3.1, when they should be
NOUN+POSS_PRON.

---------------------------
Change required to treebank
---------------------------

30 wAHdp
30 wAHd
28 AlvlAvp
18 AlvlAv
13 AlArbEp
10 AlmtwsT

These are cases in which some instances of these tokens in the
current segment have a morphological/POS solution that should be
changed to be consistent with a solution in SAMA (other instances of
these tokens are already consistent with SAMA, and so have status 1).
In general, these are changes relating to NOUN and ADJ, and
correcting these cases would require changes both to the tree and
tokens. We have decided to leave this set of cases for the next
revision of this segment.

=======================================
A note about NOUN_PROP characterization
=======================================

The STATUS categories are based on running SAMA in the "extended"
mode, in which it can return a default solution. If the solution in
the treebank is that default solution, then this counts as
"STATUS 1". For example, the solution in 20000715_AFP_ARB.0012.txt

IS_TRANS: ErfAt
INDEX: P9W26
OFFSETS: 157-163
TOKENS: P9W31-P9W31
STATUS: 1
LEMMA: [EarafAt_1]
UNSPLITVOC: (EarafAt)
POS: NOUN_PROP
VOC: EarafAt
GLOSS: Arafat

and the solution at 20000815_AFP_ARB.0017.txt

IS_TRANS: EryqAt
INDEX: P8W29
OFFSETS: 167-173
TOKENS: P8W33-P8W33
STATUS: 1
LEMMA: [DEFAULT]
UNSPLITVOC: (EryqAt)
POS: NOUN_PROP
VOC: EryqAt
GLOSS: NOT_IN_LEXICON

are both "STATUS: 1", although the former comes from a solution in
the SAMA tables, while the latter is just the default solution.

There are 17168 NOUN_PROPs with STATUS 1, of which 1458 (8.5%) have
the DEFAULT solution. In future we may modify the definition of the
STATUS categories to better reflect this difference.
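
The distinction described above (a STATUS 1 NOUN_PROP that comes from
the SAMA tables versus one that is only the default solution) can be
checked mechanically from the LEMMA field. The following is a minimal
sketch, not a tool shipped with this release. It assumes each record
is available as text with one "KEY: value" field per line; the
function names (parse_record, is_default_solution, default_share,
status1_share) are hypothetical and introduced only for illustration.
The counts are the ones quoted in this file.

```python
from typing import Dict, Iterable

# Token counts per STATUS value, as quoted in this file.
STATUS_COUNTS = {1: 122328, 2: 178, 3: 1289, 4: 21591}


def parse_record(text: str) -> Dict[str, str]:
    """Parse one record, assuming one 'KEY: value' field per line."""
    record = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


def is_default_solution(record: Dict[str, str]) -> bool:
    """True if the record carries the default (not-in-lexicon) solution."""
    return record.get("LEMMA") == "[DEFAULT]"


def default_share(records: Iterable[Dict[str, str]]) -> float:
    """Fraction of STATUS 1 NOUN_PROP records that use the default solution."""
    props = [r for r in records
             if r.get("POS") == "NOUN_PROP" and r.get("STATUS") == "1"]
    if not props:
        return 0.0
    return sum(is_default_solution(r) for r in props) / len(props)


def status1_share(counts: Dict[int, int]) -> float:
    """Share of STATUS 1 tokens, excluding STATUS 4 (punctuation, numbers)."""
    considered = sum(n for status, n in counts.items() if status != 4)
    return counts[1] / considered


ARAFAT = parse_record("""
IS_TRANS: ErfAt
STATUS: 1
LEMMA: [EarafAt_1]
POS: NOUN_PROP
GLOSS: Arafat
""")

ERYQAT = parse_record("""
IS_TRANS: EryqAt
STATUS: 1
LEMMA: [DEFAULT]
POS: NOUN_PROP
GLOSS: NOT_IN_LEXICON
""")

if __name__ == "__main__":
    print(round(100 * status1_share(STATUS_COUNTS), 1))  # 98.8
    print(round(100 * 1458 / 17168, 1))                  # 8.5
    print(default_share([ARAFAT, ERYQAT]))               # 0.5
```

Keying on LEMMA rather than GLOSS is a choice made for this sketch;
both fields mark the default solution in the two examples shown
above, but this file does not state which one is authoritative.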