(An updated version of this page, listing known issues, will be kept at
http://projects.ldc.upenn.edu/ArabicTreebank/)

As noted in readme-files.txt, we now categorize all source tokens with a
status value of 1, 2, 3, or 4, depending on their relation with SAMA. To
repeat the information from readme-files.txt, these tokens are categorized as:

STATUS 1: 287282
STATUS 2:    949
STATUS 3:   4323
STATUS 4:  47156
          =======
          339710

Thus, excluding the punctuation and numeric tokens that receive status 4,
287282/(287282+949+4323) = 287282/292554 = 98.2% of the tokens have status 1.

We have closely examined the most frequent instances of tokens with status 3,
and have corrected all the ones that had mild differences from the proper
solution in SAMA. A residue is left that we comment on here. Each word is
listed as it appears in the source text, preceded by the number of times it
occurs. We only include here tokens that occur 15 times or more, although
other less frequently occurring tokens fall into some of the following groups.

----------------------
Non-initial "A" cases:
----------------------

These are cases of the "missing hamza" problem with a non-initial "A". The
desired solutions are currently missing in SAMA 3.1.

91 lAn
39 bAnh
39 bAn
37 lAnh
21 lAnhA
17 bAnhA

---------------------------
Correct solution not in SAMA
---------------------------

There are other cases for which the correct solution is missing from SAMA 3.1
and needs a new entry:

27 ldyh
19 ldynA
19 ldyhA
18 ldyhm

These appear as NOUN+PRON in SAMA 3.1, when they should be NOUN+POSS_PRON.

77 AyDAF
15 >yDAF

There is a "hole" in SAMA 3.1, such that the solution that appears for AyDA
and >yDA does not appear when the "F" is included in the input string.

23 ynbgy

The IV-based solution for ynbgy, as present in ATB3, is missing in SAMA 3.1.
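
The status breakdown given at the top of this page can be rechecked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the table above, and the variable names are made up for the example rather than taken from any LDC tool.

```python
# Token counts per status value, copied from the status table on this page.
status_counts = {1: 287282, 2: 949, 3: 4323, 4: 47156}

# Grand total over all four status values.
total = sum(status_counts.values())

# Status 4 covers punctuation and numeric tokens, so it is excluded
# from the denominator when computing the share of status 1 tokens.
non_status4 = sum(n for status, n in status_counts.items() if status != 4)
status1_share = status_counts[1] / non_status4

print(f"total tokens:       {total}")
print(f"excluding status 4: {non_status4}")
print(f"status 1 share:     {status1_share:.1%}")
```
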
---------------------------
Change required to treebank
---------------------------

75 wAHdp
67 wAHd
59 AlvlvA'
57 Alkvyr
41 AlvlAvp
16 bkvyr
15 kvyrA

These are cases in which some instances of these tokens in the current segment
have a morphological/pos solution that should be changed to be consistent with
a solution in SAMA (other instances of these tokens are already consistent
with SAMA, and so have status 1). In general, these are changes relating to
NOUN and ADJ, and correcting these cases would require changes both to the
tree and tokens. We have decided to leave this set of cases for the next
revision of this segment.

---------------------------
miscellaneous
---------------------------

In the annotation for ANN20020515.0082, this solution (from the pos/before
file) is an error, since the given solution cannot be the solution for wb_.

IS_TRANS: wb_
INDEX: P13W48
OFFSETS: 248-251
TOKENS: P13W54-P13W56
STATUS: 3
LEMMA: None
UNSPLITVOC: None
POS: CONJ+PREP+NOUN
VOC: wa+bi+duwlArAtK
GLOSS: and+by/with+dollars
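
A record like the one above can be split into its fields for inspection. The sketch below is an assumption-laden illustration, not a description of any LDC tool: it assumes each field sits on its own line in "KEY: value" form, and the function name parse_record is hypothetical. The actual layout of the pos/before files is documented in readme-files.txt and may differ.

```python
# The record quoted above, with one "KEY: value" field per line (assumed layout).
RECORD = """\
IS_TRANS: wb_
INDEX: P13W48
OFFSETS: 248-251
TOKENS: P13W54-P13W56
STATUS: 3
LEMMA: None
UNSPLITVOC: None
POS: CONJ+PREP+NOUN
VOC: wa+bi+duwlArAtK
GLOSS: and+by/with+dollars
"""


def parse_record(text):
    """Split a record into a dict, using the first colon on each line."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


record = parse_record(RECORD)

# POS, VOC and GLOSS are "+"-separated and can be read side by side.
pos = record["POS"].split("+")
voc = record["VOC"].split("+")
gloss = record["GLOSS"].split("+")

print(f"input token: {record['IS_TRANS']}  status: {record['STATUS']}")
for p, v, g in zip(pos, voc, gloss):
    print(f"{p:5} {v:12} {g}")
```

Listing the segments next to the input token makes the mismatch noted above easy to see: the NOUN segment of the vocalization has no counterpart in the input string wb_.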