This release continues to include the improvements in the organization of
the data and certain aspects of the annotation that have been present in
part or whole since ATB5. For details on these changes, particularly with
regard to the INPUT STRING and UNVOCALIZED fields, and the STATUS field
indicating the relationship with the SAMA 3.1 Morphological Analyzer
(LDC2010L01), please see the following paper, also included in this release
for your reading convenience:

    Consistent and Flexible Integration of Morphological Annotation
    in the Arabic Treebank
    Seth Kulick, Ann Bies, Mohamed Maamouri
    LREC 2010

The documentation below refers to "source tokens" and "tree tokens", as
explained in detail in this paper. Briefly, "source tokens" are the
whitespace/punctuation-delimited tokens (offset annotation) on the source
text that receive a morphological analysis through the SAMA analyzer. The
"tree tokens" result from splitting up these source tokens into
subsequences as appropriate for the annotation of syntactic structure.

Contents:

1. STATUS field totals
2. File Extensions and Directory Structure
3. Additional Information
   3a. data/xml/pos/FILE.xml not included
   3b. Description of the "Integrated" format
   3c. A note about multiple trees on one line
   3d. Metadata characters left out

=======================================================
1. Integration with SAMA
=======================================================

A significant change from the previous release of this data is that
information is now included making explicit the relation between each
source token and the SAMA 3.1 Morphological analyzer, as detailed in the
paper referenced above. We include here the totals for the treebank-SAMA
consistency for this release.

In this release, there are 432,976 source tokens, resulting in 517,080 tree
tokens (after clitic splitting).
These 432,976 source tokens have the following categorizations:

STATUS 1:  415924  Included in SAMA
STATUS 2:     735  Limited Solution
STATUS 3:    3474  Pending SAMA Solution
STATUS 4:   12843  Excluded from check with SAMA
           ================
           432976

Note: We have refined the definition of status 4 since the paper, so that
all tokens with PARTIAL, DIALECT, TRANSERR, and FOREIGN are status 4.

============================================================================
2. File Extensions and Directory Structure
============================================================================

Each FILE in docs/file.ids has a corresponding file in the following
directories.

data/sgm/FILE.sgm (utf-8)

Source files.

data/pos/before/FILE.txt (utf-8)

Information about the tokens used for analysis with SAMA (the "source
tokens"). This is a listing of each token before clitic-separation. Each
token contains the following information:

-----------------------------------------------------------
INPUT STRING: (utf-8 characters from .sgm file)
IS_TRANS: (Buckwalter transliteration of previous, used for input to SAMA)
INDEX: (automatically assigned index, based on paragraph&word#)
OFFSETS: (start,end - pair of integers - Annotation Graph offset into
         sgm file, corresponding to the INPUT STRING)
TOKENS: (start-end - two indices indicating the tree tokens in the
        corresponding pos/after/FILE.txt file that correspond to this
        source token)
STATUS: (the status of this solution, with respect to SAMA)
LEMMA: (the lemma associated with this source token and solution in SAMA)
UNSPLITVOC: (the vocalized form (not separated into segments) of the
            source token solution, from SAMA)
POS: (pos for this source token)
VOC: (vocalization for this source token)
GLOSS: (gloss for this source token)
-----------------------------------------------------------

The POS, VOC, and GLOSS fields are redundant with the respective values of
the corresponding tree tokens.
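
As an illustration of how the pos/before fields listed above might be read,
here is a minimal Python sketch. It is not part of the release tooling. It
assumes that each field sits on its own line in the form "NAME: value" and
that every record begins with the INPUT STRING field; the exact line layout
of the files should be checked against the data. The function names and the
abbreviated sample record are hypothetical (the sample reuses values from
the example in section 3b, with TOKENS counted from 1).

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: every record starts with this field name.
FIRST_FIELD = "INPUT STRING"


def parse_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group 'NAME: value' lines into one dict per token record."""
    records: List[Dict[str, str]] = []
    current = None
    for raw in lines:
        line = raw.strip()
        if ":" not in line:
            continue  # blank lines and '-----' separators carry no field
        # Split on the first colon only; values such as POS tags may
        # themselves contain colons.
        name, _, value = line.partition(":")
        name = name.strip()
        if name == FIRST_FIELD:
            current = {}
            records.append(current)
        if current is not None:
            current[name] = value.strip()
    return records


def parse_pair(text: str, sep: str) -> Tuple[int, int]:
    """Parse 'start,end' (OFFSETS) or 'start-end' (TOKENS) into integers."""
    start, end = text.split(sep)
    return int(start), int(end)


# Hypothetical, abbreviated record for illustration only.
sample = """\
INPUT STRING: واوضح
IS_TRANS: wAwDH
OFFSETS: 1,7
TOKENS: 1-2
STATUS: 1
LEMMA: [>awoDaH_1]
UNSPLITVOC: (wa>awoDaHa)
""".splitlines()

records = parse_records(sample)
offsets = parse_pair(records[0]["OFFSETS"], ",")
tokens = parse_pair(records[0]["TOKENS"], "-")
```

With this sample, offsets holds the character span of the source token and
tokens holds the first and last tree token index in the matching pos/after
file.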

data/xml/treebank/FILE.xml

This Annotation Graph file consists of the "tree token" annotations.

data/pos/after/FILE.txt

Information about each tree token in the corresponding xml/treebank
FILE.xml file. Each token contains the following information:

-----------------------------------------------------------
INPUT STRING: (utf-8 characters from .sgm file)
IS_TRANS: (Buckwalter transliteration of previous)
COMMENT: (annotator comment about word)
INDEX: (automatically assigned index, based on paragraph&word#)
OFFSETS: (start,end - pair of integers - Annotation Graph offset into
         sgm file, corresponding to the INPUT STRING)
UNVOCALIZED: (the unvocalized form of the word)
VOCALIZED: (the vocalized form of the word, taken from the solution)
POS: (the pos tag, taken from the solution)
GLOSS: (the gloss, taken from the solution)
-----------------------------------------------------------

For more information about the derivation of the INPUT STRING and
UNVOCALIZED fields for clitic separated tree tokens, see the paper
"Consistent and Flexible Integration of Morphological Annotation in the
Arabic Treebank," mentioned at the beginning of this file.

data/penntree/without-vowel/FILE.tree

Penn Treebanking style output, generated from the xml/treebank .xml file.
Each terminal is of the form (pos word), where pos and word correspond to
the POS and UNVOCALIZED values for the corresponding token in
pos/after/FILE.txt, respectively.

data/penntree/with-vowel/FILE.tree

Penn Treebanking style output, generated from the xml/treebank .xml file.
Each terminal is of the form (pos word), where pos and word correspond to
the POS and VOCALIZED values for the corresponding token in
pos/after/FILE.txt, respectively.

data/integrated/FILE.txt

See section 3b for a description of the integrated format.

(Note: The file formats in prior releases were somewhat different. Due to
the nature of the ongoing improvements, the notes here refer only to this
release.
For similar information for previous releases, please see the corresponding
documentation in those releases.)

============================================================================
3. Additional Information
============================================================================

============================================================================
3a. data/xml/pos/FILE.xml not included
============================================================================

The data/xml/pos/FILE.xml file, if included, would contain the alternatives
from the morphological analyzer at the time the analysis was originally
done, which are sometimes then modified later in the annotation process. As
a result, the POS information in the pos-level .xml files is not
necessarily the same as in the treebank-level .xml files. To avoid
confusion we therefore do not release the pos-level .xml files. Instead,
the data/pos/before .txt files and the integrated files contain the
information regarding the source token and, as of this release, further
information on the token's analysis in the treebank and its relation to
SAMA.

============================================================================
3b. Description of the "Integrated" format
============================================================================

The goal of this format is to bring together in one place:

1) the information about the source tokens from the pos/before files,
   including the explicit mapping between the source and tree tokens
2) the information about the tree tokens from the pos/after files
3) the tree structure

The basic format of each file is:

FILEPREFIX:
file metadata (beginning with ;;)
CHUNK: filename:chunk#
chunk metadata (beginning with ;;)
#source tokens: #
#tree tokens: #
#trees: 1
listing of source tokens
listing of tree tokens
TREE: filename:chunk#:tree#:#_tokens_in_tree
tree with W# instead of the tree tokens

with the CHUNK,TREE sections repeated.
(The file and chunk metadata is taken directly from the Annotation Graph
file and can be ignored or used as the user likes.)

Each CHUNK corresponds to one "Paragraph" in the usual release terminology.
(It is possible that in some versions of the treebank more than one tree
may be associated with a paragraph, which is why there is a slot for
"# trees" after the # of tree tokens, and in the TREE: line. In this
release it is always 1 tree per chunk, however, with TOP wrapped around
it.)

Each source token row consists of the following 11 items, separated by the
character U+00B7:

1) s:# - the source token #
2) the source token text in utf8, corresponding to the source text
3) the source token text, in Buckwalter transliteration
4) the starting offset
5) the ending offset
6) the start index of the corresponding tree token(s)
7) the end index of the corresponding tree token(s)
8) the status of this token with respect to SAMA, as discussed in
   Section 1 above
9) the lemma for this source token
10) the unsplit vocalization for this source token
11) a status indicating whether this source token is mapped to
    corresponding tree tokens (All "OK")

These fields correspond to the information in the pos/before files as
follows:

1) <-> INDEX (here counting from 0, in the pos/before file counting from 1)
2) <-> INPUT STRING
3) <-> IS_TRANS
4,5) <-> OFFSETS
6,7) <-> TOKENS (here counting from 0, in the pos/before file counting
     from 1)
8) <-> STATUS
9) <-> LEMMA
10) <-> UNSPLITVOC

Each tree token row consists of the following 11 items:

1) t:# - the tree token #
2) POS tag
3) "f" or "t" - a boolean indicating whether this token was split from the
   previous tree token
4) "f" or "t" - a boolean indicating whether this token was split from the
   following tree token
5) vocalized form
6) gloss
7) offset start
8) offset end
9) text in utf8, corresponding to the source .sgm text
10) unvocalized form
11) comment

These fields correspond to the information in the pos/after files
as follows:

1) <-> INDEX (here counting from 0, in the pos/after file counting from 1)
2) <-> POS
3,4,5) <-> VOCALIZED (with the separate boolean split information as
       hyphens on the VOCALIZED form)
6) <-> GLOSS
7,8) <-> OFFSETS
9) <-> INPUT STRING
10) <-> UNVOCALIZED
11) <-> COMMENT

For example if a file has a chunk with:

#source tokens:32
#tree tokens:35
#trees:1
s:0 ·واوضح·wAwDH·1·7·0·1·1·[>awoDaH_1]·(wa>awoDaHa)·OK
[...]
t:0 ·CONJ·f·t·wa·and·1·2·و·w·[Separated]
t:1 ·PV+PVSUFF_SUBJ:3MS·t·f·>awoDaH+a·clarify/explain/indicate + he/it [verb]·2·7·اوضح·>wDH·[]
[...]
TREE:20000715_AFP_ARB.0009:2:1:35
(TOP (S W0 (VP W1 (NP-SBJ W2) (SBAR W3 (S (NP-TPC-2 (NP W4) (SBAR (WHNP-1 W5) (S (VP W6 (NP-SBJ-1 (-NONE- *T*)) (PP-DIR W7 (NP W8)) (NP-TMP (NP W9 (NP W10)) (NP W11)) (S-ADV (VP W12 (NP-SBJ (-NONE- *)) (PP-DIR W13 (NP W14)))))))) (VP (PRT W15) W16 (NP-SBJ-2 (-NONE- *T*)) (NP-ADV W17 (NP (NP (NP W18 (NP W19)) (ADJP W20)) (NP W21 W22 W23 W24))) (PP W25 (SBAR W26 (S (VP W27 (NP-OBJ W28) (NP-SBJ W29 W30 W31) (NP-TMP W32 (NP W33)))))))))) W34))

This indicates that:

1) The source token #0 maps to tree tokens #0 and #1. The source token
   text is wAwDH, and the two corresponding tokens are wa/CONJ and
   >awoDaH+a/PV+PVSUFF_SUBJ:3MS. They are represented in the VOCALIZED
   field in the pos/after file with wa- and ->awoDaH+a, whereas here the
   hyphen is indicated by the f/t and t/f values in t:0 and t:1. (Of
   course this information is also redundant with the mapping from the
   source token.)

2) The source token offset <1,7> for the source token wAwDH has been
   partitioned into <1,2> for wa and <2,7> for >awoDaH+a.

3) The leaves W0 and W1 in the tree correspond to the tree tokens wa- and
   ->awoDaH+a.

============================================================================
3c. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than one
complete tree.
The reason for this is that the annotators work on one "Paragraph" (CHUNK)
at a time - e.g., a tree with the root "Paragraph" node, as can be seen by
looking at the "treebanking" feature in the xml files. When the trees are
generated, the "Paragraph" node is dropped. If the form of the annotation
was (Paragraph S1 S2), where S1 and S2 are both complete trees, then they
will appear on one line. This is also true for the integrated format,
except that in that format "TOP" appears instead of "Paragraph."

============================================================================
3d. Metadata characters left out
============================================================================

There are 8028 cases of sequences of characters in the source text that
have not been included as tokens. These are generally metadata included in
the source files. For example, in
ABUDHABI_ABUDHNEWS_ARB_20070111_115801.qrtr:76, paragraph #76, the text at
(53,64) is not included in the annotated tokens.
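
Returning to the integrated format of section 3b, the following Python
sketch shows one way the source token and tree token rows might be split
into named fields, and how the W# leaves of a tree could be filled in from
the tree tokens. It is an illustration only and not part of the release
tooling. The dictionary key names and function names are hypothetical, the
short tree is a made-up reduction of the example in section 3b, and the
(pos word) terminal form simply follows the convention described for the
penntree/without-vowel files.

```python
import re
from typing import Dict, List

SEP = "\u00b7"  # the U+00B7 field separator described in section 3b

# Hypothetical key names for the 11 items of each row type.
SOURCE_FIELDS = ["id", "text", "translit", "offset_start", "offset_end",
                 "tree_start", "tree_end", "status", "lemma",
                 "unsplit_voc", "map_status"]
TREE_FIELDS = ["id", "pos", "split_from_prev", "split_from_next",
               "vocalized", "gloss", "offset_start", "offset_end",
               "text", "unvocalized", "comment"]


def parse_row(row: str, names: List[str]) -> Dict[str, str]:
    """Split one s:# or t:# row on U+00B7 and name its items."""
    parts = [part.strip() for part in row.split(SEP)]
    if len(parts) != len(names):
        raise ValueError(f"expected {len(names)} items, got {len(parts)}")
    return dict(zip(names, parts))


def fill_leaves(tree: str, tokens: List[Dict[str, str]]) -> str:
    """Replace each W# leaf with a (pos word) terminal, using UNVOCALIZED."""
    def terminal(match):
        token = tokens[int(match.group(1))]
        return f"({token['pos']} {token['unvocalized']})"
    # W followed only by digits; labels such as WHNP-1 are not matched.
    return re.sub(r"\bW(\d+)\b", terminal, tree)


# Rows copied from the example in section 3b.
source_row = "s:0 ·واوضح·wAwDH·1·7·0·1·1·[>awoDaH_1]·(wa>awoDaHa)·OK"
tree_rows = [
    "t:0 ·CONJ·f·t·wa·and·1·2·و·w·[Separated]",
    "t:1 ·PV+PVSUFF_SUBJ:3MS·t·f·>awoDaH+a·"
    "clarify/explain/indicate + he/it [verb]·2·7·اوضح·>wDH·[]",
]

source = parse_row(source_row, SOURCE_FIELDS)
tokens = [parse_row(row, TREE_FIELDS) for row in tree_rows]

# Made-up two-leaf tree, only to show the substitution.
filled = fill_leaves("(TOP (S W0 (VP W1)))", tokens)
```

The length check in parse_row is a deliberate choice: a row whose gloss or
comment happened to contain U+00B7 would otherwise be silently misaligned.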