This release follows earlier treebank releases in that we include information indicating the relationship with the morphological analyzer. However, due to the nature of this corpus, this relationship is now more complicated, containing references to both the SAMA 3.1 Morphological Analyzer (LDC2010L01) for the MSA tokens and the CALIMA v0.5 Morphological Analyzer for the ARZ tokens. This is discussed in more detail below.

Briefly, "source tokens" are the whitespace/punctuation-delimited tokens (offset annotation) on the source text that receive a morphological analysis through the SAMA analyzer. The "tree tokens" result from splitting these source tokens into subsequences as appropriate for the annotation of syntactic structure. In this release, there are 400,448 source tokens and 508,548 tree tokens.

This terminology, along with information about the UNVOCALIZED and INPUT STRING forms, is discussed further in this paper, included in this release:

    Consistent and Flexible Integration of Morphological Annotation
    in the Arabic Treebank
    Seth Kulick, Ann Bies, Mohamed Maamouri
    LREC 2010
    http://papers.ldc.upenn.edu/LREC2010/KulickBiesMaamouri-LREC2010.pdf

The annotation in this release was done simultaneously with development of the morphological analyzer. Therefore some inconsistencies inevitably exist in the data between the part-of-speech/vocalization/lemma solutions and the morphological analyzer solutions. These will be reconciled in a future release. This also affects the generation of the INPUT STRING forms for the tree tokens, which will be upgraded in future releases.

==================================================================
Contents:

1.  .su.xml files converted to tdf files for annotation
2.  Solution status field
3.  Synchronization with CALIMA
4.  File extensions and directory structure
5.  Description of the "Integrated" format
6.  Additional Information
6a. data/xml/pos/FILE.xml not included
6b. A note about multiple trees on one line
6c. Non-ASCII punctuation characters
6d. Characters not included in tokens
==================================================================

==================================================================
1. .su.xml files converted to tdf files for annotation
==================================================================

The source files used for this corpus are the .su.xml files included in the data/su_xml directory. For purposes of POS annotation, these were converted to the .tdf files included in the data/tdf directory. The conversion used a simple script that searched for the relevant tags and captured two pieces of data: the "id" attribute and the text content of the element. The "id" attribute was placed in the "file" column of the TDF file, and the text content was placed in the "transcript" column. Other columns of the TDF were filled with predictable data. Leading whitespace in the text from su_xml was stripped for inclusion in the tdf file.

For convenience, we include a file token-mapping.txt that makes explicit the linkage between the annotation files, the .tdf files, and the .su.xml files. There are 8 columns:

1) The location of the token, of the form filename:paragraph#:token#.
   This corresponds to the pos/before file.
2) Token offsets, again as in the pos/before file.
3) The source text as included in the INPUT STRING field in the
   pos/before file.
4) The text taken directly from the .tdf file.
5) The id of this paragraph in the su_xml file.
6) The corresponding offsets of this token in the su_xml file.
7) The text taken directly from the su_xml file.
8) A potentially empty column, which will contain the text "file/tdf
   difference" if the text in the INPUT STRING field of the
   pos/before file differs from the text in the tdf file.

In most cases the offsets are the same in columns 2 and 6, but they may differ if leading whitespace was trimmed before being used in the .tdf file.
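As an illustration, the mapping file can be read with a few lines of Python. This is only a sketch, and it assumes the 8 columns are tab-separated; the dictionary keys are informal labels, not names used by the release:

```python
# Sketch: read token-mapping.txt, assuming one token per line with the
# 8 columns described above separated by tabs. The keys below are
# informal labels for the columns, not names defined by the release.
def read_token_mapping(path="token-mapping.txt"):
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 8:
                continue  # skip anything that is not a mapping row
            rows.append({
                "location":      cols[0],  # filename:paragraph#:token#
                "pos_offsets":   cols[1],  # offsets as in the pos/before file
                "pos_text":      cols[2],  # INPUT STRING from the pos/before file
                "tdf_text":      cols[3],  # text from the .tdf file
                "suxml_id":      cols[4],  # paragraph id in the su_xml file
                "suxml_offsets": cols[5],  # offsets in the su_xml file
                "suxml_text":    cols[6],  # text from the su_xml file
                "difference":    cols[7],  # "file/tdf difference" or empty
            })
    return rows
```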
There are a few cases in which column 8 does indicate a file/tdf difference. These concern the Arabic question mark, Arabic comma, and tatweel characters, which have traditionally been mapped to the corresponding ASCII characters in ATB releases.

==================================================================
2. Solution status field
==================================================================

Each file in pos/before has a set of entries for each source token. One entry is the STATUS field. This consists of 6 fields of information:

has_sol:    T if the token has a solution, F if it doesn't (i.e., if
            the POS tag is NO_FUNC). If has_sol=F, then all the
            following fields are set to . (period), except for orig,
            which remains ARZ, although it can be ignored in this
            case.

excluded:   T if the token is not checked for a matching solution in
            the tables of either analyzer; F if it is checked. If
            excluded=T, then all the following fields are set to .
            (period), except for orig, which remains ARZ, although it
            can be ignored in this case.

orig:       ARZ for Egyptian Arabic, or MSA for Modern Standard
            Arabic.

sama:       T if the token's annotation exactly matches a solution in
            the SAMA analyzer, and F otherwise. Here "exactly
            matches" means that there is a SAMA solution with
            matching POS, VOC, LEMMA, and UNSPLIT_VOC (GLOSS is not
            checked).

calima_all: T if the token's annotation exactly matches a solution in
            the CALIMA analyzer, and F otherwise. Here "exactly
            matches" means that there is a CALIMA solution with
            matching POS, VOC, LEMMA, and UNSPLIT_VOC (GLOSS is not
            checked).

calima_pv:  Three possible values. If calima_all=T, then calima_pv=.
            (period). Otherwise, calima_pv=T if the token's
            annotation matches a solution in the CALIMA analyzer for
            POS and VOC (i.e., LEMMA, UNSPLIT_VOC, and GLOSS are not
            checked), and calima_pv=F if not.
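A STATUS line such as "STATUS: has_sol=T excluded=F orig=ARZ sama=F calima_all=T calima_pv=." can be turned into a dictionary and classified with a short sketch like the following (the classification labels are informal, chosen here for illustration):

```python
# Sketch: parse the STATUS line of a pos/before entry into a dict of
# the 6 fields described above, and classify the token coarsely.
def parse_status(line):
    line = line.replace("STATUS:", "", 1).strip()
    return dict(field.split("=", 1) for field in line.split())

def match_level(status):
    """Informal classification of a token's match status."""
    if status["has_sol"] == "F":
        return "no solution"
    if status["excluded"] == "T":
        return "excluded"
    if status["orig"] == "MSA":
        return "sama match" if status["sama"] == "T" else "sama mismatch"
    if status["calima_all"] == "T":
        return "calima full match"
    return ("calima pos+voc match" if status["calima_pv"] == "T"
            else "calima mismatch")
```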
The tokens overall are characterized as follows:

no solution      5199   (has_sol=F)
excluded        40294   (has_sol=T excluded=T)
MSA              9682   (has_sol=T excluded=F orig=MSA)
ARZ            345273   (has_sol=T excluded=F orig=ARZ)
               ------
total          400448

The excluded tokens have POS tags such as PUNC, TYPO, and NOUN_NUM or ADJ_NUM when the latter two consist entirely of digits, etc.

The MSA tokens are categorized as follows:

sama=T           9164   (has_sol=T excluded=F orig=MSA sama=T)
sama=F            518   (has_sol=T excluded=F orig=MSA sama=F)
                 ----
                 9682

The ARZ tokens are categorized as follows:

calima_all=T   294641   (has_sol=T excluded=F orig=ARZ calima_all=T)
calima_pv=T      4663   (has_sol=T excluded=F orig=ARZ calima_all=F calima_pv=T)
calima_pv=F     45969   (has_sol=T excluded=F orig=ARZ calima_all=F calima_pv=F)
               ------
               345273

Therefore 85.3% of the ARZ source tokens (294641/345273) are a complete match with CALIMA.

============================================================================
3. Synchronization with CALIMA
============================================================================

The treebank was developed together with the CALIMA analyzer. Therefore some annotations used for the treebank differed from later CALIMA representations of the same solution. We have implemented a procedure to automatically correct such cases, to maximize the synchronization of the treebank annotations and the CALIMA solutions.

For each source token in the treebank that is classified as ARZ, and for which the solution was not a complete match with CALIMA (calima_all=F), we did the following:

1. If the POS, VOC, and LEMMA fields all match a solution in CALIMA
   for that word, and there is only one such matching solution, then
   use the UNSPLIT_VOC from that solution. That is, the annotation
   matched a CALIMA solution except for the UNSPLIT_VOC, and the POS,
   VOC, and LEMMA unambiguously determined what the UNSPLIT_VOC
   should be to make it a complete match with CALIMA.

2. If the POS and VOC fields match a solution in CALIMA for that
   word, and there is only one such matching solution, then use the
   LEMMA (and UNSPLIT_VOC if different) from that solution.

3. If the POS and LEMMA fields match a solution in CALIMA for that
   word, and there is only one such matching solution, then use the
   VOC (and UNSPLIT_VOC if different) from that solution.

4. If the POS field matches a solution in CALIMA for that word, and
   there is only one such matching solution, then use the VOC and
   LEMMA (and UNSPLIT_VOC if different) from that solution.

This resulted in modifications to 73,752 tokens, increasing the synchronization from 64.0% to 85.3%. We cannot guarantee that every such change was appropriate, but on balance it seemed far more desirable to increase the synchronization, given the issues caused by the overlap of analyzer development and treebank annotation.

============================================================================
4. File extensions and directory structure
============================================================================

Each FILE in docs/file.ids has a corresponding file in the following directories.

data/su_xml/FILE.su.xml
    (utf-8) su.xml file (see Section 1)

data/tdf/FILE.tdf
    (utf-8) tdf file (see Section 1)

data/pos/before/FILE.txt
    (utf-8) Information about the "source tokens" used for analysis
    with SAMA, i.e., a listing of each token before clitic
    separation.
Each token contains the following information:

-----------------------------------------------------------
INPUT STRING: (utf-8 characters from the .tdf file)
IS_TRANS:     (Buckwalter transliteration of the previous field,
               used as input to CALIMA and SAMA)
INDEX:        (automatically assigned index, based on paragraph# and
               word#)
OFFSETS:      (start-end - pair of integers - offset into the tdf
               file, corresponding to the INPUT STRING)
TOKENS:       (start-end - two indices indicating the tree tokens in
               the corresponding pos/after/FILE.txt file that
               correspond to this source token)
STATUS:       (the status of this solution with respect to the
               analyzers, as discussed in Section 2)
COMMENT:      (a comment associated with this source token)
LEMMA:        (the lemma associated with this source token and
               solution, from CALIMA or SAMA)
UNSPLITVOC:   (the vocalized form (not separated into segments) of
               the source token solution, from CALIMA or SAMA)
POS:          (POS for this source token)
VOC:          (vocalization for this source token)
GLOSS:        (gloss for this source token)
-----------------------------------------------------------

The POS, VOC, and GLOSS fields are redundant with the respective values of the corresponding tree tokens.

data/xml/treebank/FILE.xml
    The Annotation Graph .xml file for the tree tokens.

data/pos/after/FILE.txt
    Information about each tree token in the corresponding
    xml/treebank FILE.xml file.
Each token contains the following information:

-----------------------------------------------------------
INPUT STRING: (utf-8 characters from the .tdf file)
IS_TRANS:     (Buckwalter transliteration of the previous field)
COMMENT:      (annotator comment about the word)
INDEX:        (automatically assigned index, based on paragraph# and
               word#)
OFFSETS:      (start,end - pair of integers - Annotation Graph offset
               into the tdf file, corresponding to the INPUT STRING)
UNVOCALIZED:  (the unvocalized form of the word)
VOCALIZED:    (the vocalized form of the word, taken from the
               solution)
POS:          (the pos tag, taken from the solution)
GLOSS:        (the gloss, taken from the solution)
-----------------------------------------------------------

For more information about the derivation of the INPUT STRING and UNVOCALIZED fields for clitic-separated tree tokens, see the paper "Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank", mentioned at the beginning of this readme.

data/penntree/without-vowel/FILE.tree
    Penn Treebanking style output, generated from the integrated
    file. Each terminal is of the form (pos word), where pos and word
    correspond to the POS and UNVOCALIZED values for the
    corresponding token in pos/after/FILE.txt.

data/penntree/with-vowel/FILE.tree
    Penn Treebanking style output, generated from the integrated
    file. Each terminal is of the form (pos word), where pos and word
    correspond to the POS and VOCALIZED values for the corresponding
    token in pos/after/FILE.txt.

data/integrated/FILE.txt
    See Section 5 for a description of the integrated format.

============================================================================
5. Description of the "Integrated" format
============================================================================

The goal of this format is to bring together in one place:

1) the information about the source tokens from the pos/before
   files, including the explicit mapping between the source and tree
   tokens,
2) the information about the tree tokens from the pos/after files,
   and

3) the tree structure.

The basic format of each file is:

    FILEPREFIX:
    file metadata (beginning with ;;)
    CHUNK: filename:chunk#
    chunk metadata (beginning with ;;)
    #source tokens: #
    #tree tokens: #
    #trees: 1
    listing of source tokens
    listing of tree tokens
    TREE: filename:chunk#:tree#:#_tokens_in_tree
    tree with W# instead of the tree tokens

with the CHUNK and TREE sections repeated. (The file and chunk metadata is taken directly from the Annotation Graph file and can be ignored or used as the user likes.)

Each CHUNK corresponds to one "Paragraph" in the usual release terminology. (It is possible that in some versions of the treebank more than one tree may be associated with a paragraph, which is why there is a slot for the number of trees after the number of source tokens, and in the TREE: line. In this release, however, there is always 1 tree per chunk, with TOP wrapped around it.)

Each source token row consists of the following 17 items, separated by the character U+00B7:

1)  s:# - the source token #
2)  the source token text in utf8, corresponding to the source text
3)  the source token text, in Buckwalter transliteration
4)  the starting offset
5)  the ending offset
6)  the start index of the corresponding tree token(s)
7)  the end index of the corresponding tree token(s)
8)  a field reserved for future use, which should be ignored
9)  a field reserved for future use, which should be ignored
10) a field reserved for future use, which should be ignored
11) a field reserved for future use, which should be ignored
12) a field reserved for future use, which should be ignored
13) an annotator comment for this source token
14) the status of this token with respect to the analyzers, as
    discussed in Section 2 above.
    This is encoded as a 6-character string:
      character 1 is "has_sol"
      character 2 is "excluded"
      character 3 is "orig" (A for ARZ, M for MSA)
      character 4 is "sama"
      character 5 is "calima_all"
      character 6 is "calima_pv"
15) the lemma for this source token
16) the unsplit vocalization for this source token
17) a status indicating whether this source token is mapped to
    corresponding tree tokens (all "OK")

These fields correspond to the information in the pos/before files as follows:

1)    <-> INDEX (here counting from 0, in the pos/before file
          counting from 1)
2)    <-> INPUT STRING
3)    <-> IS_TRANS
4,5)  <-> OFFSETS
6,7)  <-> TOKENS (here counting from 0, in the pos/before file
          counting from 1)
14)   <-> STATUS
15)   <-> LEMMA
16)   <-> UNSPLITVOC

Each tree token row consists of the following 11 items:

1)  t:# - the tree token #
2)  POS tag
3)  "f" or "t" - a boolean indicating whether this token was split
    from the previous tree token
4)  "f" or "t" - a boolean indicating whether this token was split
    from the following tree token
5)  vocalized form
6)  gloss
7)  offset start
8)  offset end
9)  text in utf8, corresponding to the source .tdf text
10) unvocalized form
11) comment

These fields correspond to the information in the pos/after files as follows:

1)      <-> INDEX (here counting from 0, in the pos/after file
            counting from 1)
2)      <-> POS
3,4,5)  <-> VOCALIZED (with the separate boolean split information
            represented as hyphens on the VOCALIZED form)
6)      <-> GLOSS
7,8)    <-> OFFSETS
9)      <-> INPUT STRING
10)     <-> UNVOCALIZED
11)     <-> COMMENT

============================================================================
6. Additional Information
============================================================================

============================================================================
6a. data/xml/pos/FILE.xml not included
============================================================================

As in recent releases, we do not include the Annotation Graph .xml file for the source token file.
============================================================================
6b. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than one complete tree. The reason is that the annotators work on one "Paragraph" (CHUNK) at a time - i.e., a tree with the root "Paragraph" node, as can be seen by looking at the "treebanking" feature in the xml files. When the trees are generated, the "Paragraph" node is dropped. If the form of the annotation was (Paragraph S1 S2), where S1 and S2 are both complete trees, then they will appear on one line. This is also true for the integrated format, except that in that format "TOP" appears instead of "Paragraph".

============================================================================
6c. Non-ASCII punctuation characters
============================================================================

It is possible for the source text to contain non-ASCII characters outside the range of the Buckwalter transliteration. In these cases, we have indicated the Unicode value in the IS_TRANS field in the POS file. For example, in bolt-arz-DF-175-182185-10963606.arz.su.txt:

INPUT STRING: ♦
IS_TRANS: ♦
INDEX: P104W5
OFFSETS: 22-23
TOKENS: P104W6-P104W6
STATUS: has_sol=T excluded=T orig=ARZ sama=. calima_all=. calima_pv=.
COMMENT: []
LEMMA: [DEFAULT]
UNSPLITVOC: (♦)
POS: PUNC
VOC: ?
GLOSS: ?

============================================================================
6d. Characters not included in tokens
============================================================================

There are 3102 cases of sequences of characters in the source text that have not been included as tokens. The full listing of these 3102 cases is in the file not-included.txt.
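As an illustration of the integrated format (data/integrated/FILE.txt), the following sketch splits one source-token row on the U+00B7 separator. The dictionary keys are informal labels for the 17 positions listed in the format description, not names used by the release itself:

```python
# Sketch: split one source-token row of the integrated format on the
# U+00B7 separator. The keys below are informal labels for the 17
# positions described in the integrated-format section.
SEP = "\u00b7"

def parse_source_token_row(row):
    fields = row.split(SEP)
    if not fields[0].startswith("s:") or len(fields) != 17:
        raise ValueError("not a source-token row")
    # item 14: 6-char string - has_sol, excluded, orig, sama,
    # calima_all, calima_pv
    status = fields[13]
    return {
        "index": int(fields[0][2:]),                      # item 1, counting from 0
        "utf8_text": fields[1],                           # item 2
        "buckwalter": fields[2],                          # item 3
        "offsets": (int(fields[3]), int(fields[4])),      # items 4-5
        "tree_tokens": (int(fields[5]), int(fields[6])),  # items 6-7
        "comment": fields[12],                            # item 13
        "status": status,                                 # item 14
        "orig": {"A": "ARZ", "M": "MSA"}.get(status[2], status[2]),
        "lemma": fields[14],                              # item 15
        "unsplit_voc": fields[15],                        # item 16
        "mapped": fields[16],                             # item 17
    }
```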