This release follows earlier treebank releases in that we include
information indicating the relationship with the morphological analyzer.
However, due to the nature of this corpus, this relationship is now more
complicated, containing references to both the SAMA 3.1 Morphological
Analyzer (LDC2010L01) for the MSA tokens and the CALIMA v0.5
Morphological Analyzer for the ARZ tokens.  This is discussed in more
detail below.  For more information on CALIMA, see:

  Nizar Habash, Ramy Eskander, and Abdelati Hawwari
  "A Morphological Analyzer for Egyptian Arabic"
  Special Interest Group on Computational Morphology and Phonology, 2012

Briefly, "source tokens" are the whitespace/punctuation-delimited tokens
(offset annotation) on the source text that receive a morphological
analysis through the SAMA analyzer.  The "tree tokens" result from
splitting these source tokens into subsequences as appropriate for the
annotation of syntactic structure.  In this release, there are 153,171
source tokens and 182,965 tree tokens.  This terminology, along with
information about the UNVOCALIZED and INPUT STRING forms, is discussed
further in this paper, included in this release:

  "Consistent and Flexible Integration of Morphological Annotation in
   the Arabic Treebank"
  Seth Kulick, Ann Bies, Mohamed Maamouri
  LREC 2010
  http://papers.ldc.upenn.edu/LREC2010/KulickBiesMaamouri-LREC2010.pdf

The annotation in this release was done simultaneously with development
of the morphological analyzer.  Therefore, some inconsistencies
inevitably exist in the data between the part-of-speech/vocalization/
lemma solutions and the morphological analyzer solutions.

==================================================================
Contents:

1. su.xml files converted to tdf files for annotation
2. Solution status field
3. File extensions and directory structure
4. Description of the "Integrated" format
5. Additional Information
   5a. data/xml/pos/FILE.xml not included
   5b. A note about multiple trees on one line
   5c.
Non-ASCII punctuation characters
==================================================================

==================================================================
1. su.xml files converted to tdf files for annotation
==================================================================

The source files used for this corpus are the xml files included in the
data/su_xml directory.  For purposes of POS annotation, these were
converted to the .tdf files included in the data/tdf directory.  The
conversion used a simple script that searched for tags and captured two
pieces of data: the "id" attribute and the text content of the element.
The "id" value was placed in the "file" column of the TDF file, and the
text content was placed in the "transcript" column.  The other columns
of the TDF were filled with predictable data.  Leading whitespace in the
text from su.xml was stripped for inclusion in the tdf file.

==================================================================
2. Solution status field
==================================================================

Each file in pos/before has a set of entries for each source token.  One
entry is the STATUS field.  This consists of 6 fields of information:

has_sol:    T if the token has a solution, F if it doesn't (i.e., if the
            POS tag is NO_FUNC).  If has_sol=F, then all the following
            fields are set to . (period), except for orig, which remains
            ARZ, although it can be ignored in this case.

excluded:   T if the token is not checked for a matching solution in the
            tables for either analyzer; F if it is checked.  If
            excluded=T, then all the following fields are set to .
            (period), except for orig, which remains ARZ, although it
            can be ignored in this case.

orig:       ARZ for Egyptian Arabic, or MSA for Modern Standard Arabic.

sama:       T if the token's annotation exactly matches a solution in
            the SAMA analyzer, and F otherwise.
            Here "exactly matches" means that there is a SAMA solution
            with matching POS, VOC, LEMMA, and UNSPLIT_VOC (GLOSS is not
            checked).

calima_all: T if the token's annotation exactly matches a solution in
            the CALIMA analyzer, and F otherwise.  Here "exactly
            matches" means that there is a CALIMA solution with matching
            POS, VOC, LEMMA, and UNSPLIT_VOC (GLOSS is not checked).

calima_pv:  Three possible values.  If calima_all=T, then calima_pv=.
            (period).  Otherwise, calima_pv=T if the token's annotation
            matches a solution in the CALIMA analyzer for POS and VOC
            (i.e., LEMMA, UNSPLIT_VOC, and GLOSS are not checked), and
            calima_pv=F if not.

The tokens overall are characterized as follows:

no solution       582  (has_sol=F)
excluded        29638  (has_sol=T excluded=T)
MSA               424  (has_sol=T excluded=F orig=MSA)
ARZ            122527  (has_sol=T excluded=F orig=ARZ)
               ------
total # tokens 153171

The excluded tokens are those with POS tags such as PUNC and TYPO, as
well as NOUN_NUM and ADJ_NUM if the latter two consist of all digits,
etc.

The MSA tokens are categorized as follows:

sama=T            424  (has_sol=T excluded=F orig=MSA sama=T)
sama=F              0  (has_sol=T excluded=F orig=MSA sama=F)
                -----
                  424

The ARZ tokens are categorized as follows:

calima_all=T   100895  (has_sol=T excluded=F orig=ARZ calima_all=T)
calima_pv=T      7649  (has_sol=T excluded=F orig=ARZ calima_all=F calima_pv=T)
calima_pv=F      13983 (has_sol=T excluded=F orig=ARZ calima_all=F calima_pv=F)
               ------
               122527

============================================================================
3. File extensions and directory structure
============================================================================

Each FILE in docs/file.ids has a corresponding file in the following
directories.

data/su_xml/FILE.su.xml
    (utf-8) su.xml file (see Section 1)

data/tdf/FILE.tdf
    (utf-8) tdf file (see Section 1)

data/pos/before/FILE.txt
    (utf-8) Information about the "source tokens" used for analysis with
    SAMA.  So this is a listing of each token before clitic-separation.
    Each token contains the following information:

    -----------------------------------------------------------
    INPUT STRING: (utf-8 characters from .tdf file)
    IS_TRANS:     (Buckwalter transliteration of previous, used for
                   input to CALIMA and SAMA)
    INDEX:        (automatically assigned index, based on
                   paragraph&word#)
    OFFSETS:      (start-end - pair of integers - offset into tdf file,
                   corresponding to the INPUT STRING)
    TOKENS:       (start-end - two indices indicating the tree tokens in
                   the corresponding pos/after/FILE.txt file that
                   correspond to this source token)
    STATUS:       (the status of this solution with respect to the
                   analyzers, as discussed in Section 2)
    COMMENT:      (a comment associated with this source token)
    LEMMA:        (the lemma associated with this source token and
                   solution in CALIMA or SAMA)
    UNSPLITVOC:   (the vocalized form (not separated into segments) of
                   the source token solution, from CALIMA or SAMA)
    POS:          (POS for this source token)
    VOC:          (vocalization for this source token)
    GLOSS:        (gloss for this source token)
    -----------------------------------------------------------

    The POS, VOC, and GLOSS fields are redundant with the respective
    values of the corresponding tree tokens.

data/xml/treebank/FILE.xml
    The Annotation Graph .xml file for the tree tokens.

data/pos/after/FILE.txt
    Information about each tree token in the corresponding xml/treebank
    FILE.xml file.
    Each token contains the following information:

    -----------------------------------------------------------
    INPUT STRING: (utf-8 characters from .tdf file)
    IS_TRANS:     (Buckwalter transliteration of previous)
    COMMENT:      (annotator comment about word)
    INDEX:        (automatically assigned index, based on
                   paragraph&word#)
    OFFSETS:      (start,end - pair of integers - Annotation Graph
                   offset into tdf file, corresponding to the INPUT
                   STRING)
    UNVOCALIZED:  (the unvocalized form of the word)
    VOCALIZED:    (the vocalized form of the word, taken from the
                   solution)
    POS:          (the pos tag, taken from the solution)
    GLOSS:        (the gloss, taken from the solution)
    -----------------------------------------------------------

    For more information about the derivation of the INPUT STRING and
    UNVOCALIZED fields for clitic-separated tree tokens, see the paper
    "Consistent and Flexible Integration of Morphological Annotation in
    the Arabic Treebank", mentioned at the beginning of this readme.

data/penntree/without-vowel/FILE.tree
    Penn Treebanking style output, generated from the integrated file.
    Each terminal is of the form (pos word), where pos and word
    correspond to the POS and UNVOCALIZED values for the corresponding
    token in pos/after/FILE.txt, respectively.

data/penntree/with-vowel/FILE.tree
    Penn Treebanking style output, generated from the integrated file.
    Each terminal is of the form (pos word), where pos and word
    correspond to the POS and VOCALIZED values for the corresponding
    token in pos/after/FILE.txt, respectively.

data/integrated/FILE.txt
    See Section 4 for a description of the integrated format.

============================================================================
4. Description of the "Integrated" format
============================================================================

The goal of this format is to bring together in one place:

1) the information about the source tokens from the pos/before files,
   including the explicit mapping between the source and tree tokens
2) the information about the tree tokens from the pos/after files
3) the tree structure

The basic format of each file is:

    FILEPREFIX:
    file metadata (beginning with ;;)
    CHUNK: filename:chunk#
    chunk metadata (beginning with ;;)
    #source tokens: #
    #tree tokens: #
    #trees: 1
    listing of source tokens
    listing of tree tokens
    TREE: filename:chunk#:tree#:#_tokens_in_tree
    tree with W# instead of the tree tokens

with the CHUNK,TREE sections repeated.  (The file and chunk metadata is
taken directly from the Annotation Graph file and can be ignored or used
as the user likes.)

Each CHUNK corresponds to one "Paragraph" in the usual release
terminology.  (It is possible that in some versions of the treebank more
than one tree may be associated with a paragraph, which is why there is
a slot for "# trees" after the # of source tokens, and in the TREE:
line.  In this release it is always 1 tree per chunk, however, with TOP
wrapped around it.)

Each source token row consists of the following 17 items, separated by
the character U+00B7:

1)  s:# - the source token #
2)  the source token text in utf8, corresponding to the source text
3)  the source token text, in Buckwalter transliteration
4)  the starting offset
5)  the ending offset
6)  the start index of the corresponding tree token(s)
7)  the end index of the corresponding tree token(s)
8)  a field reserved for future use, which should be ignored
9)  a field reserved for future use, which should be ignored
10) a field reserved for future use, which should be ignored
11) a field reserved for future use, which should be ignored
12) a field reserved for future use, which should be ignored
13) an annotator comment for this source token
14) the status of this token with respect to the analyzers, as
    discussed in Section 2 above.
    This is encoded as a 6 character string:
        character 1 is "has_sol"
        character 2 is "excluded"
        character 3 is "orig" (A for ARZ, M for MSA)
        character 4 is "sama"
        character 5 is "calima_all"
        character 6 is "calima_pv"
15) the lemma for this source token
16) the unsplit vocalization for this source token
17) a status indicating whether this source token is mapped to
    corresponding tree tokens (all "OK")

These fields correspond to the information in the pos/before files as
follows:

1)   <-> INDEX (here counting from 0, in the pos/before file counting
         from 1)
2)   <-> INPUT STRING
3)   <-> IS_TRANS
4,5) <-> OFFSETS
6,7) <-> TOKENS (here counting from 0, in the pos/before file counting
         from 1)
14)  <-> STATUS
15)  <-> LEMMA
16)  <-> UNSPLITVOC

Each tree token row consists of the following 11 items:

1)  t:# - the tree token #
2)  POS tag
3)  "f" or "t" - a boolean indicating whether this token was split from
    the previous tree token
4)  "f" or "t" - a boolean indicating whether this token was split from
    the following tree token
5)  vocalized form
6)  gloss
7)  offset start
8)  offset end
9)  text in utf8, corresponding to the source .tdf text
10) unvocalized form
11) comment

These fields correspond to the information in the pos/after files as
follows:

1)     <-> INDEX (here counting from 0, in the pos/after file counting
           from 1)
2)     <-> POS
3,4,5) <-> VOCALIZED (with the separate boolean split information as
           hyphens on the VOCALIZED form)
6)     <-> GLOSS
7,8)   <-> OFFSETS
9)     <-> INPUT STRING
10)    <-> UNVOCALIZED
11)    <-> COMMENT

============================================================================
5. Additional Information
============================================================================

============================================================================
5a. data/xml/pos/FILE.xml not included
============================================================================

As in recent releases, we do not include the Annotation Graph .xml file
for the source token file.
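The integrated-format token rows described in Section 4 are
straightforward to pull apart programmatically.  The following is an
illustrative sketch only, not a tool included in this release; the
function names are our own, and it assumes exactly the field layout
documented in Section 4.

```python
# Illustrative sketch (not part of this release): split an
# integrated-format token row on the U+00B7 separator and decode the
# 6-character status string described in Sections 2 and 4.
SEP = "\u00b7"  # fields are separated by U+00B7

STATUS_KEYS = ["has_sol", "excluded", "orig", "sama",
               "calima_all", "calima_pv"]

def parse_status(status):
    # One character per field; "." means "not applicable".
    return dict(zip(STATUS_KEYS, status))

def parse_row(line):
    fields = line.rstrip("\n").split(SEP)
    # Source token rows start with "s:#", tree token rows with "t:#".
    row = {"kind": "source" if fields[0].startswith("s:") else "tree",
           "fields": fields}
    if row["kind"] == "source":
        # Item 14 (0-based index 13) is the status string.
        row["status"] = parse_status(fields[13])
    return row
```

For example, a status string of "TFA.T." would decode as has_sol=T,
excluded=F, orig=ARZ, sama not applicable, calima_all=T.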
============================================================================
5b. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than one
complete tree.  The reason for this is that the annotators work on one
"Paragraph" (CHUNK) at a time - e.g., a tree with the root "Paragraph"
node, as can be seen by looking at the "treebanking" feature in the xml
files.  When the trees are generated, the "Paragraph" node is dropped.
If the form of the annotation was (Paragraph S1 S2), where S1 and S2 are
both complete trees, then they will appear on one line.  This is also
true for the integrated format, except that in that format "TOP" appears
instead of "Paragraph".

============================================================================
5c. Non-ASCII punctuation characters
============================================================================

It is possible that the source text may contain non-ASCII punctuation
characters.  This can be somewhat problematic for the Buckwalter
transliteration, since there is no mapping for these characters.  In
these cases, we have indicated the Unicode value in the IS_TRANS field
in the POS file.  For example, in ar_4510_0107-0242.su.txt:

INPUT STRING: ÷
IS_TRANS:     ÷
INDEX:        P86W2
OFFSETS:      1-2
TOKENS:       P86W2-P86W2
STATUS:       has_sol=F excluded=. orig=ARZ sama=. calima_all=. calima_pv=.
COMMENT:      []
LEMMA:        None
UNSPLITVOC:   (÷)
POS:          NO_FUNC
VOC:          ?
GLOSS:        nogloss
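Such cases can be located by scanning IS_TRANS values for characters
outside the ASCII range, since the Buckwalter transliteration proper
uses only ASCII characters.  The following is a minimal sketch of our
own (not a release tool) built on that one assumption.

```python
# Sketch (our own, not a release tool): report characters in an
# IS_TRANS value that have no Buckwalter mapping.  We assume only that
# Buckwalter transliteration is ASCII, so any non-ASCII character is a
# passed-through source character, as in the Section 5c example.
def unmapped_chars(is_trans):
    """Return (char, codepoint) pairs for non-ASCII characters."""
    return [(ch, f"U+{ord(ch):04X}") for ch in is_trans if ord(ch) > 127]
```

For the example above, unmapped_chars("÷") reports the division sign as
U+00F7, while a fully transliterated token such as "ktb" reports
nothing.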