This release of this segment of the Arabic Treebank contains several improvements in the organization of the data and certain aspects of the annotation since the previous release of this segment. These changes are primarily: 1. Improvements have been made to the creation of the INPUT STRING tokens 2. Improvements have been made to the creation of the UNVOCALIZED tokens 3. An "integrated" file format has been included, bringing together in one place all of the information formerly spread out among different file formats and directories. This includes the tree structure information, the different forms of the tree tokens, and the relation between the tree tokens and source tokens. 4. A "Solution status" field is now included with each source token, making explicit the relation between each source token and the SAMA 3.1 Morphological Analyzer (LDC 2009E73). 5. The relation between a source token and its corresponding tree tokens is now made explicit, rather than only implicitly through offset information. Sections 1 and 2 describe these new changes, and the integrated format is described in detail in Section 4b. ================================================================== Contents: 1. Improvements to INPUT STRING and UNVOCALIZED tokens 1a. Background for changes 1b. Improvements to problematic INPUT STRING tokens 1c. Improvements to problematic UNVOCALIZED tokens 2. Additional information for morphological annotations 2a. "Solution status" field is now included 2b. Explicit mapping between source and tree tokens, and other changes in "before" pos file. 3. File Extensions and Directory Structure 4. Additional Information 4a. data/xml/pos/FILE.xml not included 4b. Description of the "Integrated" format 4c. A note about multiple trees on one line ======================================================= 1. 
Improvements to INPUT STRING and UNVOCALIZED tokens ======================================================= The improvements in this section were first initiated for ATB5, an additional segment of the Arabic Treebank that is being prepared for a public release in 2011. The specific file and token references in this section refer to data in ATB5, since this description of the improvements was originally written to refer to that data. However, the changes described here for that data also apply to the current release. Note that the work on ATB5, and hence this improved creation of the INPUT STRING and UNVOCALIZED values, was done between the previous release of ATB3 and this current release. ----------------------------------------------------------------- 1a. Background for changes. ----------------------------------------------------------------- There are two main parts to the treebank word-level tokenization. 1. The source text is broken up into roughly whitespace delimited tokens, henceforth called the "source tokens." These are the tokens that are run through the SAMA morphological analyzer, resulting in a vocalized form, and information on these tokens has traditionally been included (and still is) in the /data/pos/before-treebank directory. 2. These source tokens are split apart if appropriate during annotation (preposition prefixes, direct object suffixes, etc.). These tokens will henceforth be referred to as the "tree tokens," since these are the tokens actually used for treebanking. These tokens are traditionally included (and still are) in various formats in the data/xml/treebank, data/pos/after-treebank, and data/penntree/(with,without)-vowel directories. For all of the source tokens that receive solutions from SAMA, the treebank annotation takes place on the *vocalized* tree tokens, since those are the output of SAMA, sometimes split into separate tokens. 
The solution from SAMA is a sequence of segments, each including vocalization/POS/gloss information, and these segments are partitioned into one or more tree tokens that together correspond to the original source token. For example:

-------------------------------------------------------------
One source token:
  unvocalized - original text       e.g., yktbh
  vocalized   - solution from SAMA  e.g., [ya+kotub+u+hu, IV3MS+IV+IVSUFF_MOOD:I+IVSUFF_DO:3MS]
                                          he/it + write + [ind.] + it/him
Two corresponding tree token(s):
  vocalized   - vocalized source token potentially split
                e.g., [ya+kotub+u,IV3MS+IV+IVSUFF_MOOD:I] and [hu,IVSUFF_DO:3MS]
-------------------------------------------------------------

Note that there is no other level of annotation of the tree token involved in this process -- the annotated tree tokens are the vocalized tokens. Therefore, any type of unvocalized tree token that is released is derived from this annotation in some way. (The situation is different for the relatively infrequent tokens with solution status 2, as described in Section 2a.)

In principle, it could be left to users to experiment with the relation between the source token (what is actually present in the source file) and the vocalized tree tokens (the end result of the annotation). However, in all previous releases of the Arabic Treebank corpora, two other forms of the tree tokens were released as well:

1. One, which we will call here the INPUT STRING, was an attempt to split up the source token into substrings such that each substring corresponded to one of the vocalized tree tokens. For example, in the above example, the two tree tokens might have the INPUT STRINGs "yktb" and "h".

2. Also, an UNVOCALIZED form was included, which was a sort of a hybrid in earlier releases. For source tokens that were not split, the UNVOCALIZED form was identical to the INPUT STRING. For source tokens that were split, each UNVOCALIZED form was set to be simply the VOCALIZED form with diacritics removed.
This hybrid nature of the UNVOCALIZED form is discussed more in this paper: (and also in section 1c below) Mohamed Maamouri, Seth Kulick, Ann Bies Diacritic Annotation in the Arabic Treebank and Its Impact on Parser Evaluation; LREC 2008, Marrakech, Morocco, May 28-30, 2008 http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf While these derivative forms were supplied primarily for convenience, not as part of the annotation, we have endeavored in this release to fix all problems associated with the creation of these two forms. ----------------------------------------------------------------- 1b. Improvements to problematic INPUT STRING tokens ----------------------------------------------------------------- The algorithm used, prior to ATB5, to create the INPUT STRING tokens for the tree tokens sometimes created incorrect INPUT STRING tokens. (We use the phrase "INPUT STRING token" to mean the INPUT STRING value associated with some tree token; see Section 3 for definitions of all the values associated with each token.) For example, the source token "y>hlhA" might be given a solution resulting in the two vocalized tree tokens "yu+>ah~il+u" and "hA". Using the old algorithm, the INPUT STRING tokens would have been "y>hlh" and "A", clearly incorrect. With the new algorithm, they are instead "y>hl" and "hA". Another example: the source token "EmA" might be given a solution resulting in the two vocalized tree tokens "Em" and "A". Using the old algorithm, the INPUT STRING tokens would have been "Em" and "A". Instead they are now "E" and "mA". The algorithm used since ATB5, and for this release, corrects such cases. However, there is no general solution to the problem of using the source token and vocalized tree tokens in order to split up the source token accordingly. 
The specific solution essentially requires accounting for all of the various sorts of normalization that might occur in SAMA as part of producing the vocalized tree tokens for each future corpus. We plan for future releases to continue utilizing the present improved creation of the INPUT STRING tree tokens, as is done in this release. However, this is not part of the annotation process itself, as explained above, and it is possible that future releases either will not include extensive checking on the creation of these INPUT STRING tree tokens, or will leave such tokens out completely.

(There are some remaining cases that are somewhat trickier to categorize. For example, the source token INPUT STRING "mnA" has the solution "min+nA", separated into two different vocalized tree tokens, with the corresponding UNVOCALIZED tokens "mn" and "nA" (see following section). However, there is only one "n" to distribute among the two tokens for the tree token INPUT STRING. We have chosen to partition this as "m"+"nA", to keep the INPUT STRING representation of the suffix consistent.)

-----------------------------------------------------------------
1c. Improvements to problematic UNVOCALIZED tokens
-----------------------------------------------------------------

As noted above in Section 1a, UNVOCALIZED tokens had an odd sort of hybrid definition. This led to inconsistencies in the treebank. While the vocalized tree tokens have a clear definition as part of the annotation process, and the INPUT STRING tree tokens also have a reasonably clear meaning (even if nontrivial to obtain), this is not true of the UNVOCALIZED tokens. In this release we have simplified the definition to make the UNVOCALIZED tree tokens be the VOCALIZED tree tokens with diacritics stripped out (i.e., treating all tokens in the same way as split tokens were treated in earlier releases of this segment).
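The simplified UNVOCALIZED definition can be illustrated with a minimal Python sketch. This is not the code used to build the release: the function name strip_diacritics and the exact set of Buckwalter symbols treated as diacritics are assumptions made for illustration, and the sketch is only meaningful for tokens in Buckwalter transliteration (a foreign word written in Latin letters would be mangled by it).

```python
# Buckwalter symbols treated as diacritics in this sketch (an assumption,
# not an official list): short vowels a/u/i, sukun o, shadda ~,
# tanween F/N/K, and dagger alif `.
BUCKWALTER_DIACRITICS = set("aiuo~FNK`")

# Segment separators (+) and clitic-attachment hyphens (-) are dropped too.
SEGMENT_MARKS = set("+-")


def strip_diacritics(vocalized: str) -> str:
    """Return a VOCALIZED tree token with diacritics and segment marks removed."""
    stripped = "".join(
        ch
        for ch in vocalized
        if ch not in BUCKWALTER_DIACRITICS and ch not in SEGMENT_MARKS
    )
    # Keep tokens that consist only of marks (e.g. a lone "-") unchanged.
    return stripped or vocalized


if __name__ == "__main__":
    for voc in ["{inoti$Ar+u-", "-hu", "bi-", "-Al+{iqotiSAd+i"]:
        print(voc, "->", strip_diacritics(voc))
```

Applied to the VOCALIZED forms quoted in this document, the sketch yields {nt$Ar, h, b, and Al{qtSAd respectively.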
We illustrate this change with two examples showing what the UNVOCALIZED forms would have been without the current corrections, and how the current definition resolves previous inconsistencies. -------------------------------------------- EXAMPLE 1: -------------------------------------------- 1) ALJZ_NEWS15_ARB_20060111_085801, P51 source token=Ant$Arh yields two tree tokens: tree token P51W14 VOCALIZED: {inoti$Ar+u- IS_TRANS: Ant$Ar UNVOCALIZED: {nt$Ar tree token P51W15 VOCALIZED: -hu IS_TRANS: h UNVOCALIZED: h Since the source token was split, the UNVOCALIZED string for P51W14 was set to the VOCALIZED token with diacritics removed under the old algorithm. 2) ALJZ_NEWS15_ARB_20050104_090001 P62 source token=Ant$Ar yields one tree token: tree token P62W9 VOCALIZED: {inoti$Ar+u IS_TRANS: Ant$Ar UNVOCALIZED: Ant$Ar Since the source token is not split, the UNVOCALIZED string for P62W9 was set to IS_TRANS, under the old algorithm. Therefore, using the old algorithm the two tokens appeared with the same input string (Ant$Ar) and the same vocalized token ({inoti$Ar+u), but different unvocalized tokens ({nt$Ar and Ant$Ar). Using the new algorithm in this release with the current fix, the UNVOCALIZED string for both is {nt$Ar. -------------------------------------------- EXAMPLE 2: -------------------------------------------- 1) ALHURRA_NEWS13_ARB_20050412_130100, P141 source token = AlAqtSAd yields one tree token: tree token P141W21 VOCALIZED: Al+{iqotiSAd+i IS_TRANS: AlAqtSAd UNVOCALIZED: AlAqtSAd Since the source token is unsplit, the UNVOCALIZED string for P141W21 was set to the IS_TRANS string, using the old algorithm. 
2) ALHURRA_NEWS13_ARB_20051124_130100, P225 source token = bAlAqtSAd yields two tree tokens tree token P225W1 VOCALIZED: bi- IS_TRANS: b UNVOCALIZED: b tree token P225W2 VOCALIZED: -Al+{iqotiSAd+i IS_TRANS: AlAqtSAd UNVOCALIZED: Al{qtSAd Since the source token is split, the UNVOCALIZED string for P225W2 was set to the VOCALIZED token with diacritics removed, under the old algorithm. 3) ALHURRA_NEWS13_ARB_20051124_130100, P8 source token = Alan~a_1] UNSPLITVOC: (bi>an~ahu) POS: PREP+SUB_CONJ+PRON_3MS VOC: bi+>an~a+hu GLOSS: by/with+that+it/he This word is status 3 because the given solution is not in SAMA (and so cannot be status 1), and furthermore is status 3 because the VOC value is not the same as the IS_TRANS 4. The source token is a case of punctuation or a foreign word that is not included in the check for consistency with SAMA. For example, in ANN20020115.0001.txt: INPUT STRING: 650 IS_TRANS: 650 INDEX: P1W1 OFFSETS: 0-4 TOKENS: P1W1-P1W1 STATUS: 4 LEMMA: [DEFAULT] UNSPLITVOC: (650) POS: NOUN_NUM VOC: 650 GLOSS: nogloss This status field is also included now as field 8 of the source token in the integrated format. (See Section 4b.) In this release, there are 339710 source tokens, categorized with the following statuses: STATUS 1: 287282 STATUS 2: 949 STATUS 3: 4323 STATUS 4: 47156 ======= 339710 In current annotation and future releases of this segment, the intent is that STATUS 2 will be reserved for those words that are Arabic but are not expected to have a solution in SAMA (DIALECT, TYPO, FOREIGN, etc.), while STATUS 3 will be reserved for those words that are Arabic and would ideally have a solution in SAMA (such as the bAnh example above). STATUS 4 will continue to be used for source tokens that are non-Arabic and so "outside" of SAMA. Please see the file errata.txt for more discussion of the tokens that have status 3. ======================================================= 2b. 
Explicit mapping between source and tree tokens, and other changes in "before" pos file. ======================================================= The "before" pos file contains not only the STATUS information as described in 2a above, but also the full information for this field as it exists in a SAMA solution, and an explicit mapping to the tree tokens. The new field TOKENS: indicates the mapping between the source token and the corresponding tree tokens, which may be a 1-many relationship. For example, the token at index P2W11 in the "before" pos file for ANN20020115.0001.txt is: INPUT STRING: سيشاركون IS_TRANS: sy$Arkwn INDEX: P2W11 OFFSETS: 69-78 TOKENS: P2W12-P2W13 STATUS: 1 LEMMA: [$Arak_1] UNSPLITVOC: (sayu$Arikuwna) POS: FUT_PART+IV3MP+IV+IVSUFF_SUBJ:MP_MOOD:I VOC: sa+yu+$Arik+uwna GLOSS: will+they (people) + participate with/share with + [masc.pl.] which indicates that the two tree tokens, P2W12 and P2W13, arise from this source token, as shown in the "after" pos file: INPUT STRING: س IS_TRANS: s COMMENT: [] INDEX: P2W12 OFFSETS: 69,70 UNVOCALIZED: s VOCALIZED: sa- POS: FUT_PART GLOSS: will INPUT STRING: يشاركون IS_TRANS: y$Arkwn COMMENT: [] INDEX: P2W13 OFFSETS: 70,78 UNVOCALIZED: y$Arkwn VOCALIZED: -yu+$Arik+uwna POS: IV3MP+IV+IVSUFF_SUBJ:MP_MOOD:I GLOSS: they (people) + participate with/share with + [masc.pl.] The POS, VOC, and GLOSS fields in the "before" pos file are simply a concatenation of their respective values in the corresponding tree tokens (with + as a separator and hyphens removed). The LEMMA is now included in the "before" pos file, rather than the "after". That is because the lemma is associated with a SAMA solution, and therefore associated with a source token. Earlier releases took the unnecessary step of assigning the lemma to one particular tree token, sometimes in an arbitrary way, with the other tokens for that lemma assigned the dummy lemma "[clitics]". 
The UNSPLITVOC is the SAMA vocalization for the source token as a single word, which can in some cases be distinct from the VOC formed from the vocalizations of the separate morphemes. ============================================================================ 3. File Extensions and Directory Structure ============================================================================ Each FILE in docs/file.ids has a corresponding file in the following directories. data/tdf/FILE.tdf (utf-8) Source files. data/pos/before-treebank/FILE.txt (utf-8) Information about the tokens used for analysis with SAMA (the "source tokens," in the terminology used in Section 1). So this is a listing of each token before clitic-separation. Each token contains the following information: ----------------------------------------------------------- INPUT STRING: (utf-8 characters from .tdf file) IS_TRANS: (Buckwalter transliteration of previous, used for input to SAMA.) INDEX: (automatically assigned index, based on paragraph&word#) OFFSETS: (start,end - pair of integers - Annotation Graph offset into tdf file, corresponding to the INPUT STRING) TOKENS: (start-end - two indices indicating the tree tokens in the corresponding pos/after-treebank/FILE.txt file that correspond to this source token) STATUS: (the status of this solution, with respect to SAMA.) LEMMA: (the lemma associated with this source token and solution in SAMA.) UNSPLITVOC: (the vocalized form (not separated into segments) of the source token solution, from SAMA) POS: (pos for this source token) VOC: (vocalization for this source token) GLOSS: (gloss for this source token) ----------------------------------------------------------- The POS, VOC, and GLOSS fields are redundant with the respective values of the corresponding tree tokens. See Section 2 above for more detail on these fields, along with STATUS, LEMMA, and UNSPLITVOC. 
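The redundancy between the source-token fields and the tree-token fields can be checked mechanically. The following Python sketch is illustrative only: the helper names expand_tokens and join_tree_values are hypothetical, and expand_tokens assumes that both indices in a TOKENS value lie in the same paragraph (an assumption, although it holds for the examples in this document).

```python
import re

# An index such as "P2W12": paragraph number, then word number.
_INDEX = re.compile(r"P(\d+)W(\d+)$")


def expand_tokens(tokens_field: str) -> list:
    """Expand a TOKENS value such as 'P2W12-P2W13' into the tree token indices."""
    start, end = tokens_field.split("-")
    start_match, end_match = _INDEX.match(start), _INDEX.match(end)
    if start_match is None or end_match is None:
        raise ValueError(f"unexpected TOKENS value: {tokens_field!r}")
    p1, w1 = (int(g) for g in start_match.groups())
    p2, w2 = (int(g) for g in end_match.groups())
    if p1 != p2:
        raise ValueError("this sketch assumes both indices share a paragraph")
    return [f"P{p1}W{w}" for w in range(w1, w2 + 1)]


def join_tree_values(values: list) -> str:
    """Concatenate tree-token values with '+', removing the clitic hyphens."""
    return "+".join(v.strip("-") for v in values)


if __name__ == "__main__":
    print(expand_tokens("P2W12-P2W13"))
    print(join_tree_values(["sa-", "-yu+$Arik+uwna"]))
    print(join_tree_values(["FUT_PART", "IV3MP+IV+IVSUFF_SUBJ:MP_MOOD:I"]))
```

For the sy$Arkwn example in Section 2b, joining the VOCALIZED values of P2W12 and P2W13 reproduces the source token's VOC value sa+yu+$Arik+uwna, and joining their POS values reproduces its POS value.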
data/xml/treebank/FILE.xml

As discussed in Section 1a, this contains the tokens used for POS annotation after they have been split for the purposes of treebank annotation and then modified during treebank annotation with tree information and further POS changes. These are referred to as "tree tokens" in Section 1a.

data/pos/after-treebank/FILE.txt

Information about each tree token in the corresponding xml/treebank FILE.xml file. Each token contains the following information:

-----------------------------------------------------------
INPUT STRING: (utf-8 characters from .tdf file)
IS_TRANS: (Buckwalter transliteration of previous)
COMMENT: (annotator comment about word)
INDEX: (automatically assigned index, based on paragraph&word#)
OFFSETS: (start,end - pair of integers - Annotation Graph offset into tdf file, corresponding to the INPUT STRING)
UNVOCALIZED: (the unvocalized form of the word)
VOCALIZED: (the vocalized form of the word, taken from the solution)
POS: (the pos tag, taken from the solution)
GLOSS: (the gloss, taken from the solution)
-----------------------------------------------------------

See Sections 1b and 1c above for information about the derivation of INPUT STRING tokens and UNVOCALIZED tokens for clitic-separated tree tokens.

data/penntree/without-vowel/FILE.tree

Penn Treebank-style output, generated from the xml/treebank .xml file. Each terminal is of the form (pos word), where pos and word correspond to the POS and UNVOCALIZED values for the corresponding token in pos/after-treebank/FILE.txt, respectively.

data/penntree/with-vowel/FILE.tree

Penn Treebank-style output, generated from the xml/treebank .xml file. Each terminal is of the form (pos word), where pos and word correspond to the POS and VOCALIZED values for the corresponding token in pos/after-treebank/FILE.txt, respectively.

data/integrated/FILE.txt

See Section 4b for a description of the integrated format.
(Note: The file formats in prior releases were somewhat different. Due to the nature of the ongoing improvements, the notes here refer only to this release. For similar information for previous releases, please see the corresponding documentation in those releases.) ============================================================================ 4. Additional Information ============================================================================ ============================================================================ 4a. data/xml/pos/FILE.xml not included ============================================================================ The data/xml/pos/FILE.xml file, if included, would contain the alternatives from the morphological analyzer at the time the analysis was originally done, which are sometimes then modified later in the annotation process. As a result, the POS information in the pos-level .xml files is not necessarily the same as in the treebank-level .xml files. To avoid confusion we therefore do not release the pos-level .xml files. Instead, the data/pos/before-treebank .txt files and the integrated files contain the information regarding the source token and now also with this release further information on the token's analysis in the treebank and its relation to SAMA. (See Sections 2a and 2b above.) ============================================================================ 4b. Description of the "Integrated" format ============================================================================ The goal of this format is to bring together in one place: 1) the information about the source tokens from the pos/before-treebank files, including the explicit mapping between the source and tree tokens. 
2) the information about the tree tokens from the pos/after-treebank files
3) the tree structure

The basic format of each file is:

FILEPREFIX: file metadata (beginning with ;;)
CHUNK: filename:chunk#
chunk metadata (beginning with ;;)
#source tokens: #
#tree tokens: #
#trees: 1
listing of source tokens
listing of tree tokens
TREE: filename:chunk#:tree#:#_tokens_in_tree
tree with W# instead of the tree tokens

with the CHUNK,TREE sections repeated. (The file and chunk metadata is taken directly from the Annotation Graph file and can be ignored or used as the user likes.)

Each CHUNK corresponds to one "Paragraph" in the usual release terminology. (It is possible that in some versions of the treebank more than one tree may be associated with a paragraph, which is why there is a slot for "# trees" after the # of source tokens, and in the TREE: line. In this release it is always 1 tree per chunk, however, with TOP wrapped around it.)

Each source token row consists of the following 11 items, separated by the character U+00B7:

1) s:# - the source token #
2) the source token text in utf8, corresponding to the source text.
3) the source token text, in Buckwalter transliteration
4) the starting offset
5) the ending offset
6) the start index of the corresponding tree token(s)
7) the end index of the corresponding tree token(s)
8) the status of this token with respect to SAMA, as discussed in Section 2a above.
9) The lemma for this source token
10) The unsplit vocalization for this source token
11) A status indicating whether this source token is mapped to corresponding tree tokens. (All "OK")

These fields correspond to the information in the pos/before-treebank files as follows:

1) <-> INDEX (here counting from 0, in the pos/before-treebank file counting from 1)
2) <-> INPUT STRING
3) <-> IS_TRANS
4,5) <-> OFFSETS
6,7) <-> TOKENS (here counting from 0, in the pos/before-treebank file counting from 1)
8) <-> STATUS
9) <-> LEMMA
10) <-> UNSPLITVOC

Each tree token row consists of the following items, separated in the same way:

1) t:# - the tree token #
2) POS tag
3) "f" or "t" - a boolean indicating whether this token was split from the previous tree token
4) "f" or "t" - a boolean indicating whether this token was split from the following tree token
5) vocalized form
6) gloss
7) offset start
8) offset end
9) text in utf8, corresponding to the source .tdf text.
10) unvocalized form
11) comment

(In the example rows below, one further field, with the value "nolemma", appears between the gloss and the offset start. It is not counted in the numbering used here.)

These fields correspond to the information in the pos/after-treebank files as follows:

1) <-> INDEX (here counting from 0, in the pos/after-treebank file counting from 1)
2) <-> POS
3,4,5) <-> VOCALIZED (the boolean split information appears there as hyphens on the VOCALIZED form.)
6) <-> GLOSS
7,8) <-> OFFSETS
9) <-> INPUT STRING
10) <-> UNVOCALIZED
11) <-> COMMENT

For example if a file has a chunk with:

#source tokens:27
#tree tokens:32
#trees:1
s:0 ·واوضح·wAwDH·0·5·0·1·1·[>awoDaH_1]·(wa>awoDaHa)·OK
[...]
t:0 ·CONJ·f·t·wa·and·nolemma·0·1·و·w·[]
t:1 ·PV+PVSUFF_SUBJ:3MS·t·f·>awoDaH+a·clarify/explain/indicate + he/it [verb]·nolemma·1·5·اوضح·>wDH·[]
[...]
TREE:AAW_ARB_20080502.0027-S1:2:1:32
(TOP (S W0 (VP W1 (NP-SBJ (-NONE- *)) (SBAR W2 (S (NP-TPC-1 W3 (NP W4)) (VP (PRT W5) W6 (NP-SBJ-1 (-NONE- *T*)) (PP-CLR W7 (NP (NP (ADJP W8 (NP W9))) (NP-ADV W10 (NP (NP W11) (SBAR (SBAR (WHNP-2 W12) (S (VP W13 (NP-SBJ-2 (-NONE- *T*)) (NP-OBJ-2 (-NONE- *)) (PP-CLR W14 (NP W15)) (PP W16 (NP (NP W17 (NP W18)) (PP W19 (NP W20 (NP (NP W21 W22) (ADJP W23))))))))) W24 W25 (SBAR (WHNP-3 (-NONE- *0*)) (S (NP-SBJ W26 (NP (NP W27) (NP-3 (-NONE- *T*)))) (NP-PRD W28 (NP W29 (NP W30)))))))))))))) W31))

This indicates that:

1) The source token #0 maps to tree tokens #0 and #1. The source token text is wAwDH, and the two corresponding tokens are wa/CONJ and >awoDaH+a/PV+PVSUFF_SUBJ:3MS. They are represented in the VOCALIZED field in the pos/after-treebank file with wa- and ->awoDaH+a, whereas here the hyphen is indicated by the f/t and t/f values in t:0 and t:1. (Of course this information is also redundant with the mapping from the source token.)

2) The source token offset <0,5> for the source token wAwDH has been partitioned into <0,1> for wa and <1,5> for >awoDaH+a.

3) The leaves W0 and W1 in the tree correspond to the tree tokens wa- and ->awoDaH+a.

============================================================================
4c. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than one complete tree. The reason for this is that the annotators work on one "Paragraph" (CHUNK) at a time - e.g., a tree with the root "Paragraph" node, as can be seen by looking at the "treebanking" feature in the xml files. When the trees are generated, the "Paragraph" node is dropped. If the annotation has the form (Paragraph S1 S2), where S1 and S2 are both complete trees, then they will appear on one line. This is also true for the integrated format, except that in that format "TOP" appears instead of "Paragraph."
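
For readers who want to process the integrated format of Section 4b programmatically, the following Python sketch shows one way to read a source token row and a TREE: line. It is an illustration under stated assumptions, not a tool distributed with this release: the names SourceToken, parse_source_row, and parse_tree_header are hypothetical, and the sketch assumes that no field value itself contains the U+00B7 separator.

```python
from dataclasses import dataclass

SEP = "\u00b7"  # field separator used in the integrated format


@dataclass
class SourceToken:
    number: int          # 1) source token #, counting from 0
    text: str            # 2) utf8 source text
    translit: str        # 3) Buckwalter transliteration
    offset_start: int    # 4) starting offset
    offset_end: int      # 5) ending offset
    tree_start: int      # 6) first corresponding tree token
    tree_end: int        # 7) last corresponding tree token
    status: int          # 8) status with respect to SAMA
    lemma: str           # 9) lemma
    unsplit_voc: str     # 10) unsplit vocalization
    mapping_status: str  # 11) mapping status ("OK")


def parse_source_row(row: str) -> SourceToken:
    """Split one 's:#' row into its 11 fields."""
    fields = [f.strip() for f in row.split(SEP)]
    if len(fields) != 11 or not fields[0].startswith("s:"):
        raise ValueError(f"not a source token row: {row!r}")
    return SourceToken(
        int(fields[0][2:]), fields[1], fields[2],
        int(fields[3]), int(fields[4]), int(fields[5]), int(fields[6]),
        int(fields[7]), fields[8], fields[9], fields[10],
    )


def parse_tree_header(line: str):
    """Split 'TREE:filename:chunk#:tree#:#tokens' into its four parts."""
    if not line.startswith("TREE:"):
        raise ValueError(f"not a TREE line: {line!r}")
    # Split from the right so that the filename itself is left untouched.
    filename, chunk, tree, n_tokens = line[len("TREE:"):].rsplit(":", 3)
    return filename, int(chunk), int(tree), int(n_tokens)


if __name__ == "__main__":
    row = SEP.join(["s:0 ", "واوضح", "wAwDH", "0", "5", "0", "1", "1",
                    "[>awoDaH_1]", "(wa>awoDaHa)", "OK"])
    token = parse_source_row(row)
    print(token.translit, token.tree_start, token.tree_end)
    print(parse_tree_header("TREE:AAW_ARB_20080502.0027-S1:2:1:32"))
```

For the example chunk in Section 4b, the parsed source token s:0 points to tree tokens 0 through 1, and the TREE: line yields chunk 2, tree 1, and 32 tokens, matching the "#tree tokens:32" count.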