Contents:

1. Two Levels of Annotation
2. File Extensions and Directory Structure
3. Changes From Previous Release Structure
4. The UNVOCALIZED form in the after-treebank data
5. A note about multiple trees on one line

============================================================================
1. Two Levels of Annotation
============================================================================

The next section describes in detail the different file extensions and
directories used to store the annotated data. The key fact underlying this
organization is that there are two separate stages of annotation, as
described under "Annotation Process" in readme.html, and these are stored
in two separate Annotation Graph files, from which other files are
generated.

1) POS annotation - This is the selection of a POS tag for a token from
   the source text file. In previous releases, the Annotation Graph .xml
   file containing this information was stored in the data/xml/pos
   directory. For reasons discussed in Section 3, this file is not
   included in this release. We are, however, still including the text
   files in data/pos/before-treebank, which contain the key information
   from those .xml files.

2) Treebank annotation - The tokens from the POS annotation are modified,
   by splitting off clitics, to create the tokens used for Treebank
   annotation. These tokens, and the new tree structure, are stored in
   different Annotation Graph .xml files. These files are stored, as in
   previous releases, in data/xml/treebank. Various other files are
   generated from this data: the text files in data/pos/after-treebank and
   the three different tree files in data/penntree and its subdirectories.

============================================================================
2. File Extensions and Directory Structure
============================================================================

Each FILE in docs/file.ids has a corresponding file in the following
directories.

data/sgm/FILE.sgm (utf-8)

    Processed source files in sgml format. Please note that there is a
    parallel text and English Treebank corpus that has been developed at
    LDC for these same 599 source files and that has been released to the
    GALE community and will be published soon.

data/pos/before-treebank/FILE.txt (utf-8)

    Information about the tokens used for the original analysis with the
    Buckwalter analyzer. This is a listing of each token before
    clitic-separation. Each token contains the following information:

    -----------------------------------------------------------
    INPUT STRING: (utf-8 characters from .sgm file)
    IS_TRANS: (Buckwalter transliteration of previous, used for input to
              analyzer)
    COMMENT: (annotator comment about word)
    INDEX: (automatically assigned index, based on paragraph&word#)
    OFFSETS: (start,end) - pair of integers - Annotation Graph offset into
             .sgm file, corresponding to INPUT STRING
    -----------------------------------------------------------

    Future releases will also include a listing of alternatives for
    IS_TRANS from the current version of the Buckwalter analyzer, with one
    marked as the correct solution. For discussion of why this is not
    included in this release, see Section 3 below.

    Both INPUT STRING and IS_TRANS are trimmed so that any leading and
    trailing whitespace pointed to by the OFFSETS is deleted.

data/xml/treebank/FILE.xml

    As discussed in Section 1, this contains the tokens that result from
    splitting the tokens used for POS annotation (those listed in
    data/pos/before-treebank/FILE.txt) for the purposes of treebank
    annotation, together with the treebanking information and further POS
    changes.

data/pos/after-treebank/FILE.txt

    Information about each token in the corresponding data/xml/treebank
    .xml file. This is a listing of each token after clitic-separation.
    Each token contains the following information:

    -----------------------------------------------------------
    INPUT_STRING: (utf-8 characters from .sgm file)
    IS_TRANS: (Buckwalter transliteration of previous)
    COMMENT: (annotator comment about word)
    INDEX: (automatically assigned index, based on paragraph&word#)
    OFFSETS: (start,end) - pair of integers - Annotation Graph offset into
             sgm file
    UNVOCALIZED: (the unvocalized form of the word)
    VOCALIZED: (the vocalized form of the word, taken from the solution)
    VOC_STRING: (the Arabic utf-8 of the vocalized form)
    POS: (the pos tag, taken from the solution)
    GLOSS: (the gloss, taken from the solution)
    LEMMA: (the lemma, taken from the solution)
    -----------------------------------------------------------

    For further discussion of these items, see Section 3 below.

data/penntree/without-vowel/FILE.tree

    Penn Treebanking style output, generated from the data/xml/treebank
    .xml file. Each terminal is of the form (pos word), where pos and word
    correspond to the POS and UNVOCALIZED values, respectively, for the
    corresponding token in pos/after-treebank/FILE.txt.

data/penntree/with-vowel/FILE.tree

    Penn Treebanking style output, generated from the data/xml/treebank
    .xml file. Each terminal is of the form (pos word), where pos and word
    correspond to the POS and VOCALIZED values, respectively, for the
    corresponding token in pos/after-treebank/FILE.txt.

data/penntree/combined-utf8/FILE.tree

    Also generated from the data/xml/treebank .xml file. We are including
    this new combined form to make it easier to relate the full
    information about each word to the tree structure, without having to
    work with the data/xml/treebank/FILE.xml or the information in the
    data/pos/after-treebank/FILE.txt and
    data/penntree/with(out)-vowel/FILE.tree files. These trees are not
    meant to be easy for people to read, but rather to collect in one
    place all the relevant information for further processing as people
    choose.

    This file format is a mix of penntree-like representation and a variant
    of the text information in the pos/after-treebank/FILE.txt files. The
    tree contains stand-ins for each of the lexical items, e.g.:

    (FRAG (NP (NOUN_NUM W1) (NP (NOUN+CASE_INDEF_ACC W2) ....

    and then following the tree the items W1, W2, etc. are listed. Each
    such W item has the following format on one line, with the character
    U+00B7 used as a delimiter:

    IS_TRANS COMMENT INDEX OFFSET start OFFSET end UNVOCALIZED VOCALIZED GLOSS LEMMA BAMAVOC LOOKUP_STATUS

    All except the last two items are exactly as in the
    pos/after-treebank/FILE.txt file. BAMAVOC and LOOKUP_STATUS are two
    additional pieces of information, found explicitly or implicitly in
    the xml/treebank/FILE.xml file, and are provided for obsessive
    completeness only.

    BAMAVOC is the relevant substring of the vocalized form produced by
    the BAMA morphological analyzer, vocalized without any segmentation.
    It is included as part of the solution for a word in the .xml file,
    along with the lemma and the usual vocalized form. (How it differs
    from the vocalized form is beyond the scope of this readme.)

    The LOOKUP_STATUS gives some additional information as to whether the
    token was actually sent through BAMA originally:

    3 - The most common value. Indicates that the token was part of a
        token from the source file that was passed into BAMA.
    1 - Indicates that the word was not sent through BAMA. This is
        usually limited to punctuation or numbers.
    2 - Indicates a word with an empty UNVOCALIZED form. This occurs in
        the few cases in which the vocalized form is a suffix consisting
        entirely of diacritics, in which case we use the dummy term
        "nullp" for the UNVOCALIZED form.

============================================================================
3. Changes From Previous Release Structure
============================================================================

============================================================================
3a. data/xml/pos/FILE.xml not included:
============================================================================

This release contains a substantially modified treebank in terms of the
tags, the tokenization, and the trees. These changes were incorporated
only into the data/xml/treebank .xml files, since those .xml files, not
the ones in data/xml/pos, are used for ongoing annotation and
modification. The data/xml/pos .xml file contains the alternatives from
the Buckwalter analyzer at the time the analysis was originally done,
which do not include the new POS changes. As a result, the POS information
in the .xml files previously released in the data/xml/pos directory is
obsolete and should be ignored, and so is not included here.

We have instead extracted all the relevant information about the original
tokens from the .sgm file sent into the BAMA analyzer and included it in
the data/pos/before-treebank .txt files, as discussed above in Sections 1
and 2.

============================================================================
3b. For the pos/before-treebank/FILE.txt file, the changes are as follows:
============================================================================

1. The LOOK-UP word is now called the IS_TRANS, for "input string
   transliteration". For non-punctuation/number items, this is the same
   as what was previously labeled the LOOK-UP word. Punctuation and
   numbers did not previously have a LOOK-UP word, since they were not
   sent through the analyzer.

2. The OFFSETS into the .sgm file are now included. This is done for two
   reasons. First, on general principle, so that this information can be
   obtained without having to go through the .xml files.
   Second, to make it easier to relate the token information in the
   pos/before-treebank and the pos/after-treebank files, again without
   having to go through the corresponding .xml files as intermediaries.

3. Earlier releases included the various POS alternatives at the time of
   POS annotation. For the same reason that the data/xml/pos .xml file is
   not included, as discussed in Section 3a above, these POS alternatives
   are no longer included.

============================================================================
3c. For the pos/after-treebank/FILE.txt file, the changes are as follows:
============================================================================

1. What used to be called the LOOKUP-WORD is now called UNVOCALIZED. The
   concept of LOOKUP-WORD for the after-treebank files is actually
   meaningless, since the lookup in the Buckwalter analyzer is done only
   on the before-treebank words. This field is now called UNVOCALIZED to
   make it clear that it is the value used in the without-vowel files.
   See Section 4 below for some more discussion of this.

2. IS_TRANS is the Buckwalter transliteration of the INPUT_STRING. This
   information was not included before (although it was derivable from
   the INPUT_STRING). As discussed in the paper cited in Section 4, it is
   *not* necessarily the same as the former LOOKUP-WORD, now UNVOCALIZED.

3. OFFSETS - offsets into the .sgm file. Previously, this information was
   contained only in the corresponding data/xml/treebank/FILE.xml file.
   As indicated above for the pos/before-treebank/FILE.txt file, it is
   our hope that this will make it easier to relate the unsplit tokens in
   the before-treebank and after-treebank pos files. The OFFSETS in this
   file correspond to the INPUT_STRING.

4. VOCALIZED, GLOSS, and LEMMA are now included as separate fields.
   Previously they were contained only in the solution marked with an
   asterisk among the POS alternatives. The POS alternatives are not
   included, both for the same reason that they are not included in the
   pos/before-treebank/FILE.txt file, and also because in any case the
   POS alternatives are only relevant for the tokens as sent through
   BAMA, not the tokens as used for treebank annotation.

5. VOC_STRING is the Arabic utf-8 string corresponding to the VOCALIZED
   solution. This information was not previously included (although it
   was derivable from the vocalized information in the solution).

============================================================================
3d. penntree/combined-utf8/FILE.tree:
============================================================================

This tree was not included before. As discussed above in Section 2, under
data/penntree/combined-utf8/FILE.tree, it includes all of the information
from the corresponding pos/after-treebank/FILE.txt file. Our hope is that
this will make it easier for users to utilize the various aspects of the
annotation, without needing to spend time aligning separate files.

============================================================================
4. The UNVOCALIZED form in the after-treebank data
============================================================================

For tokens which are not split, the offset and other information does not
change between the before-treebank and after-treebank files (aside from
the INDEX, which is created on the fly as the POS files are created, and
so can change depending on the number of previous tokens).

The situation is more complicated for split tokens, in which the
UNVOCALIZED form of the word was created by deleting diacritics from the
relevant segment of the vocalized solution.
The following paper discusses this issue in detail:

    Mohamed Maamouri, Seth Kulick, Ann Bies
    Diacritic Annotation in the Arabic Treebank and Its Impact on Parser
    Evaluation
    LREC 2008, Marrakech, Morocco, May 28-30, 2008
    http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf

Given the differences between what was formerly called the LOOK-UP WORD
and the original INPUT STRING, it is not accurate to call this field a
LOOK-UP WORD, which is why we now call it UNVOCALIZED.

In addition to the case of the UNVOCALIZED form in split tokens as
discussed in the paper, it is currently the case that some other words can
be "out of sync" between the UNVOCALIZED and VOCALIZED form, since it is
only the VOCALIZED forms that have been modified as part of the revision.
For example, it is possible that a token An was earlier analyzed as an~,
in which case the UNVOCALIZED form is still an~a.

============================================================================
5. A note about multiple trees on one line
============================================================================

It is possible for one line in the .tree files to include more than one
complete tree. The reason for this is that the annotators work on one
"Paragraph" at a time, i.e., a tree with the root "Paragraph" node, as can
be seen by looking at the "treebanking" feature in the xml files. When the
trees are generated, the "Paragraph" node is dropped. If the form of the
annotation was (Paragraph S1 S2), where S1 and S2 are both complete trees,
then they will appear on one line.
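
For users who need one tree per item, a line of this kind can be separated
by tracking parenthesis depth. The following is only a minimal sketch and
is not part of the release tools. The function name split_trees and the
example tags and W items are hypothetical. It also assumes that
parentheses occur only as tree brackets and never as literal characters
inside a terminal, which this readme does not state.

```python
def split_trees(line):
    """Split one line of a .tree file into its complete top-level trees.

    Assumption (not stated in the readme): '(' and ')' are used only as
    tree brackets, never as literal characters inside a terminal.
    """
    trees = []
    depth = 0      # current bracket nesting depth
    start = None   # index where the current top-level tree begins
    for i, ch in enumerate(line):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')' at column %d" % i)
            depth -= 1
            if depth == 0:
                # A top-level tree has just been closed.
                trees.append(line[start:i + 1])
    if depth != 0:
        raise ValueError("unbalanced '(' in line")
    return trees


# Hypothetical line holding two complete trees, as in (Paragraph S1 S2)
# after the "Paragraph" node has been dropped.
example = "(S (NP (NOUN W1)) (VP (VERB W2))) (FRAG (NP (NOUN W3)))"
for tree in split_trees(example):
    print(tree)
```

Raising an error on unbalanced brackets, rather than silently returning a
partial tree, makes it easier to notice lines where the assumption about
literal parentheses does not hold.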