Arabic Treebank: Part 1 v 3.0
Publication 11/28/04

TECHNICAL CHARACTERISTICS OF THE CORPUS

The AFP newswire collection comes to the LDC via a dedicated satellite downlink. The satellite transmission is converted to serial data and captured as text on a Sun/sparc server by means of a special modem.

The text stream, as initially captured, consists of ASCII data only, in which the alphabetic portion of the ASCII table is used to convey the ISO-8859-6 Arabic character set. In particular, the ASCII byte values for "A" through "r", which cover the hexadecimal range 0x41 - 0x72, represent the Arabic characters in the ISO-8859-6 table labeled "hamza" through "sukun", which cover the hexadecimal range 0xC1 - 0xF2. The ASCII digits, whitespace and common punctuation characters (period, comma, colon, etc.) represent themselves in the text.

The first stage of processing after capture therefore consists of converting the data from the AFP-specific character encoding to the ISO-8859-6 encoding, by adding a numeric value of 0x80 (i.e. setting the high bit) to each alphabetic character in the raw data. (Conversion to any other standard character encoding, such as UTF-8, is done later as needed, using standard conversion tools.)

Another feature of the AFP text stream is that it is pre-formatted for line breaks and paragraph structure, and the characters on each line are presented in strict right-to-left display order. This has an impact on strings of digits, which readers of printed Arabic interpret in left-to-right order. The following illustrates the problem:

Sample phrase:

    this is day 10 in the year 1987.
How it looks when printed in Arabic (assume that lower-case letters are actually Arabic script characters; the angle brackets below the line show the direction in which the reader scans each token on the line):

    .1987 raey eht ni 10 yad si siht
     >>>> <<<< <<< << >> <<< << <<<<

How the characters are ordered in the original AFP text stream:

    first byte: this is day 01 in the year 7891. :last byte

When this sample AFP text stream is fed to a simple right-to-left display/print device, the characters and digit strings are rendered in the order that the Arabic reader expects. In other words, to present the text in strict right-to-left display order, AFP must reverse the logical ordering of digit strings.

As a result, another stage of processing to prepare this data for annotation involves locating all strings of two or more digits and inverting them to restore logical ordering. (Some care is needed to correctly order digit strings that have internal and/or adjacent punctuation characters.)

HOW DATA IS ORGANIZED IN THE ANNOTATION GRAPH (AG) XML FILE

All annotation is saved in AG-based XML (the DTD files are under the docs/ directory). For the convenience of people who do not currently use AG, we have also generated the POS and treebank output as text files.

In the AG files, information is organized as follows (see scripts in appendix/bin/):

1) Each AG file can have many paragraphs.
2) Each paragraph can have many words and a tree structure.
3) Each word has two feature annotations: 'word' and 'solution'.
4) Furthermore, 'word' has annotation fields for lookup-word, selection and comment, while 'solution' has a list of fields for possible POS candidates and gloss.
5) The selection field belonging to the 'word' points to the solution that is selected by the annotator.

In the treebank annotation, the trees are saved at the paragraph level with the metadata name 'treebanking', while any corresponding comment is saved under the metadata name 'tbcomment'.
For example, we have:

    ( Paragraph ( X 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ) ( S ( NP-TPC-1 14 15 ) ( VP 16 ( NP-SBJ-1 *T* ) ( PP 17 ( NP 18 .... LOC 56 ( NP 57 ( NP ( NP 58 ( NP 59 60 ) ) ( PP-LOC 61 ( NP 62 ( NP 63 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ( ADVP-TMP-4 *T* ) ) ) ) ) 64 ) )

where the numbers are the indices of the words in the paragraph (0-based).

INDICATOR OF CLITIC SEPARATION

When a clitic is separated, we add a "-" on each side of the separation to indicate where the separation happens. For example, '>an~a+ka' (that+you) becomes '>an~a-' and '-ka' in the clitic separation process.
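
The clitic marking convention can be illustrated with a minimal Python sketch. It is not the tool used to build the corpus: the function name separate_clitics is hypothetical, and the sketch assumes that every '+' in the input marks a point where a clitic is to be separated. A middle segment is assumed to receive a "-" on both sides, which the single example above does not confirm.

```python
from typing import List


def separate_clitics(token: str, boundary: str = "+") -> List[str]:
    """Split a token at each boundary and mark both sides of each cut with '-'.

    Assumption: every occurrence of `boundary` is a clitic separation point.
    """
    parts = token.split(boundary)
    if len(parts) == 1:
        return [token]  # nothing to separate

    segments = []
    for i, part in enumerate(parts):
        left = "-" if i > 0 else ""                 # a cut precedes this segment
        right = "-" if i < len(parts) - 1 else ""   # a cut follows this segment
        segments.append(left + part + right)
    return segments


example = separate_clitics(">an~a+ka")  # the example token from the text
```

Here example holds the two segments '>an~a-' and '-ka', matching the description above.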
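
The bracketed paragraph trees shown in the AG section refer to words only by their 0-based index. The following sketch shows one way a reader of the text output might recover those indices or substitute the words back into a tree. The names leaf_indices and fill_words, the short sample tree, and the placeholder word list are all hypothetical. The sketch assumes that any token consisting solely of digits is a word index, and that labels and traces such as NP-TPC-1 or *T* never consist solely of digits.

```python
import re
from typing import List

# A token is either a single parenthesis or a run of non-space, non-paren characters.
TOKEN = re.compile(r"[()]|[^\s()]+")


def leaf_indices(tree: str) -> List[int]:
    """Return the word indices in the order they appear in the tree."""
    return [int(t) for t in TOKEN.findall(tree) if t.isdecimal()]


def fill_words(tree: str, words: List[str]) -> str:
    """Replace each word index in the tree with the word at that position."""
    out = []
    for t in TOKEN.findall(tree):
        out.append(words[int(t)] if t.isdecimal() else t)
    return " ".join(out)


# Hypothetical miniature tree in the same notation, with placeholder words.
sample_tree = "( Paragraph ( S ( NP-TPC-1 0 1 ) ( VP 2 ( NP-SBJ-1 *T* ) ) 3 ) )"
sample_words = ["w0", "w1", "w2", "."]

indices = leaf_indices(sample_tree)
filled = fill_words(sample_tree, sample_words)
```

Tokenizing on whitespace and parentheses keeps the digits inside labels such as NP-TPC-1 from being mistaken for word indices.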
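
The two preprocessing stages described under TECHNICAL CHARACTERISTICS OF THE CORPUS can also be sketched in Python. This is an illustration under stated assumptions, not the LDC's actual conversion code: the function names are hypothetical, the byte range 0x41 - 0x72 is taken directly from the description above, and the digit step deliberately ignores the punctuation cases that the text says need extra care.

```python
import re

# ASCII "A" .. "r", the range the text says carries the Arabic characters.
AFP_FIRST, AFP_LAST = 0x41, 0x72


def afp_to_iso8859_6(raw: bytes) -> bytes:
    """Stage 1: add 0x80 (set the high bit) on every byte in the AFP range."""
    return bytes(b + 0x80 if AFP_FIRST <= b <= AFP_LAST else b for b in raw)


def restore_digit_order(text: str) -> str:
    """Stage 2: reverse every run of two or more ASCII digits.

    Simplification: digit strings with internal or adjacent punctuation
    (for example decimal numbers) are not given any special treatment.
    """
    return re.sub(r"[0-9]{2,}", lambda m: m.group(0)[::-1], text)


# Stage 1 on a tiny byte string: "A" and "r" are shifted, the rest is untouched.
converted = afp_to_iso8859_6(b"A r 10.")

# Stage 2 on the sample stream from the text (Latin letters stand in for Arabic).
logical = restore_digit_order("this is day 01 in the year 7891.")
```

In this sketch converted is the byte string C1 20 F2 20 31 30 2E, which Python's standard iso8859_6 codec can then decode, and logical is the sample phrase with its digit strings back in logical order.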