1. Overview

We used the ISO 8859-6 character-encoding standard for the script
version of the Arabic transcripts, with some exceptions for handling
the special symbols used in LDC transcripts. In addition, we made some
changes in formatting from the romanized version in order to work
around the conflict between logical order and display order described
below.

Mixing (right-to-left) Arabic with (left-to-right) ASCII information
presents a bit of a problem. Suppose you have a stretch of text whose
_logical_ order is:

    [Timestamp] [Arabic text 1] [English text] [Arabic text 2]

The most readable _display_ order for this text might be (where
"R-to-L" means "in right-to-left visual order"):

    [Timestamp] [Arabic 2, R-to-L] [English, L-to-R] [Arabic 1, R-to-L]

although one could choose a number of other reasonable display orders,
for example:

    [Timestamp] [Arabic 1, R-to-L] [English, L-to-R] [Arabic 2, R-to-L]
    [Arabic 2, R-to-L] [English, L-to-R] [Arabic 1, R-to-L] [Timestamp]

The Unicode Standard has a solution for this problem, but unfortunately
we did not have a text editor available that handles Arabic Unicode.
So, in order to make the correspondence between presentation and
logical orders as straightforward as possible, we adopted a strategy of
inserting newlines wherever a change of direction occurs. So the above
text would be encoded as

    [Timestamp]
    [Arabic text 1]
    [English text]
    [Arabic text 2]

More detail (including some exceptional cases) below.

2. Turn boundaries

A turn starts with a timestamp like the timestamps used in the
romanized transcripts, but it should be on a line by itself with no
following text.

    204.17 205.74 A:
    text line 1
    text line 2...

Each non-blank line following the timestamp is part of the turn. The
turn is terminated by one or more blank lines, or by the end of the
file.

3. Meta-characters

In several cases below we use the term "meta-character" to mean the
corresponding ASCII character with its 8th bit turned on. For example, "meta-?"
means an ASCII question mark (octal 077, decimal 63, hex 3F) plus the
8th bit, yielding an ISO 8859-6 Arabic question mark (octal 277,
decimal 191, hex BF).

4. Punctuation

The transcribers used punctuation relatively infrequently in the
transcripts. Punctuation marks that are not special symbols (see below)
are limited to question marks and commas. These were converted into the
corresponding ISO-Arabic codes, "meta-?" and "meta-comma".

5. Special symbols

Most of the special symbols in the romanized transcripts mark sections
of the file that are not conventional, scoreable Arabic speech. Thus
they were rendered in the Arabic script version as plain ASCII, with a
few modifications to make the text a bit easier to parse.

The following were passed as-is (and isolated from Arabic script text
by newlines if necessary):

    {text}      sound made by the talker
    [text]      sound not made by the talker (background or channel)
    [[text]]    comment
                speech in another language
                  Even Arabic dialects, such as MSA and Upper, were
                  left as roman and not converted to script. BUT see
                  below for the handling of combinations like "il+".
    ((text))    unintelligible; text is best guess at transcription
                  The text was not translated to Arabic script, BUT
                  see below for the handling of combinations like
                  "il+(( ))".
    (( ))       unintelligible; can't even guess text
                unintelligible speech in unknown foreign language
    **text**    idiosyncratic word, not in common use
    -text       partial words
    text-

Annotations using the following symbols were modified in most cases.
However, if they occurred _inside_ one of the above bracketed
categories (for example, ""), they were left as they were in the
original romanized transcript.

    #text#      simultaneous speech on the same channel
                  Converted to: text with intervening newlines in the
                  "text" as necessary to accommodate directional
                  changes.
    //text//    aside (talker addressing someone in background)
                  Converted to: with intervening newlines in the
                  "text" as necessary to accommodate directional
                  changes.
    +text+      mispronounced word (spelled in usual orthography)
                  These were changed to have a single flag character
                  ("+word"), like other annotations that affect a
                  single word. In addition, we used "meta-plus"
                  instead of an ASCII plus. This character is
                  unassigned in ISO 8859-6, but is a "double less
                  than" symbol in ISO 8859-1. The Mule editor can
                  display this as an Arabic (right-to-left) character,
                  so the ISO-to-Mule package (see below) displays it
                  as "double less than" (aka "left chevron" or "left
                  double quote").
    %text       non-lexeme
                  These were converted to _ASCII_ exclamation points,
                  because this character could be displayed in Mule
                  without making the ISO translation too strange.
                  These appear in logical order in the files, i.e.
                      !word
                  and appear at the beginning (right) of the word
                  when displayed in Mule, i.e.
                      drow!
    &text       used to mark proper names and place names
                  These were converted to "meta-semicolon", which is
                  an ISO Arabic semicolon character. They appear in
                  the same order as "%" -> "!" above. The romanized
                  transcripts contain many cases in which an
                  "inseparable" article or preposition is attached to
                  a proper name, such as:
                      "il+&a$raf" "bi+&amAl" "bi+il+&cagami"
                  These are rendered in the script version with the
                  marker at the front of the entire word, e.g.
                      (meta-;)ila$raf (meta-;)biamAl
    text --     marks end of interrupted turn and continuation
    -- text     of same turn after interruption
                  Such hyphens were treated like incomplete words,
                  that is, they were separated from Arabic script by
                  newlines but placed on the same line as
                  left-to-right text such as incomplete words.

6. Special cases

a. combined Arabic/non-Arabic tokens

The transcripts occasionally contain mixtures such as:

    il+(( )) li+(( )) il+((SiHHaB~)) il+ il+ il+ fa+ il+

These present a problem, since putting a newline between the Arabic and
non-Arabic would lose the information that in some sense they are
single tokens.
We tried to choose a fairly neutral compromise: we isolated such cases
on individual lines and converted the unbracketed portion to Arabic
script. Users of the transcripts are then free to do what they like
with such cases. Note that Mule displays these in "backwards" order,
with the Arabic on the left and the English on the right, but the
logical order in the file is correct.

b. ambiguous spellings

As explained elsewhere in the documentation of this corpus, the
romanized text is occasionally ambiguous with respect to the Arabic
script. There are several hundred entries in the lexicon with two or
more possible spellings in Arabic script, corresponding to several
thousand ambiguous lexical tokens in the transcripts.

These were handled as follows: First, we output the alternatives,
separated by "||" and placed on separate lines to prevent ambiguity of
turn order. Then, the transcription staff searched for vertical bars in
the transcripts and edited them manually to choose the correct
alternative.

7. MULE editor and viewing the text

We used an ISO-to-Mule converter graciously provided by TAKAHASHI Naoto
to view and edit the Arabic script transcripts in the MULE
(multi-lingual emacs) editor. With his permission, we are including his
converter in this distribution, with a few alterations we made to
facilitate its use with our transcripts.

Mule is freely available via anonymous FTP from various sites,
including ftp.cs.buffalo.edu and etlport.etl.go.jp. The ISO-to-Mule
converter seems to work fine with Mule version 2.2 (1994) and version
2.3 (1995).
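
For readers who want to process the files without Mule, the turn format
of section 2 and the meta-character convention of section 3 can be
sketched in Python roughly as follows. This is only an illustration and
is not part of the distribution: the names meta() and parse_turns() are
hypothetical, the timestamp pattern is inferred from the single example
in section 2 and may not cover every file, and the sample input is
invented.

```python
import re

# Assumed shape of a turn header, inferred from the one example in
# section 2 ("204.17 205.74 A:"); real files may differ.
TIMESTAMP = re.compile(rb"^(\d+\.\d+) (\d+\.\d+) ([A-Z]\d*):\s*$")


def meta(ch: str) -> bytes:
    """Return the 'meta' form of a 7-bit ASCII character (8th bit set)."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("meta() expects a 7-bit ASCII character")
    return bytes([code | 0x80])


def parse_turns(data: bytes):
    """Split raw file bytes into (start, end, speaker, lines) turns.

    A turn starts at a timestamp line and ends at a blank line or at
    the end of the data, as described in section 2.
    """
    turns, current = [], None
    for line in data.split(b"\n"):
        line = line.rstrip(b"\r")
        m = TIMESTAMP.match(line)
        if m:
            current = (float(m.group(1)), float(m.group(2)),
                       m.group(3).decode("ascii"), [])
            turns.append(current)
        elif not line.strip():
            current = None              # a blank line terminates the turn
        elif current is not None:
            current[3].append(line)     # non-blank line belongs to the turn
    return turns


# Invented sample. The three bytes E6 D9 E5 are the ISO 8859-6 codes
# for the Arabic letters noon, ain, meem; they are followed by "meta-?".
sample = (b"204.17 205.74 A:\n"
          + b"\xe6\xd9\xe5" + meta("?") + b"\n"
          + b"{laugh}\n"
          + b"\n"
          + b"205.80 206.10 B:\n"
          + b"\xe4\xc7\n")

turns = parse_turns(sample)
for start, end, speaker, lines in turns:
    print(start, end, speaker, len(lines))
```

Lines are kept as raw bytes on purpose. Most of them can be decoded
with line.decode("iso8859_6"), but "meta-plus" (hex AB) is unassigned
in ISO 8859-6, so a strict decode is expected to reject lines that
contain it; a user would need to handle that byte separately.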