1.  Overview

We used the ISO 8859-6 character-encoding standard for the script
version of the Arabic transcripts, with some exceptions for handling
the special symbols used in LDC transcripts.  In addition, we made
some changes in formatting from the romanized version in order to work
around the conflict between logical order and display order described
below.

Mixing (right-to-left) Arabic with (left-to-right) ASCII information
presents a bit of a problem.  Suppose you have a stretch of text whose
_logical_ order is:

  [Timestamp] [Arabic text 1] [English text] [Arabic text 2]

The most readable _display_ order for this text might be (where
"R-to-L" means "in right-to-left visual order"):

  [Timestamp] [Arabic 2, R-to-L] [English, L-to-R] [Arabic 1, R-to-L]

although one could choose a number of other reasonable display orders,
for example:

  [Timestamp] [Arabic 1, R-to-L] [English, L-to-R] [Arabic 2, R-to-L]
  [Arabic 2, R-to-L] [English, L-to-R] [Arabic 1, R-to-L] [Timestamp]

The Unicode Standard has a solution for this problem, but
unfortunately we did not have a text editor available that handles
Arabic Unicode.  Therefore, to keep the correspondence between
presentation and logical orders as straightforward as possible, we
adopted a strategy of inserting newlines wherever a change of
direction occurs.  The above text would thus be encoded as

    [Timestamp]
    [Arabic text 1]
    [English text]
    [Arabic text 2]
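
The newline strategy can be sketched in Python as follows.  This is
only an illustration, not the conversion program actually used: the
function name split_by_direction is hypothetical, and the sketch
assumes that any whitespace-delimited token containing a byte with the
8th bit set is right-to-left and that every other token is
left-to-right.

```python
from typing import List


def split_by_direction(line: bytes) -> List[bytes]:
    """Group whitespace-separated tokens into runs of one direction.

    Assumption: a token is right-to-left if any of its bytes has the
    8th bit set (an ISO 8859-6 Arabic code); otherwise it is
    left-to-right ASCII.
    """
    runs = []  # list of (is_rtl, [tokens]) pairs, in logical order
    for token in line.split():
        is_rtl = any(byte >= 0x80 for byte in token)
        if runs and runs[-1][0] == is_rtl:
            runs[-1][1].append(token)   # same direction: extend the run
        else:
            runs.append((is_rtl, [token]))  # direction change: new line
    return [b" ".join(tokens) for _, tokens in runs]


# One logical line becomes one output line per directional run.
logical = b"204.17 205.74 A: \xc7\xe4\xe3 okay \xc8\xc7"
encoded = b"\n".join(split_by_direction(logical))
```

Each element of the returned list corresponds to one line of the
encoded turn, in logical order.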


More detail (including some exceptional cases) is given below.


2.  Turn boundaries

A turn starts with a timestamp like the timestamps used in the
romanized transcripts, but it should be on a line by itself with no
following text.

        204.17 205.74 A:
        text line 1
        text line 2...

Each non-blank line following the timestamp is part of the turn.  The
turn is terminated by one or more blank lines, or by the end of the
file.
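
A reader for this turn format might look like the following Python
sketch.  The names TIMESTAMP and read_turns are hypothetical, and the
regular expression assumes that a timestamp line has exactly the shape
shown above (start time, end time, speaker label, colon); real files
may need a more permissive pattern.

```python
import re
from typing import Dict, List

# Assumed shape of a timestamp line: "<start> <end> <speaker>:" only.
TIMESTAMP = re.compile(rb"^(\d+\.\d+)\s+(\d+\.\d+)\s+(\w+):$")


def read_turns(data: bytes) -> List[Dict]:
    """Split the raw bytes of a transcript into turns."""
    turns = []
    current = None  # the turn being collected, or None between turns
    for raw in data.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line terminates the turn
            continue
        if current is None:
            match = TIMESTAMP.match(line)
            if match is None:
                raise ValueError("expected a timestamp line, got %r" % line)
            start, end, speaker = match.groups()
            current = {
                "start": float(start),
                "end": float(end),
                "speaker": speaker.decode("ascii"),
                "lines": [],
            }
            turns.append(current)
        else:
            current["lines"].append(line)  # text line, kept as raw bytes
    return turns
```

The text lines are kept as bytes so that the ISO 8859-6 codes and the
special symbols described below pass through unchanged.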


3.  Meta-characters

In several cases below we use the term "meta-character" to mean the
character produced by turning on the 8th bit of the corresponding
ASCII character.  For example, "meta-?" means an ASCII question mark
(octal 077, decimal 63, hex 3F) plus the 8th bit, yielding an ISO
8859-6 Arabic question mark (octal 277, decimal 191, hex BF).
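
The arithmetic can be checked with a few lines of Python.  The helper
name meta is hypothetical; the sketch relies only on Python's standard
iso8859_6 codec and unicodedata module.

```python
import unicodedata


def meta(ch: str) -> int:
    """Return the byte value of a 7-bit ASCII character with the 8th bit set."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("meta() expects a 7-bit ASCII character")
    return code | 0x80


question = meta("?")
# Octal, decimal and hex forms of the resulting byte.
print(oct(question), question, hex(question))
# Name of the character that this byte denotes in ISO 8859-6.
print(unicodedata.name(bytes([question]).decode("iso8859_6")))
```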


4.  Punctuation

The transcribers used punctuation relatively infrequently in the
transcripts.  Punctuation marks that are not special symbols (see
below) are limited to question marks and commas.  These were converted
into the corresponding ISO-Arabic codes, "meta-?" and "meta-comma".


5.  Special symbols

Most of the special symbols in the romanized transcripts mark sections
of the file that are not conventional, scoreable Arabic speech.  Thus
they were rendered in the Arabic script version as plain ASCII, with a
few modifications to make the text a bit easier to parse.

The following were passed as-is (and isolated from Arabic script text
by newlines if necessary):

    {text}              sound made by the talker

    [text]              sound not made by the talker (background or channel)

    [[text]]            comment

    <language text>     speech in another language

                        Even Arabic dialects, such as MSA and Upper,
                        were left as roman and not converted to script.

                        BUT see below for the handling of combinations like
                        "il+<English e-mail>".

    ((text))            unintelligible; text is best guess at transcription

                        The text was not translated to Arabic script,
                        BUT see below for the handling of combinations like
                        "il+(( ))".

    (( ))               unintelligible; can't even guess text

    <? (( ))>           unintelligible speech in unknown foreign language

    **text**            idiosyncratic word, not in common use

    -text               partial words
    text-               

Annotations using the following symbols were modified in most cases.
However, if they occurred _inside_ one of the above bracketed
categories (for example, "<English &mohamed //she just hanged ((the
phone))// >"), they were left as they were in the original romanized
transcript.

    #text#              simultaneous speech on the same channel

                        Converted to:
                        <overlap/> text </overlap>
                        with intervening newlines in the "text" as necessary
                        to accommodate directional changes.

    //text//            aside (talker addressing someone in background)

                        Converted to:
                        <aside/> text </aside>
                        with intervening newlines in the "text" as necessary
                        to accommodate directional changes.

    +text+              mispronounced word (spelled in usual orthography)

                        These were changed to have a single flag character
                        ("+word"), like other annotations that affect a
                        single word.  In addition, we used "meta-plus"
                        instead of an ASCII plus.  This character is
                        unassigned in ISO 8859-6, but is a "double less
                        than" symbol in ISO 8859-1.  The Mule editor can
                        display this as an Arabic (right-to-left)
                        character, so the ISO-to-Mule package (see below)
                        displays it as "double less than" (aka "left
                        chevron" or "left double quote").

    %text               non-lexeme

                        These were converted to _ASCII_ exclamation points,
                        because this character could be displayed in Mule
                        without making the ISO translation too strange.
                        These appear in logical order in the files, i.e.
                                !word
                        and appear at the beginning (right) of the word
                        when displayed in Mule, i.e.
                                drow!

    &text               used to mark proper names and place names

                        These were converted to "meta-semicolon", which is
                        an ISO Arabic semicolon character.  They appear in
                        the same order as "%" -> "!" above.

                        The romanized transcripts contain many cases in
                        which an "inseparable" article or preposition is
                        attached to a proper name, such as:
                                "il+&a$raf" "bi+&amAl" "bi+il+&cagami"
                        These are rendered in the script version with the
                        marker at the front of the entire word, e.g.
                                (meta-;)ila$raf (meta-;)biamAl

    text --             marks end of interrupted turn and continuation
    -- text             of same turn after interruption

                        Such hyphens were treated like incomplete words,
                        that is, they were separated from Arabic script by
                        newlines but placed on the same line as
                        left-to-right text such as incomplete words.
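
The handling of the single-word flags ("+text+", "%text" and "&text")
can be summarized in a short Python sketch.  It works on the romanized
token only and does not perform the conversion to Arabic script.  The
name flag_word is hypothetical, and the sketch assumes that "+" joints
inside a word are simply dropped, as in the proper-name examples
above.

```python
from typing import Optional, Tuple

META_PLUS = 0x2B | 0x80       # 0xAB: mispronounced word
META_SEMICOLON = 0x3B | 0x80  # 0xBB: proper name or place name
EXCLAMATION = 0x21            # plain ASCII "!": non-lexeme


def flag_word(token: str) -> Tuple[Optional[int], str]:
    """Return (flag byte or None, romanized word without annotation marks)."""
    if len(token) > 2 and token.startswith("+") and token.endswith("+"):
        # "+text+" becomes a single leading flag on the word.
        return META_PLUS, token[1:-1].replace("+", "")
    flag = None
    if "&" in token:
        flag = META_SEMICOLON
    elif "%" in token:
        flag = EXCLAMATION
    # The flag is hoisted to the front of the entire word, so the mark
    # and any "+" joints are removed from the word itself.
    word = token.replace("&", "").replace("%", "").replace("+", "")
    return flag, word


flag, word = flag_word("il+&a$raf")
# flag is META_SEMICOLON and word is "ila$raf", matching the example above.
```

In the script files the flag byte precedes the word in logical order,
so it is displayed at the right-hand end of the word.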


6.  Special cases

a. combined Arabic/non-Arabic tokens

The transcripts occasionally contain mixtures such as:

    il+(( ))
    li+(( ))
    il+((SiHHaB~))

    il+<MSA cumalAC>
    il+<MSA salAmu calaykum>

    il+<English e-mail>
    fa+<English already>
    il+<English green card>

These present a problem, since putting a newline between the Arabic
and non-Arabic would lose the information that in some sense they are
single tokens.  We tried to choose a fairly neutral compromise: we
isolated such cases on individual lines and converted the unbracketed
portion to Arabic script.  Users of the transcripts are then free to
do what they like with such cases.  Note that Mule displays these in
"backwards" order, with the Arabic on the left and the English on the
right, but the logical order in the file is correct.
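
Users who want to treat the two halves of such tokens separately could
start from a sketch like the one below, written against the romanized
forms listed above.  The names COMBINED and split_combined are
hypothetical, and the pattern covers only the "prefix+<...>" and
"prefix+((...))" shapes shown in the examples.

```python
import re
from typing import Optional, Tuple

# One or more "prefix+" pieces followed by a single bracketed portion.
COMBINED = re.compile(r"^((?:[^\s+<(]+\+)+)(<[^>]*>|\(\(.*\)\))$")


def split_combined(token: str) -> Optional[Tuple[str, str]]:
    """Return (Arabic prefix, bracketed portion), or None if not combined."""
    match = COMBINED.match(token)
    if match is None:
        return None
    prefix = match.group(1).rstrip("+")  # drop the joining "+"
    return prefix, match.group(2)


for example in ["il+(( ))", "il+<MSA cumalAC>", "fa+<English already>"]:
    print(split_combined(example))
```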

b. ambiguous spellings

As explained elsewhere in the documentation of this corpus, the
romanized text is occasionally ambiguous with respect to the Arabic
script.  There are several hundred entries in the lexicon with two or
more possible spellings in Arabic script, corresponding to several
thousand ambiguous lexical tokens in the transcripts.

These were handled as follows: First, we output the alternatives,
separated by "||" and placed on separate lines to prevent ambiguity of
turn order.  Then, the transcription staff searched for vertical bars
in the transcripts and edited them manually to choose the correct
alternative.
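
A simple check for alternatives that were left unresolved is to search
for the "||" separator.  The following Python sketch (the function
name is hypothetical) reports the line number and content of every
line that still contains it.

```python
from typing import List, Tuple


def unresolved_alternatives(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (1-based line number, line) for each line containing "||"."""
    return [
        (number, line)
        for number, line in enumerate(data.splitlines(), start=1)
        if b"||" in line
    ]
```

An empty result means that no vertical-bar separators remain in the
file.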


7.  MULE editor and viewing the text

We used an ISO-to-Mule converter graciously provided by TAKAHASHI
Naoto <ntakahas@etl.go.jp> to view and edit the Arabic script
transcripts in the MULE (multi-lingual emacs) editor.  With his
permission, we are including his converter in this distribution, with
a few alterations we made to facilitate its use with our transcripts.

Mule is freely available via anonymous FTP from various sites,
including ftp.cs.buffalo.edu and etlport.etl.go.jp.  The ISO-to-Mule
converter seems to work fine with Mule version 2.2 (1994) and version
2.3 (1995).
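
The script files can also be inspected outside Mule by decoding them
to Unicode.  The Python sketch below is one possible approach, not
part of this distribution.  It assumes Python's standard iso8859_6
codec, which rejects the "meta-plus" byte (hex AB) because that code
is unassigned in ISO 8859-6, so the sketch maps that byte to the ISO
8859-1 "double less than" character itself.  The function name
decode_script is hypothetical.

```python
def decode_script(raw: bytes) -> str:
    """Decode transcript bytes to Unicode, keeping "meta-plus" as U+00AB."""
    # Split around the unassigned byte, decode the rest as ISO 8859-6,
    # then rejoin with the ISO 8859-1 reading of 0xAB.
    pieces = raw.split(b"\xab")
    return "\u00ab".join(piece.decode("iso8859_6") for piece in pieces)


text = decode_script(b"\xab\xc7\xe4")
# ascii() shows the code points without needing an Arabic-capable terminal.
print(ascii(text))
```

The decoded string is in logical order; how it is displayed depends on
the bidirectional support of the viewing program.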