Arabic Broadcast News Transcripts
=================================================

This data set consists of eight text files containing transcripts for Voice of America (VOA) satellite radio news broadcasts in Arabic. The broadcasts were recorded by the Linguistic Data Consortium (LDC) at transmission time between June 2000 and January 2001. Six broadcasts are 60 minutes long, and two broadcasts are 120 minutes long. The file names indicate the date (YYYYMMDD) and the begin and end times (HHMM EST) of the original transmission.

This work was sponsored in part by National Science Foundation Grant No. IIS-9982201.

The character encoding is entirely ASCII: Buckwalter transliteration is used for rendering the Arabic text content. Time alignment and structural markup are rendered via "pseudo-SGML" tags, which are presented one tag per line, with the first character of the line being an open angle bracket.

The lines of transcription text (i.e. the speech and annotation content between the time-stamp tags) all begin with a single space character, and present exactly one token per line. A "token" may be a spoken Arabic word, a punctuation mark, or a single Arabic word enclosed by "(%" and ")", which represents an annotation of a non-speech condition or event (e.g. "music", "noise", "laugh", etc.).

For token lines containing spoken words, there are three tab-delimited fields for each token:

- The first is the "consonant skeleton" orthographic form of the word, equivalent to what one would typically find in an Arabic newspaper (i.e. lacking short vowels and other diacritic marks).
- The second is the "vowelized" or "diacritized" orthographic form, as derived from the Buckwalter Morphological Analyzer, for the given skeletal form.
- The third is the morphological analysis for the word, in which the morpheme boundaries are marked by "+" (plus-sign), and each morpheme is represented by both its vowelized orthography and its part-of-speech tag; these two components are separated by "/" (slash) within each "+"-delimited morpheme string.

The second and third fields represent the "contextualized" analysis for the skeletal form; that is, the morphological and semantic ambiguity of each skeletal orthographic form has been resolved (disambiguated) based on manual judgment of each token in context.

Across all transcripts, there are 29 spoken word tokens (26 distinct forms) that lack the second and third fields; these are generally forms that occur elsewhere in the corpus, but for which disambiguation was not completed. Also, there are 331 tokens (270 distinct forms) in which the vowelization and morphological analysis essentially failed, leaving the vowelized form identical to the skeletal form, and a morphological "analysis" of either "NO_FUNC" (251 tokens, 197 forms) or "TRANSERR" (80 tokens, 73 forms). Many of these are due to speakers using dialectal Arabic forms rather than Modern Standard Arabic (MSA), a common situation during interview segments.

Note that the initial character of each line in the transcript file (either an angle bracket or a space) is an important cue for handling the transcript content correctly: angle brackets, which are placed around the time-stamp markers, are also used to represent Arabic characters in the Buckwalter transliteration, and many word tokens begin with an open angle bracket.

The time-stamp annotations mark story boundaries, turn boundaries within stories, and "phrase" boundaries within turns (where "phrase" typically relates to a breath group, not always a syntactic boundary). For each story and turn boundary, the tag also includes information about the speaker.

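As a rough illustration of the line format described above, the following Python sketch (not part of this release) separates tag lines from token lines by their first character and splits the token fields. The names "Token", "parse_analysis" and "parse_line" are hypothetical, and the sample lines, including the tag name and the part-of-speech labels, are made up for illustration rather than taken from the corpus. Treating an analysis with no slash (such as a bare "NO_FUNC") as a tag with an empty form is an assumption.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    skeleton: str                      # field 1: unvowelized form
    vowelized: Optional[str]           # field 2: diacritized form, if present
    morphemes: List[Tuple[str, str]]   # field 3: (form, tag) pairs, if present


def parse_analysis(analysis):
    """Split 'form/TAG+form/TAG' into a list of (form, tag) pairs."""
    pairs = []
    for piece in analysis.split("+"):
        form, slash, tag = piece.partition("/")
        if slash:
            pairs.append((form, tag))
        else:
            # Assumption: a piece without a slash is a bare marker.
            pairs.append(("", piece))
    return pairs


def parse_line(line):
    """Return ('tag', text), ('token', Token), or None for other lines."""
    line = line.rstrip("\r\n")
    if line.startswith("<"):
        # Markup line: the angle bracket is the very first character.
        return ("tag", line)
    if line.startswith(" "):
        # Token line: drop the leading space, then split on tabs.
        fields = line[1:].split("\t")
        vowelized = fields[1] if len(fields) > 1 else None
        morphemes = parse_analysis(fields[2]) if len(fields) > 2 else []
        return ("token", Token(fields[0], vowelized, morphemes))
    return None


# Made-up sample lines (not from the corpus).
sample = [
    '<hypothetical_tag time="0.0">',
    " <lY\t<ilaY\t<ilaY/PREP",
    " wktb\twakataba\twa/CONJ+katab/VERB+a/SUFF",
    " ktb",
]
parsed = [parse_line(text) for text in sample]
for item in parsed:
    print(item)
```

Checking the first character before anything else keeps a token such as "<lY", which starts with an angle bracket after the leading space, from being mistaken for a tag.
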
We believe that the identification of speakers within a given transcript file is consistent over the full duration of each file. However, there has been little or no attempt to consistently identify the same speaker across transcripts.

The "perl" directory contains a script (pick_tr_layer.pl) that can be used to derive less complex, more legible/useful versions of the transcript files. If you have a current version of Perl installed (5.8.1 or later), you can run "perldoc pick_tr_layer.pl" to display the user documentation for the script.

The "Transcriber" annotation tool, developed by DGA (authored by Edouard Geoffrois and Claude Barras and available from the LDC web site), supports the file format produced by "pick_tr_layer.pl", provided that you apply the script's "-enc dga" option to adjust the transliteration. (The transliteration used by Transcriber differs from Buckwalter in a few significant details.)

There are also two sub-directories within "perl", containing module and installation files that can be used to enable character encoding conversions between Buckwalter, UTF-8, and the DGA/Transcriber transliteration. For each module, there is a ".ucm" file (in the respective "Encode" sub-directories) that provides a fully detailed list of the mapping between the transliteration and the Unicode Arabic character set. The "pick_tr_layer.pl" script provides a simple demonstration of how to use these encoding modules, and you can run "perldoc" on the module files (Buckwalter.pm, DGAtransar.pm) to see their respective user documentation.

The "docs" directory contains a single file, "unigrams.txt", which is simply a tab-delimited table of distinct word forms and analyses, together with their frequencies of occurrence in the corpus.

Speech files for these recordings are available as a separate corpus from the LDC: Arabic Broadcast News Speech, LDC2006S46.

ldc@ldc.upenn.edu