This data set consists of eight text files containing transcripts of Voice of America satellite radio news broadcasts in Arabic. The broadcasts were recorded by the Linguistic Data Consortium at transmission time between June 2000 and January 2001.
Six broadcasts are 60 minutes long, and two broadcasts are 120 minutes long. The file names indicate the date (YYYYMMDD) and the begin and end times (HHMM EST) of the original transmission. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201.
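As a rough illustration of how the date and time fields in a file name could be read, here is a minimal Python sketch. The exact layout of the file names (separators, extension, any extra fields) is not specified above, so the sample names and the pattern below are assumptions: the code only expects an eight-digit date followed by two four-digit times. EST is modeled as a fixed UTC-5 offset, and an end time earlier than the begin time is assumed to mean the broadcast ran past midnight.

```python
import re
from datetime import datetime, timedelta, timezone

# EST modeled as a fixed offset of UTC-5 (no daylight-saving adjustment).
EST = timezone(timedelta(hours=-5), "EST")

# Assumed layout: YYYYMMDD, then begin HHMM, then end HHMM, with any
# non-digit separators in between. The real file names may differ.
NAME_PATTERN = re.compile(r"(\d{8})\D+(\d{4})\D+(\d{4})")


def parse_broadcast_name(file_name):
    """Return (start, stop) datetimes parsed from a broadcast file name."""
    match = NAME_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no date/time fields found in {file_name!r}")
    date, begin, end = match.groups()
    start = datetime.strptime(date + begin, "%Y%m%d%H%M").replace(tzinfo=EST)
    stop = datetime.strptime(date + end, "%Y%m%d%H%M").replace(tzinfo=EST)
    if stop <= start:
        # Assumption: the broadcast crossed midnight.
        stop += timedelta(days=1)
    return start, stop


# Hypothetical file name, used only to exercise the function.
start, stop = parse_broadcast_name("20000615_1800_1900.txt")
duration_minutes = int((stop - start).total_seconds() // 60)
```

The computed duration can serve as a sanity check against the stated broadcast lengths of 60 and 120 minutes.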
The files are encoded entirely in ASCII; the Arabic text content is rendered in Buckwalter transliteration. Time alignment and structural markup are rendered via "pseudo-SGML" tags, which are presented one tag per line, with the first character of each tag line being an open angle bracket ("<").
The lines of transcription text (i.e. the speech and annotation content between the time-stamp tags) all begin with a single space character and present exactly one token per line. A "token" may be a spoken Arabic word, a punctuation mark, or a single Arabic word enclosed by "(%" and ")", which represents an annotation of a non-speech condition or event (e.g. "music", "noise", "laugh", etc.).
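The line conventions described above lend themselves to a simple line-by-line reader. The following Python sketch is one possible way to separate tags, spoken tokens, and non-speech annotations; it is not a tool distributed with the corpus. The tag name and the transliterated words in the sample are invented for illustration, since the actual tag inventory is not listed here.

```python
import re

# An annotation token is a single word enclosed by "(%" and ")".
ANNOTATION = re.compile(r"^\(%(.+)\)$")


def parse_transcript(lines):
    """Classify each line as a tag, a spoken token, or an annotation."""
    items = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("<"):
            # Markup line: one pseudo-SGML tag per line.
            items.append(("tag", line))
        elif line.startswith(" "):
            # Transcription line: one leading space, then exactly one token.
            token = line[1:]
            match = ANNOTATION.match(token)
            if match:
                items.append(("annotation", match.group(1)))
            elif token:
                items.append(("token", token))
        # Any other line (e.g. an empty one) is skipped in this sketch.
    return items


# Hypothetical sample: the tag name and the words are made up.
sample = [
    "<sample_tag>",
    " mrHbA",
    " (%mwsyqY)",
    " .",
    "</sample_tag>",
]
parsed = parse_transcript(sample)
```

Punctuation marks are returned as ordinary tokens here; a caller that needs to distinguish them from words would have to add its own check.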
For an example of the data contained in this corpus, please examine this screenshot of the transcription.
Portions © 2000, 2001, 2002, 2005, 2006 Trustees of the University of Pennsylvania