GALE Phase 3 and 4 Arabic BN Parallel Text Part 2

Authors: Zhiyi Song, Gary Krug, Stephanie Strassel

1.0 Introduction

This file contains documentation for GALE Phase 3 and 4 Arabic BN
Parallel Text Part 2. Along with other corpora, the parallel text in
this release comprised training data for Phase 3 and 4 of the DARPA
GALE Program. This corpus contains Modern Standard Arabic (MSA) source
text and corresponding English translations for 166,081 tokens,
selected from Broadcast News (BN) data collected and transcribed in
GALE. The audio corresponding to the source files in this release is
distributed separately.

2.0 Package Structure

This package comprises two directories:

data/

  The data directory is divided into the "source" directory and the
  "translation" directory. The "source" directory contains files in
  the source language. The "translation" directory contains the
  translated files.

  File names encode the data source, program, source language, and
  broadcast date and start time, in the form

    {SRC}_{PRG}_{LNG}_YYYYMMDD_hhmmss(.fileTypeExtension)

  where

    - {SRC} is the source ID (e.g., CNN, VOA)
    - {PRG} is the program ID (e.g., LARRYKING)
    - {LNG} is the three-letter language ID defined in ISO 639-3.
      ARB is Standard Arabic; CMN is Chinese, Mandarin; ENG is English.
    - YYYYMMDD is the data collection (broadcast) date.
    - hhmmss is the start time of the program (hh is the hour in
      24-hour format)

  File stems for source and translation file pairs are the same.
  Source files use the .tdf extension, while translation files use
  the .eng.tdf extension.

docs/

  The docs directory contains documentation related to the release.
  docs/README.txt - this file
  docs/doc_list.txt - inventory of source and translation files with
      the token count for each file
  docs/file_list.txt - inventory of files in this release
  docs/GALE_Arabic_Translation_Guidelines_V2_7.pdf - translation
      guidelines
  docs/GALE_TranscriptionTranslationMarkup_V25.xls - explanation of
      special symbols
  docs/program_summary.txt - audio programs in this release
  docs/TDF_format.txt - TDF format description

3.0 Contents

This release includes 45 source-translation document pairs, comprising
166,081 words of translated data. Data is drawn from 23 distinct
Arabic broadcast news (BN) sources.

The following table summarizes the files in this release by data
source.

  Source       Program         Epoch    Tokens
  ABUDHABI     ABUDHNEWS2      2007.02    2954
  ABUDHABI     ABUDHNEWS       2007.01    5569
  ABUDHABI     NEWSHOUR        2008.04    5394
  ALAM         NEWSRPT         2007.01    6572
  ALAM         NEWSRPT         2007.01    6113
  ALBAGHDADYA  BAGHDADYANEWS   2008.04    3389
  ALHURRA      THEWORLDNOW     2008.01    3203
  ALURDUNYA    URDUNYANEWS     2007.03    1794
  ARABIYA      ALARABIYANEWS2  2007.03    2762
  ARABIYA      ALARABIYANEWS2  2007.03    2841
  ARABIYA      LATEHRNEWS      2007.02    2064
  ARABIYA      PANORAMA        2007.02    6165
  ARABIYA      PANORAMA        2007.03    5907
  ARABIYA      PANORAMA        2007.03    6157
  ARABIYA      PANORAMA        2007.03    5923
  ARABIYA      PANORAMA        2008.03    6436
  DUBAI        DUBAINEWS2      2007.02    2821
  DUBAI        DUBAINEWS2      2007.03    2734
  DUBAI        DUBAINEWS2      2007.03    2983
  IRAQIYAH     ECONRPT         2007.01    1971
  IRAQIYAH     ECONRPT         2007.01    1423
  IRAQIYAH     ECONRPT         2007.02    1572
  IRAQIYAH     IRAQINEWS       2008.03    3528
  IRAQIYAH     IRAQINEWS       2008.03    2187
  IRAQIYAH     IRAQINEWS       2008.03    3245
  IRAQIYAH     IRAQINEWS       2008.03    3696
  IRAQIYAH     IRAQTDY         2007.03    1813
  KUWAITTV     NEWS            2007.01    2659
  KUWAITTV     NEWS            2007.02    3178
  LBC          NEWS            2007.01    5086
  LBC          NEWS            2007.02    3353
  OMANTV       NEWS            2008.03    3021
  SAUDITV      SAUDINEWS2      2008.03    5174
  SAWA         SAWANEWS        2008.01     641
  SAWA         SAWANEWS        2008.02    1897
  SAWA         SAWANEWS        2008.02    2720
  SAWA         SAWANEWS        2008.03     737
  SCOLA        JORDNNSCO       2007.01    3386
  SCOLA        JORDNNSCO       2007.03    2999
  SCOLA        SAUDNNSCO       2007.02    2977
  SYRIANTV     NEWS25          2007.01    4851
  SYRIANTV     NEWS25          2007.01    6899
  SYRIANTV     NEWS25          2007.02    5697
  SYRIANTV     NEWS25          2007.02    5247
  SYRIANTV     NEWS25          2007.02    4343

Token counts are expressed in terms of words for Arabic (using the
regular expression \w+) and are taken from the source data.

The file docs/file_list.txt contains a complete list of files in the
package. The file docs/doc_list.txt contains the inventory of
documents with the token count for each file.

3.1 TDF Format

TDF files are tab-delimited text files in which each row contains one
segment of text along with metadata about that segment. Each field in
the TDF file is described in docs/TDF_format.txt.

A source TDF file and its translation are the same except that the
transcript in the source TDF is replaced by its English translation.

3.2 Encoding

All data are encoded in UTF-8.

4.0 Translation Pipeline

A manual selection procedure was used to choose data appropriate for
translation and distribution to GALE. Selection criteria included
linguistic features (is the file in MSA?), transcription features (is
the transcription good enough to produce a viable translation?) and
topic features (does the file contain news, current events or human
interest topics?).

Before audio files can be translated, they must be transcribed. The
files in this release were transcribed by LDC staff and/or
transcription vendors under contract to LDC. In addition to producing
a verbatim transcript, transcribers also indicated sentence
boundaries. Sentence boundaries and overall transcript quality were
verified by LDC staff before files were sent out for translation.

After transcription and SU annotation, files were reformatted into a
human-readable translation format and were assigned to professional
translators for careful translation.
Translators followed LDC's GALE Translation guidelines, which describe
the makeup of the translation team, the source data format, the
translation data format, best practices for translating certain
linguistic features (such as names and speech disfluencies), and
quality control procedures applied to completed translations.

Transcribers and translators used special markup to indicate
particular linguistic features, for instance unintelligible speech,
partial words and typos in the transcript. These conventions are
described in the documentation accompanying this release (see
docs/GALE_TranscriptionTranslationMarkup_V25.xls).

After translations were completed, bilingual LDC staff performed
quality control by selecting a proportional sample from each delivery
and scrutinizing it for several kinds of mistakes, as described in the
translation guidelines. Low-quality translations were returned to the
translators for revision.

After quality control was complete, translation files were validated
and reformatted into the release format.

5.0 Sanity Checks

LDC performed the following corpus-wide checks and corrected all
errors found:

  -- Number of source segments matches number of translation segments
     for all files (except full source text)
  -- Timestamps are identical between selected source and translation
  -- All non-blank source segments correspond to non-blank translation
     segments
  -- All translation files have a corresponding full source file,
     selected source file, and index file
  -- All files contain only UTF-8 encoded characters, although they
     may contain non-ASCII characters such as Western European
     characters
  -- Punctuation in translations is ASCII punctuation

6.0 Acknowledgement

This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content
of this publication does not necessarily reflect the position or the
policy of the Government, and no official endorsement should be
inferred.

7.0 Content Copyright

[To be supplied by Publications/IPR]

----
README Created 29 June, 2011 Gary Krug
Updated 31 January, 2011 Zhiyi Song
Updated 20 February, 2012 Stephanie Strassel
Updated 30 April, 2012 Zhiyi Song
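
Note on file names: as an illustration of the naming convention in
section 2.0 and the file-pairing check in section 5.0, the following
Python sketch splits a file name into its fields and reports file
stems that lack either a source or a translation file. It is not the
validation code LDC used. The function names are hypothetical, and the
sketch assumes that source and program IDs contain no underscores and
that translation files end in .eng.tdf while source files end in .tdf.

```python
import re
from collections import defaultdict

# {SRC}_{PRG}_{LNG}_YYYYMMDD_hhmmss plus an extension.
# Assumptions (not confirmed by the release documentation): SRC and PRG
# contain no underscores; translations end in ".eng.tdf", sources in ".tdf".
NAME_RE = re.compile(
    r"^(?P<src>[^_]+)_(?P<prg>[^_]+)_(?P<lng>[A-Z]{3})_"
    r"(?P<date>\d{8})_(?P<time>\d{6})(?P<ext>\.eng\.tdf|\.tdf)$"
)


def parse_name(filename):
    """Return the name fields as a dict, or None if the name does not fit."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # The stem is everything before the extension; pairs share the same stem.
    fields["stem"] = filename[: -len(fields["ext"])]
    fields["is_translation"] = fields["ext"] == ".eng.tdf"
    return fields


def unpaired_stems(filenames):
    """Return sorted stems that are missing a source or a translation file."""
    kinds_by_stem = defaultdict(set)
    for name in filenames:
        fields = parse_name(name)
        if fields is None:
            continue  # skip names outside the convention, e.g. README.txt
        kind = "translation" if fields["is_translation"] else "source"
        kinds_by_stem[fields["stem"]].add(kind)
    complete = {"source", "translation"}
    return sorted(s for s, kinds in kinds_by_stem.items() if kinds != complete)
```

Calling unpaired_stems() on the base names found under data/source and
data/translation should return an empty list if every document has both
files, under the assumptions stated above.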