GALE Phase 4 Arabic NW Parallel Sentences

Authors: Zhiyi Song, Gary Krug, Stephanie Strassel

1.0 Introduction

This file contains documentation for GALE Phase 4 Arabic NW Parallel
Sentences. Along with other corpora, the parallel text in this release
comprised training data for Phase 4 of the DARPA GALE Program. This
corpus contains 62,669 words of Modern Standard Arabic (MSA) source
sentences and their corresponding English translations, selected from
Newswire (NW) data collected in GALE.

2.0 Package Structure

This package comprises two directories:

data/

  The data directory is divided into four branches: "source_full",
  "source_selected", "translation_selected", and "index". These contain
  the full source TDF files, the segments selected from the source, the
  translations of the selected segments, and the index files,
  respectively.

  source_full/           contains the full source documents from which
                         the sentences are selected, in TDF format

  source_selected/       contains the source sentences selected for
                         translation from each document, in TDF format

  translation_selected/  contains the translated sentences
                         corresponding to source_selected/, in TDF
                         format

  index/                 contains index files that indicate which
                         sentences are selected from each full
                         document. The first column lists file stems,
                         the second column lists segments in
                         source_selected, and the third column lists
                         the corresponding segments in source_full.

  File names refer to the data source, source language and collection
  date, e.g.

    {SRC}_{LNG}_YYYYMMDD_NNNN(.fileTypeExtension)

  where

    - {SRC} is the source ID (e.g., CNN, VOA, etc.)
    - {LNG} is the three-letter language ID defined in ISO 639-3.
      ARB is Standard Arabic; CMN is Chinese, Mandarin; ENG is English.
    - YYYYMMDD is the data collection (publish) date.
    - NNNN is a four-digit ID assigned to the file.

  File stems for source and translation file pairs are the same.
  Source full files use the .tdf extension. Source selected files use
  the sel.tdf extension, while translation files use the sel.eng.tdf
  extension.
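  As an illustration of the naming scheme described above, the
  following minimal Python sketch splits a file name into its four
  components. It is not part of the release tooling. The function
  name, the regular expression, and the example file name are
  assumptions made for illustration; if the files listed in
  docs/file_list.txt use different separators, the pattern would need
  to be adjusted.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, following the README scheme {SRC}_{LNG}_YYYYMMDD_NNNN.
# Anything after the four-digit ID (the file type extension) is ignored.
STEM_RE = re.compile(
    r"^(?P<src>[A-Za-z0-9]+)_(?P<lng>[A-Z]{3})_(?P<date>\d{8})_(?P<num>\d{4})"
)


class FileStem(NamedTuple):
    source: str    # source ID, e.g. a news agency abbreviation
    language: str  # three-letter ISO 639-3 code
    date: str      # collection (publish) date as YYYYMMDD
    file_id: str   # four-digit ID assigned to the file


def parse_stem(name: str) -> Optional[FileStem]:
    """Return the parsed components, or None if the name does not match."""
    match = STEM_RE.match(name)
    if match is None:
        return None
    return FileStem(match["src"], match["lng"], match["date"], match["num"])


if __name__ == "__main__":
    # Hypothetical file name, not taken from the corpus inventory.
    print(parse_stem("AFP_ARB_20080101_0001.tdf"))
```

  Because the source and translation files of a pair share a stem, the
  parsed components can serve as a key for matching the files in
  source_full/, source_selected/ and translation_selected/.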
docs/

  The docs directory contains documentation related to the release.

  docs/README.txt - this file
  docs/doc_list.txt - inventory of source and translation files with
      the token count for each file
  docs/file_list.txt - inventory of files in this release
  docs/GALE_Arabic_Translation_Guidelines_sentence_based_V2.pdf -
      translation guidelines
  docs/TDF_format.txt - TDF format description

3.0 Contents

This release includes 393 source-translation document pairs, comprising
62,669 words of Arabic source and their English translation. Data is
drawn from 6 distinct Arabic newswire (NW) sources.

The following table is a summary of data included in this corpus.

  source_lang  genre  files  sentences  source_tokens
  ---------------------------------------------------
  Arabic       NW       393      2,893         62,669

Token counts are expressed in terms of words (using the regular
expression \w+) and are taken from the source data.

The file docs/file_list.txt contains a complete list of files in the
package. The file docs/doc_list.txt contains the inventory of
translated documents with the source token count for each file.

3.1 TDF Format

Source data and translations are distributed in TDF format. TDF files
are tab-delimited text files containing one segment of text along with
meta information about that segment. Each field in the TDF file is
described in docs/TDF_format.txt. A source TDF file and its translation
are the same except that the transcript in the source TDF is replaced
by its English translation.

3.2 Encoding

All data are encoded in UTF-8.

4.0 Translation Pipeline

Newswire files were first automatically segmented into sentences. The
segmented sentences were then fed into the sentence-selection scripts
provided by IBM and SRI. LDC annotators reviewed the selected sentences
and rejected those that were not in the target language or dialect, had
formatting problems, or had entirely unsuitable content (e.g.,
commercial or religious material, scams).
LDC also ran duplicate detection on all sentences prepared for
translation and excluded any duplicate sentences from the translation
pipeline.

After review, the selected sentences from each document were grouped
into a selection file, which was reformatted into a human-readable
translation format and assigned to translation vendors under contract
to LDC. Full source documents with the selected sentences highlighted
were also provided to translators for context. Translators followed
LDC's GALE Translation guidelines, which describe the makeup of the
translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as
proper names and numbers), and quality control procedures applied to
completed translations. Translators used special markup to indicate
particular linguistic features, for instance partial words and typos in
the source text; this markup is described in the documentation
accompanying this release.

After translations were completed, bilingual LDC staff performed
quality control by selecting a proportional sample from each delivery
and scrutinizing it for several kinds of mistakes, as described in the
translation guidelines. Low-quality translations were returned to the
translators for revision. After quality control was complete, the
translation files were validated and reformatted into the release
format.

5.0 Sanity Checks

LDC performed the following corpus-wide checks and corrected all errors
found:

  -- Number of source segments matches number of translation segments
     for all files (except full source text)
  -- All non-blank source segments correspond to non-blank translation
     segments
  -- All translation files have a corresponding full source file,
     selected source file, and index file
  -- All files contain only UTF-8 encoded characters, although they may
     contain non-ASCII characters such as Western European characters
  -- Punctuation in translations is ASCII punctuation

6.0 Acknowledgement

This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content
of this publication does not necessarily reflect the position or the
policy of the Government, and no official endorsement should be
inferred.

7.0 Content Copyright

Portions © 2008 Agence France Presse, Al-Ahram, Al Hayat,
Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat,
© 2008, 2016 Trustees of the University of Pennsylvania

----
README Created 29 June, 2011 Gary Krug
       Updated 31 January, 2011 Zhiyi Song
       Updated 20 February, 2012 Stephanie Strassel
       Updated 20 February, 2012 Zhiyi Song
       Updated 1 June, 2012 Zhiyi Song