BOLT Arabic Discussion Forum Parallel Training Data

Authors: Zhiyi Song, Jennifer Tracey, Christopher Walker, Stephanie Strassel

1.0 Introduction

This file contains documentation for BOLT Discussion Forum Arabic Parallel Training Data. The parallel text in this release comprised training data for Phase 1 of the DARPA BOLT Program. This corpus contains the Egyptian Arabic source text and corresponding English translations, totaling 1,169,599 source tokens, selected from the Arabic Discussion Forum (DF) data collected in BOLT.

2.0 Package Structure

README.txt - this file

This package comprises two directories:

data/
    For each translated source document there is a pair of files in the release:

        /source/bolt--DF---..su.xml
    and
        /target/bolt--DF---..su.xml

    where
         is arz
         is a numeric ID associated with the web site
         is a numeric ID associated with the forum
         is a numeric ID associated with the discussion thread
         is eng

docs/
    filestats.tab - inventory of source files for Arabic, including a token count for each file (for Arabic, a token is a word)
    filelist.txt - inventory of files in this release
    multipost_su.dtd - DTD
    BOLT_Arabic_translation_guidelines_v1.9.pdf - Arabic translation guidelines

3.0 Contents

The release includes both Arabic source files and their English translations. The following table shows the data volume of this package (counted on the source side):

    source lang   genre   files   source tokens
    ------------------------------------------------
    Arabic        DF      2651    1,169,599

Token counts are expressed in terms of words in Arabic.

Each .xml file contains the set of specific posts (divided into sentence units) that were selected and translated for a given thread. The corresponding full threads can be found in BOLT Arabic Discussion Forums (LDC2018T10). The filestems are consistent between the current release and the full-thread version in LDC2018T10, though the file extension is different.

The file docs/filestats.tab contains the inventory of documents with the token count for each file.
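As an illustration of how the per-file token counts could be totaled and compared against the figures in the table above, here is a minimal Python sketch. It is not part of the release. The column layout of filestats.tab is not described in this README, so the sketch assumes tab-separated rows with the token count in the last column; the function name summarize_filestats and the sample rows are hypothetical.

```python
import csv
import io


def summarize_filestats(lines, count_column=-1):
    """Return (number of files, total tokens) from tab-separated rows.

    Assumptions (not confirmed by the README): one row per file, with the
    token count in `count_column`. Rows whose count field is not a plain
    integer (for example a header row) are skipped.
    """
    n_files = 0
    total = 0
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # ignore blank lines
        field = row[count_column].replace(",", "").strip()
        if not field.isdigit():
            continue  # header or malformed row
        n_files += 1
        total += int(field)
    return n_files, total


# Made-up rows for illustration only; real rows come from docs/filestats.tab.
sample = io.StringIO("file\ttokens\nfile_a\t120\nfile_b\t80\n")
print(summarize_filestats(sample))
```

On the real inventory, one would pass an open file handle for docs/filestats.tab (opened with encoding="utf-8" and newline="") and compare the result with the 2651 files and 1,169,599 tokens reported above. If the actual column order differs, count_column would need to be adjusted.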

3.1 XML Format

Each .xml file contains the set of specific posts (divided into sentence units) that were selected and translated for a given thread. Selected posts may be discontiguous (e.g. a file may contain posts 1 and 9-14). Typically all SUs for a given post are included, though SUs that contain only quoted material, for example, may have been excluded from translation.

Posts within each thread are surrounded by tags. Sentence units are surrounded by tags. A multi-post ID is assigned which is identical to the filestem. Post IDs and SU IDs are also assigned. See docs/multipost_su.dtd for more details.

3.2 Encoding

All data are encoded in UTF-8.

4.0 Translation Pipeline

A manual selection procedure was used to choose data appropriate for translation and distribution to BOLT. Selection criteria included linguistic features (is the file in Egyptian Arabic or Mandarin Chinese) and topic features (does the file contain current or dynamic events).

After selection, the selected posts were segmented into sentence units (SUs). The files were then reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's BOLT Translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and special issues in discussion forum data), and quality control procedures applied to completed translations.

After translations were completed, bilingual LDC staff performed quality control by selecting a proportional sample from each delivery and scrutinizing it for several kinds of mistakes, as described in the translation guidelines. Low-quality translations were returned to the translators for revision. After quality control was complete, translation files were validated and reformatted into the release format.
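
To illustrate how the post and SU layout described in section 3.1 might be read, the following Python sketch pairs the sentence units of a source file with those of its translation. It is a sketch under stated assumptions, not LDC tooling. The element names "post" and "su", the attribute name "id", the root element "multipost", and the helper names read_sus and align are all hypothetical; the authoritative names are in docs/multipost_su.dtd.

```python
import xml.etree.ElementTree as ET


def read_sus(root, post_tag="post", su_tag="su", id_attr="id"):
    """Return (post id, SU id, text) tuples in document order.

    The tag and attribute names are assumptions; check them against
    docs/multipost_su.dtd before use.
    """
    rows = []
    for post in root.iter(post_tag):
        for su in post.iter(su_tag):
            text = "".join(su.itertext()).strip()
            rows.append((post.get(id_attr), su.get(id_attr), text))
    return rows


def align(source_root, target_root, **names):
    """Pair source and target SUs by position, refusing mismatched counts."""
    src = read_sus(source_root, **names)
    tgt = read_sus(target_root, **names)
    if len(src) != len(tgt):
        raise ValueError(f"{len(src)} source SUs but {len(tgt)} target SUs")
    return [(s[2], t[2]) for s, t in zip(src, tgt)]


# Tiny made-up documents using the hypothetical element names.
source_doc = ET.fromstring(
    '<multipost id="x"><post id="p1">'
    '<su id="su1">مرحبا</su><su id="su2">شكرا</su>'
    "</post></multipost>"
)
target_doc = ET.fromstring(
    '<multipost id="x"><post id="p1">'
    '<su id="su1">Hello</su><su id="su2">Thanks</su>'
    "</post></multipost>"
)
for arabic, english in align(source_doc, target_doc):
    print(arabic, "|", english)
```

For files on disk, ET.parse(path).getroot() would supply the root elements. Pairing by position relies on the source and target files listing the same SUs in the same order; pairing on the post and SU IDs instead would be a reasonable alternative if that ordering cannot be assumed.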

5.0 Translation mark-ups

Below are the translation tags that we use to mark certain features in the target translation:

Translation:
    - translation alternatives    [intended meaning | literal meaning]
    - correction of typo          =text
    - best guess translation      ((text))

6.0 Notes

For some posts, su1 and su2 contain duplicate content. This is due to subject lines being erroneously included in the translation set. We have made an effort to exclude this type of duplication in this release.

Similarly, there is some duplication of sentences across multiple posts in a thread. We have made an effort to remove identical sentences within a post prior to translating the data in this release.

7.0 Sanity Checks

LDC performed the following corpus-wide checks and corrected all errors found:

-- Number of source segments matches number of translation segments for all files
-- All non-blank source segments correspond to non-blank translation segments
-- All translation files have a corresponding source file
-- All files contain only UTF-8 encoded characters, although they may contain non-ASCII characters such as Western European characters
-- Punctuation in translations is ASCII punctuation

8.0 Acknowledgement

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

--
README created by Zhiyi Song, July 31, 2014