BOLT Discussion Forum Chinese Parallel Training Data

Authors: Zhiyi Song, Jennifer Garland, Christopher Walker, Stephanie Strassel

1.0 Introduction

This file contains documentation for BOLT Discussion Forum Chinese Parallel Training Data.

The DARPA BOLT (Broad Operational Language Translation) Program developed genre-independent machine translation and information retrieval systems. While earlier DARPA programs made significant strides in improving natural language processing capabilities in structured genres like newswire and broadcasts, BOLT was particularly concerned with improving translation and information retrieval performance for less-formal genres, with a special focus on user-contributed content.

LDC supported the BOLT Program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English (Garland et al. 2012; Song et al. 2014). The collected data was translated and richly annotated for a variety of tasks including word alignment, Treebanking, PropBanking, and co-reference. LDC also supported the evaluation of BOLT technologies by post-editing machine translation system output and assessing information retrieval system responses during annual evaluations conducted by NIST.

The parallel text in this release comprises Chinese training data from BOLT Phase 1. The source texts were selected from LDC2016T05 -- BOLT Chinese Discussion Forums (Tracey et al. 2016). This corpus contains 1,876,799 tokens of Chinese discussion forum (DF) data collected for BOLT, along with their corresponding English translations.

2.0 Package Structure

README.txt - this file

This package comprises two directories:

data/
    For each translated source document there is a pair of files in
    the release:

        /source/bolt-<src>-DF-<site>-<forum>-<thread>.<src>.su.xml
        /target/bolt-<src>-DF-<site>-<forum>-<thread>.<tgt>.su.xml

    where
        <src>    is cmn
        <site>   is a numeric ID associated with the web site
        <forum>  is a numeric ID associated with the forum
        <thread> is a numeric ID associated with the discussion thread
        <tgt>    is eng

docs/
    filestats.tab - inventory of source files for Chinese, including
                    a token count for each file (for Chinese, a token
                    is a character)
    filelist.txt - inventory of files in this release
    multipost_su.dtd - DTD for the .xml data files
    BOLT_Chinese_translation_guidelines_v2.6.pdf - Chinese translation
                    guidelines

3.0 Contents

This release includes Chinese source files and their English translations. The following table shows the data volume of this package (counted on the source side):

source lang   genre   files   sourceTokens   targetTokens
--------------------------------------------------------------
Chinese       DF      1541    1,876,779      1,557,873

Token counts are expressed in characters for the Chinese source and in words for the English target.

Each .xml file contains the set of specific posts (divided into sentence units) that were selected and translated for a given thread. The corresponding full threads can be found in the BOLT - Phase 1 Discussion Forums Chinese Source Data (LDC201XTXX). The filestems are consistent between the current release and the full-thread version in LDC201XTXX, though the file extension is different.

The file docs/filestats.tab contains the inventory of documents with the token count for each file.

3.1 XML Format

As noted above, each .xml file contains only the posts that were selected and translated for a given thread, divided into sentence units (SUs). Selected posts may be discontiguous (e.g. a file may contain posts 1 and 9-14). Typically all SUs for a given post are included, though SUs that contain only quoted material, for example, may have been excluded from translation.
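
Because source and target files share a filestem, the two directories can be paired programmatically. The following Python sketch is illustrative only and is not part of the release. It assumes that file names end in a language code followed by .su.xml (cmn for source files, eng for target files), and that data/ contains the source/ and target/ subdirectories described in section 2.0. The function names filestem and pair_files are hypothetical.

```python
from pathlib import Path


def filestem(path: Path) -> str:
    """Return the file name without its '.<lang>.su.xml' ending.

    Assumption: every data file name ends in a language code
    followed by '.su.xml', so the last three dot-separated
    fields are dropped.
    """
    return path.name.rsplit(".", 3)[0]


def pair_files(data_dir: Path, src_lang: str = "cmn", tgt_lang: str = "eng"):
    """Pair source and target files that share a filestem.

    Returns (paired, unpaired): a list of (source, target) path
    tuples sorted by filestem, and a sorted list of filestems
    that occur on only one side.
    """
    source = {
        filestem(p): p
        for p in (data_dir / "source").glob(f"*.{src_lang}.su.xml")
    }
    target = {
        filestem(p): p
        for p in (data_dir / "target").glob(f"*.{tgt_lang}.su.xml")
    }
    paired = [(source[s], target[s]) for s in sorted(source.keys() & target.keys())]
    unpaired = sorted(source.keys() ^ target.keys())
    return paired, unpaired
```

Calling pair_files(Path("data")) from the top of the package would return one tuple per translated document. A non-empty unpaired list would indicate a file without a counterpart, which the sanity checks in section 7.0 are meant to rule out for translation files.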

Each post within a thread is enclosed in its own pair of tags, as is each sentence unit. A multi-post ID is assigned which is identical to the filestem. Post IDs and SU IDs are also assigned. See docs/multipost_su.dtd for the tag names and further details.

3.2 Encoding

All data are encoded in UTF-8.

4.0 Translation Pipeline

A manual selection procedure was used to choose data appropriate for translation and distribution to BOLT. Selection criteria included linguistic features (is the file in Egyptian Arabic or Mandarin Chinese) and topic features (does the file contain current or dynamic events).

After selection, the selected posts were segmented into sentence units (SUs). The files were then reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's BOLT Translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features (such as names and special issues in discussion forum data), and quality control procedures applied to completed translations.

After translations were completed, bilingual LDC staff performed quality control by selecting a proportional sample from each delivery and scrutinizing it for several kinds of mistakes, as described in the translation guidelines. Low quality translations were returned to the translators for revision.

After quality control was complete, translation files were validated and reformatted into the release format.

5.0 Translation mark-up

Below are the translation tags that we use to mark certain features in the target translation:

Translation:
    - translation alternatives    [intended meaning | literal meaning]
    - correction of typo          =text
    - best guess translation      ((text))

6.0 Notes

For some posts, su1 and su2 contain duplicate content. This is due to subject lines being erroneously included in the translation set.
We have made an effort to exclude this type of duplication in this release. Similarly, there is some duplication of sentences across multiple posts in a thread. We have made an effort to remove identical sentences within a post prior to translating the data in this release.

7.0 Sanity Checks

LDC performed the following corpus-wide checks and corrected all errors found:

-- Number of source segments matches number of translation segments
   for all files
-- All non-blank source segments correspond to non-blank translation
   segments
-- All translation files have a corresponding source file
-- All files contain only UTF-8 encoded characters, although they may
   contain non-ASCII characters such as Western European characters
-- Punctuation in translations is ASCII punctuation

8.0 Acknowledgement

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

9.0 Copyright

Portions © 2017 Trustees of the University of Pennsylvania

10.0 References

Jennifer Garland, Stephanie Strassel, Safa Ismael, Zhiyi Song, Haejoong Lee
Linguistic Resources for Genre-Independent Language Technologies: User-Generated Content in BOLT
LREC 2012: 8th International Conference on Language Resources and Evaluation, Istanbul, May 21-27

Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas Thomas, Brendan Callahan, Ann Sawyer
Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus
LREC 2014: 9th Edition of the Language Resources and Evaluation Conference, Reykjavik, May 26-31

Jennifer Tracey, et al.
BOLT Chinese Discussion Forums LDC2016T05. Web Download.
Philadelphia: Linguistic Data Consortium, 2016.

11.0 Contact

Zhiyi Song          zhiyi@ldc.upenn.edu      Project Manager
Jennifer Tracey     garjen@ldc.upenn.edu     Project Manager
Stephanie Strassel  strassel@ldc.upenn.edu   BOLT PI

--
README Created Zhiyi Song, July 31, 2014
       Updated Zhiyi Song, April 18, 2016
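
Example: removing translation mark-up

Users who want plain English text on the target side may wish to resolve the translation mark-up listed in section 5.0. The Python sketch below is illustrative only and is not part of the release. It assumes the three markers appear exactly as listed there: alternatives as [intended | literal], typo corrections as an = sign directly before a word, and best guesses as ((text)). The function name strip_markup and its keep parameter are hypothetical, and the translation guidelines in docs/ remain the authority on how the markers are actually used.

```python
import re

# [intended meaning | literal meaning]
ALTERNATIVE = re.compile(r"\[([^\[\]|]*)\|([^\[\]|]*)\]")
# ((best guess translation))
BEST_GUESS = re.compile(r"\(\((.*?)\)\)")
# '=' at the start of a token, assumed to mark a corrected typo
TYPO = re.compile(r"(?<!\S)=(?=\S)")


def strip_markup(text: str, keep: str = "intended") -> str:
    """Return text with translation mark-up resolved.

    keep selects which side of an alternative survives:
    "intended" (the first part) or "literal" (the second part).
    """
    group = 1 if keep == "intended" else 2
    text = ALTERNATIVE.sub(lambda m: m.group(group).strip(), text)
    text = BEST_GUESS.sub(lambda m: m.group(1), text)
    text = TYPO.sub("", text)
    return text
```

An = sign that is surrounded by spaces or sits inside a token (as in "a = b" or "x=y") is left untouched, since only a token-initial = is treated as a typo marker in this sketch.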