BOLT Discussion Forum Chinese Parallel Training Data
Authors: Zhiyi Song, Jennifer Garland, Christopher Walker, Stephanie
Strassel


1.0 Introduction

This file contains documentation for BOLT Discussion Forum Chinese Parallel
Training Data.

The DARPA BOLT (Broad Operational Language Translation) Program developed
genre-independent machine translation and information retrieval
systems. While earlier DARPA programs made significant strides in improving
natural language processing capabilities in structured genres like newswire
and broadcasts, BOLT was particularly concerned with improving translation
and information retrieval performance for less-formal genres with a special
focus on user-contributed content. 

LDC supported the BOLT Program by collecting informal data sources --
discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic
and English (Garland et al. 2012; Song et al. 2014). The collected data
was translated and richly annotated for a variety of tasks including word 
alignment, Treebanking, PropBanking, and co-reference. LDC supported the 
evaluation of BOLT technologies by post-editing machine translation system 
output and assessing information retrieval system responses during annual 
evaluations conducted by NIST. 

The parallel text in this release comprises Chinese training data from BOLT
Phase 1. The source texts were selected from LDC2016T05 -- BOLT Chinese
Discussion Forums (Tracey et al. 2016). This corpus contains 1,876,799 
tokens of Chinese discussion forum (DF) data collected for BOLT, along 
with their corresponding English translations. 

2.0 Package Structure

README.txt - this file

This package comprises two directories:

data/
For each translated source document there is a pair of files in the
release:

  /source/bolt-<sourcelang>-DF-<site_id>-<group_id>-<thread_id>.<sourcelang>.su.xml
and
  /target/bolt-<sourcelang>-DF-<site_id>-<group_id>-<thread_id>.<targetlang>.su.xml

where
    <sourcelang> is cmn
    <site_id>    is a numeric ID associated with the web site
    <group_id>   is a numeric ID associated with the forum
    <thread_id>  is a numeric ID associated with the discussion thread
    <targetlang> is eng
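
The naming scheme above can be handled programmatically. The following is
a minimal Python sketch, not part of the release: the function names, the
regular expression, and the example IDs (123, 456, 789) are illustrative
assumptions, as is the data/source and data/target directory layout.

```python
import re
from pathlib import PurePosixPath

# Regex mirroring the naming scheme described above. The pattern itself is
# an illustration and assumes three-letter language codes and numeric IDs.
NAME_RE = re.compile(
    r"^bolt-(?P<sourcelang>[a-z]{3})-DF-(?P<site_id>\d+)-(?P<group_id>\d+)"
    r"-(?P<thread_id>\d+)\.(?P<lang>[a-z]{3})\.su\.xml$"
)


def parse_name(filename):
    """Split a release filename into its named components."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    return match.groupdict()


def target_path_for(source_path, targetlang="eng"):
    """Map a source file path to the path of its translation.

    Assumes source/ and target/ are sibling directories.
    """
    path = PurePosixPath(source_path)
    fields = parse_name(path.name)
    # Drop the ".<lang>.su.xml" suffix to recover the shared filestem.
    stem = path.name[: -len(f".{fields['lang']}.su.xml")]
    return str(path.parent.parent / "target" / f"{stem}.{targetlang}.su.xml")
```

For example, target_path_for("data/source/bolt-cmn-DF-123-456-789.cmn.su.xml")
yields "data/target/bolt-cmn-DF-123-456-789.eng.su.xml".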

docs/
     filestats.tab      - inventory of source files, including a token
                          count for each file (for Chinese, a token is
                          a character)
     filelist.txt       - inventory of files in this release
     multipost_su.dtd   - DTD for the .su.xml files
     BOLT_Chinese_translation_guidelines_v2.6.pdf
                        - Chinese translation guidelines

3.0 Contents

This release includes Chinese-English source files and translations. 
The following table shows the data volume of this package (count on 
the source side):

      source lang   genre    files    sourceTokens    targetTokens
      --------------------------------------------------------------
      Chinese        DF      1541      1,876,779       1,557,873

Token counts are expressed in characters for the Chinese source and in
words for the English target.

Each .xml file contains the set of specific posts (divided into sentence
units) that were selected and translated for a given thread. The
corresponding full threads can be found in the BOLT - Phase 1 Discussion
Forums Chinese Source Data (LDC201XTXX). The filestems are consistent
between the current release and the full-thread version in LDC201XTXX,
though the file extension is different.

The file docs/filestats.tab contains the inventory of documents
with the token count for each file.

3.1 XML Format
Each .xml file contains the set of specific posts (divided into sentence
units) that were selected and translated for a given thread. Selected posts
may be discontiguous (e.g. a file may contain posts 1 and 9-14). Typically
all SUs for a given post are included, though SUs that contain only quoted
material, for example, may have been excluded from translation.

Posts within each thread are surrounded by <post> tags. Sentence units are
surrounded by <su> tags. A multi-post ID is assigned which is identical to
the filestem. Post ID and SU IDs are also assigned.

See docs/multipost_su.dtd for more details.
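
As a rough illustration of the structure described above, the sketch below
collects the text of each <su> grouped by <post> using Python's standard
library. Only the <post> and <su> element names come from this README; the
root element name and the "id" attributes in the sample are hypothetical,
so consult the DTD for the actual names.

```python
import xml.etree.ElementTree as ET


def read_sentence_units(xml_text):
    """Return one list of SU strings per <post>, in document order."""
    root = ET.fromstring(xml_text)
    posts = []
    for post in root.iter("post"):
        # itertext() gathers all text inside the <su>, including any
        # text nested in child elements.
        sus = ["".join(su.itertext()).strip() for su in post.iter("su")]
        posts.append(sus)
    return posts


# Hypothetical sample: the root element and attribute names are
# assumptions, not taken from the DTD.
SAMPLE = """<multipost id="bolt-cmn-DF-1-2-3">
  <post id="p1">
    <su id="su1">你好</su>
    <su id="su2">谢谢大家</su>
  </post>
  <post id="p9">
    <su id="su1">再见</su>
  </post>
</multipost>"""

posts = read_sentence_units(SAMPLE)
```

To read a file from the release rather than a string, ET.parse(path).getroot()
can take the place of ET.fromstring(xml_text).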


3.2 Encoding

All data are encoded in UTF-8.

4.0 Translation Pipeline

A manual selection procedure was used to choose data appropriate for
translation and distribution to BOLT. Selection criteria included
linguistic features (is the file in Egyptian Arabic or Mandarin Chinese),
and topic features (does the file contain current or dynamic events).

After selection, selected posts were segmented into sentence units (SU).
Then files were reformatted into a human-readable translation format and
were assigned to professional translators for careful translation.
Translators followed LDC's BOLT Translation guidelines, which describe the
makeup of the translation team, the source data format, the translation data
format, best practices for translating certain linguistic features (such as
names and special issues in discussion forum data), and quality control
procedures applied to completed translations.

After translations were completed, bilingual LDC staff performed
quality control by selecting a proportional sample from each delivery and
scrutinizing it for several kinds of mistakes, as described in the
translation guidelines. Low quality translations were returned to
the translators for revision. After quality control was complete,
translation files were validated and reformatted into the release format.

5.0 Translation mark-up
Below are the translation tags that we use to mark certain features in the
target translation:

         Translation:
            - translation alternatives  [intended meaning | literal meaning]
            - correction of typo        =text
            - best guess translation    ((text))
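
Users who need plain English text may want to strip these tags. The sketch
below is one possible approach in Python, not an LDC tool. It assumes that
alternatives are not nested, that "=" directly precedes the corrected
token, and that best guesses do not span other tags; the sample sentence
is invented for illustration. The translation guidelines in docs/ remain
the authority on the tag conventions.

```python
import re

# [intended meaning | literal meaning], assumed not to be nested
ALT_RE = re.compile(r"\[([^\[\]|]*)\|([^\[\]|]*)\]")
# ((best guess translation))
GUESS_RE = re.compile(r"\(\((.*?)\)\)")
# "=" at the start of a token (assumed form of the typo-correction tag)
TYPO_RE = re.compile(r"(?<!\S)=(?=\S)")


def strip_markup(text, prefer="intended"):
    """Remove translation tags, keeping one reading of each alternative."""
    pick = 1 if prefer == "intended" else 2
    text = ALT_RE.sub(lambda m: m.group(pick).strip(), text)
    text = GUESS_RE.sub(lambda m: m.group(1).strip(), text)
    text = TYPO_RE.sub("", text)
    return text


# Invented example sentence, not taken from the corpus.
EXAMPLE = "He [kicked the bucket | hit the pail] in ((Shanghai)) =yesterday."
plain = strip_markup(EXAMPLE)
literal = strip_markup(EXAMPLE, prefer="literal")
```

Keeping the choice of reading as a parameter lets the same function produce
either the intended or the literal variant of each alternative.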

6.0 Notes

For some posts su1 and su2 contain duplicate content. This is due to
subject lines being erroneously included in the translation set. We have
made an effort to exclude this type of duplication in this release.

Similarly, there is some duplication of sentences across multiple posts in
a thread. We have made an effort to remove identical sentences within a
post prior to translating data of this release.

7.0 Sanity Checks

LDC performed the following corpus-wide checks and corrected all errors
found:

      -- Number of source segments matches number of translation segments
         for all files
      -- All non-blank source segments correspond to non-blank translation
         segments
      -- All translation files have a corresponding source file
      -- All files contain only UTF-8 encoded characters, although they may
         contain non-ASCII characters such as Western European characters
      -- Punctuation in translations is ASCII punctuation
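
Users who modify the data may want to re-run similar checks. The sketch
below shows how the first three checks could be expressed in Python. It is
an illustration under stated assumptions, not the validation code LDC
used: segments are assumed to be already extracted as lists of strings,
and the function names are hypothetical.

```python
def check_pair(source_segs, target_segs):
    """Return a list of problems found in one source/translation pair."""
    problems = []
    if len(source_segs) != len(target_segs):
        problems.append(
            f"segment count mismatch: {len(source_segs)} source "
            f"vs {len(target_segs)} target"
        )
    # zip stops at the shorter list, so this compares only aligned segments.
    for i, (src, tgt) in enumerate(zip(source_segs, target_segs), start=1):
        if src.strip() and not tgt.strip():
            problems.append(
                f"segment {i}: non-blank source has blank translation"
            )
    return problems


def missing_sources(source_names, target_names):
    """Return translation filenames that have no matching source file."""
    sources = set(source_names)
    return sorted(
        name for name in target_names
        if name.replace(".eng.su.xml", ".cmn.su.xml") not in sources
    )
```

An empty list from either function means the corresponding check passed.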

8.0 Acknowledgement
This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content
does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.

9.0 Copyright
Portions © 2017 Trustees of the University of Pennsylvania

10.0 References
Jennifer Garland, Stephanie Strassel, Safa Ismael, Zhiyi Song, Haejoong Lee
Linguistic Resources for Genre-Independent Language Technologies: 
User-Generated Content in BOLT
LREC 2012: 8th International Conference on Language Resources and Evaluation,
Istanbul, May 21-27

Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan Wright, 
Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas Thomas, 
Brendan Callahan, Ann Sawyer 
Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT 
Phase 2 Corpus 
LREC 2014: 9th Edition of the Language Resources and Evaluation Conference, 
Reykjavik, May 26-31

Tracey, Jennifer, et al. BOLT Chinese Discussion Forums LDC2016T05. Web Download. 
Philadelphia: Linguistic Data Consortium, 2016.

11.0 Contact
Zhiyi Song		zhiyi@ldc.upenn.edu	Project Manager
Jennifer Tracey		garjen@ldc.upenn.edu	Project Manager
Stephanie Strassel	strassel@ldc.upenn.edu	BOLT PI

--
README Created Zhiyi Song, July 31, 2014
       Updated Zhiyi Song, April 18, 2016