BOLT Arabic Discussion Forum Parallel Training Data
Authors: Zhiyi Song, Jennifer Tracey, Christopher Walker, Stephanie
Strassel


1.0 Introduction

This file contains documentation for BOLT Arabic Discussion Forum Parallel
Training Data.

The parallel text in this release served as training data for Phase 1 of the
DARPA BOLT Program. This corpus contains Egyptian Arabic source text and
corresponding English translations totaling 1,169,599 source tokens,
selected from the Arabic Discussion Forum (DF) data collected for BOLT.

2.0 Package Structure

README.txt - this file

This package comprises two directories:

data/
For each translated source document there is a pair of files in the
release:

  /source/bolt-<sourcelang>-DF-<site_id>-<group_id>-<thread_id>.<sourcelang>.su.xml
and
  /target/bolt-<sourcelang>-DF-<site_id>-<group_id>-<thread_id>.<targetlang>.su.xml

where
    <sourcelang> is arz
    <site_id>    is a numeric ID associated with the web site
    <group_id>   is a numeric ID associated with the forum
    <thread_id>  is a numeric ID associated with the discussion thread
    <targetlang> is eng

docs/
     filestats.tab      - inventory of Arabic source files, including a
                          token count for each file (for Arabic, a token
                          is a word)
     filelist.txt       - inventory of files in this release
     multipost_su.dtd   - DTD for the .su.xml files
     BOLT_Arabic_translation_guidelines_v1.9.pdf
                        - Arabic translation guidelines
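
As an illustration of the naming convention under data/, the following
Python sketch splits a file name into its parts and derives the name of the
matching translation file. The function names and the sample file name are
hypothetical, and the pattern assumes three-letter language codes and purely
numeric IDs, as the placeholder descriptions suggest.

```python
import re

# Pattern built from the naming convention described in this README.
# Assumption: language codes are three lowercase letters and all IDs are digits.
NAME_RE = re.compile(
    r"^bolt-(?P<sourcelang>[a-z]{3})-DF-(?P<site_id>\d+)-(?P<group_id>\d+)-"
    r"(?P<thread_id>\d+)\.(?P<lang>[a-z]{3})\.su\.xml$"
)


def parse_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None


def target_name(source_filename, targetlang="eng"):
    """Derive the translation file name that pairs with a source file name."""
    parts = parse_name(source_filename)
    if parts is None or parts["lang"] != parts["sourcelang"]:
        raise ValueError(f"not a source file name: {source_filename!r}")
    stem = source_filename[: -len(f".{parts['lang']}.su.xml")]
    return f"{stem}.{targetlang}.su.xml"


# Hypothetical file name, used only to show the shape of the result.
example = "bolt-arz-DF-175-181122-8816421.arz.su.xml"
print(parse_name(example))
print(target_name(example))
```

The source file would be looked up under source/ and the derived name under
target/.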

3.0 Contents

The release includes both Arabic source files and their English
translations. The following table shows the data volume of this package
(counted on the source side):

      source lang   genre    files    source tokens
      ---------------------------------------------
      Arabic        DF       2,651    1,169,599

Token counts are expressed in terms of words in Arabic.


The full threads from which the translated posts were selected can be found
in BOLT Arabic Discussion Forums (LDC2018T10). The filestems are consistent
between the current release and the full-thread version in LDC2018T10,
though the file extensions differ.

The file docs/filestats.tab contains the inventory of documents
with the token count for each file.
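
The column layout of filestats.tab is not described in this README. As a
rough sketch only, the Python below assumes a tab-delimited file in which
the last column of each row holds the token count, and it skips any row
(such as a header) whose last column is not a number. The function name and
the sample rows are hypothetical.

```python
import csv


def total_tokens(lines):
    """Return (file_count, token_total) for tab-delimited inventory lines.

    Assumption: the token count is the last column of each data row.
    """
    files = 0
    total = 0
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        try:
            count = int(row[-1].replace(",", ""))
        except ValueError:
            continue  # header or other non-numeric row
        files += 1
        total += count
    return files, total


# Made-up rows that only illustrate the assumed layout.
sample = ["file\ttokens", "a.arz.su.xml\t120", "b.arz.su.xml\t1,080"]
print(total_tokens(sample))
```

For the real inventory, pass an open file handle instead, for example
open("docs/filestats.tab", encoding="utf-8", newline="").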

3.1 XML Format
Each .xml file contains the set of specific posts (divided into sentence
units) that were selected and translated for a given thread. Selected posts
may be discontiguous (e.g. a file may contain posts 1 and 9-14). Typically
all SUs for a given post are included, though SUs that contain only quoted
material, for example, may have been excluded from translation.

Posts within each thread are surrounded by <post> tags. Sentence units are
surrounded by <su> tags. A multi-post ID is assigned which is identical to
the filestem. Post ID and SU IDs are also assigned.

See docs/multipost_su.dtd for more details.
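
The post and SU structure can be read with a standard XML parser. The
Python sketch below relies only on the <post> and <su> element names given
above. The root element name in the sample and the "id" attribute name are
assumptions made for illustration; docs/multipost_su.dtd is the authority
on the actual names.

```python
import xml.etree.ElementTree as ET


def read_sus(xml_text):
    """Return a list of (post_id, su_id, text) tuples in document order.

    Assumption: posts and SUs carry their IDs in an attribute named "id";
    a missing attribute yields None.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for post in root.iter("post"):
        for su in post.iter("su"):
            text = "".join(su.itertext()).strip()
            rows.append((post.get("id"), su.get("id"), text))
    return rows


# Hypothetical document: the root tag and attribute names are illustrative.
sample = (
    '<multipost id="example">'
    '<post id="p1"><su id="su1">Hello</su><su id="su2">World</su></post>'
    '<post id="p9"><su id="su1">Again</su></post>'
    "</multipost>"
)
for row in read_sus(sample):
    print(row)
```

Because source and target files are segmented identically, running the same
function over both files of a pair gives lists that can be aligned by
position.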


3.2 Encoding

All data are encoded in UTF-8.

4.0 Translation Pipeline

A manual selection procedure was used to choose data appropriate for
translation and distribution to BOLT. Selection criteria included
linguistic features (is the file in Egyptian Arabic or Mandarin Chinese)
and topic features (does the file contain current or dynamic events).

After selection, the chosen posts were segmented into sentence units (SUs).
The files were then reformatted into a human-readable translation format and
assigned to professional translators for careful translation.
Translators followed LDC's BOLT Translation guidelines, which describe the
makeup of the translation team, the source data format, the translation data
format, best practices for translating certain linguistic features (such as
names and special issues in discussion forum data), and quality control
procedures applied to completed translations.

After translations were completed, bilingual LDC staff performed
quality control by selecting a proportional sample from each delivery and
scrutinizing it for several kinds of mistakes, as described in the
translation guidelines. Low-quality translations were returned to
the translators for revision. After quality control was complete,
translation files were validated and reformatted into the release format.

5.0 Translation mark-ups
Below are the translation tags that we use to mark certain features in the
target translation:

         Translation:
            - translation alternatives  [intended meaning | literal meaning]
            - correction of typo        =text
            - best guess translation    ((text))
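
For users who need plain English text, the Python sketch below shows one
possible way to strip these tags. It keeps the intended meaning from
[intended meaning | literal meaning], unwraps ((text)), and drops the
leading "=" from a typo correction. The function name is hypothetical, and
the assumption that "=" marks a single whitespace-delimited token is ours;
the translation guidelines in docs/ describe the exact conventions.

```python
import re

# [intended meaning | literal meaning]
ALT_RE = re.compile(r"\[\s*([^\[\]|]+?)\s*\|\s*([^\[\]|]+?)\s*\]")
# ((best guess))
GUESS_RE = re.compile(r"\(\((.+?)\)\)")
# =correction (assumption: the marker covers one whitespace-delimited token)
TYPO_RE = re.compile(r"(?<!\S)=(\S+)")


def strip_markup(text):
    """Return the text with translation tags reduced to plain words."""
    text = ALT_RE.sub(r"\1", text)    # keep the intended meaning
    text = GUESS_RE.sub(r"\1", text)  # keep the best guess
    text = TYPO_RE.sub(r"\1", text)   # keep the corrected form
    return text


# Made-up sentence that exercises all three tags.
print(strip_markup("He [died | kicked the bucket] in ((Tanta)) =yesterday"))
```

Keeping the literal meaning instead would only require substituting the
second group in the first replacement.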

6.0 Notes

For some posts, su1 and su2 contain duplicate content. This is due to
subject lines being erroneously included in the translation set. We have
made an effort to exclude this type of duplication in this release.

Similarly, there is some duplication of sentences across multiple posts in
a thread. We have made an effort to remove identical sentences within a
post prior to translating data of this release.

7.0 Sanity Checks

LDC performed the following corpus-wide checks and corrected all errors
found:

      -- Number of source segments matches number of translation segments
         for all files
      -- All non-blank source segments correspond to non-blank translation
         segments
      -- All translation files have a corresponding source file
      -- All files contain only UTF-8 encoded characters, although they may
         contain non-ASCII characters such as Western European characters
      -- Punctuation in translations is ASCII punctuation
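
Users who modify the data may want to re-run similar checks. The Python
sketch below is not LDC's validation code. It is a minimal illustration of
three of the checks above (matching segment counts, no blank translation
for a non-blank source segment, ASCII-only punctuation in translations)
applied to one source/translation pair. The function name and message
wording are hypothetical.

```python
import unicodedata


def check_pair(source_sus, target_sus):
    """Return a list of problem descriptions for one source/translation pair."""
    problems = []
    if len(source_sus) != len(target_sus):
        problems.append(
            f"segment count mismatch: {len(source_sus)} source vs "
            f"{len(target_sus)} translation"
        )
    for i, (src, tgt) in enumerate(zip(source_sus, target_sus), start=1):
        if src.strip() and not tgt.strip():
            problems.append(f"segment {i}: blank translation")
        # Unicode general categories starting with "P" are punctuation.
        bad = [
            ch for ch in tgt
            if ord(ch) > 127 and unicodedata.category(ch).startswith("P")
        ]
        if bad:
            problems.append(f"segment {i}: non-ASCII punctuation {''.join(bad)!r}")
    return problems


# Made-up segments: the second translation is blank.
print(check_pair(["first", "second"], ["one", ""]))
```

An empty list means the pair passed all three checks.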

8.0 Acknowledgement
This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content
does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.

--
README created by Zhiyi Song, July 31, 2014