GALE Phase 3 and 4 Arabic BN Parallel Text Part 2
Authors: Zhiyi Song, Gary Krug, Stephanie Strassel

1.0 Introduction

This file contains documentation for GALE Phase 3 and 4 Arabic BN Parallel Text Part 2.

Along with other corpora, the parallel text in this release comprised
training data for Phase 3 and 4 of the DARPA GALE Program. This corpus
contains Modern Standard Arabic (MSA) source text and corresponding English
translations for 166,081 tokens, selected from Broadcast News (BN) data
collected and transcribed in GALE. The audio corresponding to the source
files in this release is distributed separately.

2.0 Package Structure

This package comprises two directories:

data/

The data directory is divided into the "source" directory and the
"translation" directory.  The "source" directory contains files in the
source language. The "translation" directory contains the translated files.

File names encode the data source, program, source language, and broadcast
date and start time, following the pattern:

{SRC}_{PRG}_{LNG}_YYYYMMDD_hhmmss(.fileTypeExtension)

     where - {SRC} is the source ID (e.g., CNN, VOA, etc.)
           - {PRG} is the program ID (e.g., LARRYKING, etc.)
           - {LNG} is the three-letter language ID defined in
             ISO 639-3.  ARB is Standard Arabic; CMN is Chinese,
             Mandarin; ENG is English.
           - YYYYMMDD is the data collection (broadcast) date.
           - hhmmss   is the start time of the program (hh is the hour
             in the 24-hour format)

File stems for source and translation file pairs are the same. Source files
use the .tdf extension, while translation files use the eng.tdf extension.
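
The naming convention above can be parsed mechanically. The following
Python sketch is illustrative only: it assumes that {SRC} and {PRG} contain
no underscores and that a source file named stem.tdf is paired with
stem.eng.tdf. The function names and the example file name are hypothetical
and are not taken from this release.

```python
import re
from typing import Dict, Optional

# Pattern for {SRC}_{PRG}_{LNG}_YYYYMMDD_hhmmss(.fileTypeExtension).
# Assumption: SRC and PRG consist of letters and digits only.
NAME_RE = re.compile(
    r"^(?P<src>[A-Za-z0-9]+)_(?P<prg>[A-Za-z0-9]+)_(?P<lng>[A-Z]{3})"
    r"_(?P<date>\d{8})_(?P<time>\d{6})(?P<ext>(?:\.[A-Za-z0-9]+)*)$"
)


def parse_name(file_name: str) -> Optional[Dict[str, str]]:
    """Split a file name into its components, or return None if it
    does not follow the naming pattern."""
    match = NAME_RE.match(file_name)
    return match.groupdict() if match else None


def translation_name(source_name: str) -> str:
    """Derive the translation file name from a source file name.

    Assumption: 'stem.tdf' is paired with 'stem.eng.tdf'.
    """
    if source_name.endswith(".eng.tdf") or not source_name.endswith(".tdf"):
        raise ValueError("not a source file name: " + source_name)
    return source_name[: -len(".tdf")] + ".eng.tdf"
```

For a hypothetical name such as ALAM_NEWSRPT_ARB_20070115_180000.tdf,
parse_name returns the source ID, program ID, language ID, date, start time
and extension as separate strings; names that do not fit the pattern yield
None rather than raising an error.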

docs/

The docs directory contains documentation related to the release.

           docs/README.txt - this file
           docs/doc_list.txt - inventory of source and translation files
           with the token count for each file
           docs/file_list.txt - inventory of files in this release
           docs/GALE_Arabic_Translation_Guidelines_V2_7.pdf - translation
           guidelines
           docs/GALE_TranscriptionTranslationMarkup_V25.xls - explanation
           of special symbols
           docs/program_summary.txt - audio programs in this release
           docs/TDF_format.txt - TDF format description

3.0 Contents

This release includes 45 source-translation document pairs, comprising
166,081 words of translated data. Data is drawn from 23 distinct Arabic
broadcast news (BN) sources.

The following table is a summary of the files by data source included in
this release.

      Source       Program         Epoch         Tokens
      ABUDHABI     ABUDHNEWS2      2007.02       2954
      ABUDHABI     ABUDHNEWS       2007.01       5569
      ABUDHABI     NEWSHOUR        2008.04       5394
      ALAM         NEWSRPT         2007.01       6572
      ALAM         NEWSRPT         2007.01       6113
      ALBAGHDADYA  BAGHDADYANEWS   2008.04       3389
      ALHURRA      THEWORLDNOW     2008.01       3203
      ALURDUNYA    URDUNYANEWS     2007.03       1794
      ARABIYA      ALARABIYANEWS2  2007.03       2762
      ARABIYA      ALARABIYANEWS2  2007.03       2841
      ARABIYA      LATEHRNEWS      2007.02       2064
      ARABIYA      PANORAMA        2007.02       6165
      ARABIYA      PANORAMA        2007.03       5907
      ARABIYA      PANORAMA        2007.03       6157
      ARABIYA      PANORAMA        2007.03       5923
      ARABIYA      PANORAMA        2008.03       6436
      DUBAI        DUBAINEWS2      2007.02       2821
      DUBAI        DUBAINEWS2      2007.03       2734
      DUBAI        DUBAINEWS2      2007.03       2983
      IRAQIYAH     ECONRPT         2007.01       1971
      IRAQIYAH     ECONRPT         2007.01       1423
      IRAQIYAH     ECONRPT         2007.02       1572
      IRAQIYAH     IRAQINEWS       2008.03       3528
      IRAQIYAH     IRAQINEWS       2008.03       2187
      IRAQIYAH     IRAQINEWS       2008.03       3245
      IRAQIYAH     IRAQINEWS       2008.03       3696
      IRAQIYAH     IRAQTDY         2007.03       1813
      KUWAITTV     NEWS            2007.01       2659
      KUWAITTV     NEWS            2007.02       3178
      LBC          NEWS            2007.01       5086
      LBC          NEWS            2007.02       3353
      OMANTV       NEWS            2008.03       3021
      SAUDITV      SAUDINEWS2      2008.03       5174
      SAWA         SAWANEWS        2008.01        641
      SAWA         SAWANEWS        2008.02       1897
      SAWA         SAWANEWS        2008.02       2720
      SAWA         SAWANEWS        2008.03        737
      SCOLA        JORDNNSCO       2007.01       3386
      SCOLA        JORDNNSCO       2007.03       2999
      SCOLA        SAUDNNSCO       2007.02       2977
      SYRIANTV     NEWS25          2007.01       4851
      SYRIANTV     NEWS25          2007.01       6899
      SYRIANTV     NEWS25          2007.02       5697
      SYRIANTV     NEWS25          2007.02       5247
      SYRIANTV     NEWS25          2007.02       4343

Token counts are expressed in terms of words for Arabic (using the regular
expression \w+) and are taken from the source data.
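
As a rough illustration of this counting method, the Python sketch below
counts maximal runs of word characters. It assumes the intended expression
is \w+ applied to Unicode text; the function name is hypothetical. Regular
expression engines differ in which characters \w matches, so counts
produced this way may not reproduce the figures in docs/doc_list.txt
exactly.

```python
import re

# Assumption: a token is a maximal run of Unicode word characters (\w+).
TOKEN_RE = re.compile(r"\w+")


def count_tokens(text: str) -> int:
    """Return the number of \\w+ matches in the given text."""
    return len(TOKEN_RE.findall(text))
```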

The file called docs/file_list.txt contains a complete list of files in the
package. The file docs/doc_list.txt contains the inventory of documents
with the token count for each file.

3.1 TDF Format

TDF files are tab-delimited text files in which each line contains one
segment of text along with meta information about that segment. Each field
in the TDF file is described in docs/TDF_format.txt.

A source TDF file and its translation are the same except that the
transcript in the source TDF is replaced by its English translation.
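
A minimal Python sketch of reading such tab-delimited data and comparing a
source row with its translation row follows. It is deliberately generic and
makes no assumption about field names or field order, which are defined in
docs/TDF_format.txt. The function names are hypothetical, and any header or
comment lines in a real file are returned as ordinary rows.

```python
import csv
from typing import Iterable, List


def read_tab_delimited(lines: Iterable[str]) -> List[List[str]]:
    """Split each non-empty line into its tab-separated fields.

    Quoting is disabled so that quotation marks inside transcripts
    are kept as literal characters.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def differing_columns(src_row: List[str], tgt_row: List[str]) -> List[int]:
    """Return the 0-based indices of fields that differ between a
    source row and the corresponding translation row."""
    return [i for i, (a, b) in enumerate(zip(src_row, tgt_row)) if a != b]
```

A file can be passed in directly, for example with
open(path, encoding="utf-8", newline=""). If a source file and its
translation differ only in the transcript, differing_columns should report
a single index for each segment row.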

3.2 Encoding

All data are encoded in UTF-8.

4.0 Translation Pipeline

A manual selection procedure was used to choose data appropriate for
translation and distribution to GALE. Selection criteria included
linguistic features (is the file in MSA), transcription features (is the
transcription good enough to produce a viable translation) and topic
features (does the file contain news, current events or human interest
topics).

Before audio files can be translated, they must be transcribed. The files
in this release were transcribed by LDC staff and/or transcription vendors
under contract to LDC. In addition to producing a verbatim transcript,
transcribers also indicated sentence boundaries. Sentence boundaries and
overall transcript quality were verified by LDC staff before sending files
out for translation.

After transcription and sentence unit (SU) annotation, files were
reformatted into a human-readable translation format and assigned to
professional translators for careful translation. Translators followed
LDC's GALE translation guidelines, which describe the makeup of the
translation team, the source data format, the translation data format,
best practices for translating
certain linguistic features (such as names and speech disfluencies), and
quality control procedures applied to completed translations. Transcribers
and translators used special markup to indicate particular linguistic
features, for instance unintelligible speech, partial words and typos in
the transcript; these uses are described in the documentation accompanying
this release.

After translations were completed, bilingual LDC staff performed
quality control by selecting a proportional sample from each delivery and
scrutinizing it for several kinds of mistakes, as described in the
translation guidelines. Low quality translations were returned to
the translators for revision.  After quality control was complete,
translation files were validated and reformatted into the release format.

5.0 Sanity Checks

LDC performed the following corpus-wide checks and corrected all errors
found:

      -- Number of source segments matches number of translation segments
         for all files (except full source text)
      -- Timestamps are identical between selected source and translation
      -- All non-blank source segments correspond to non-blank translation
         segments
      -- All translation files have a corresponding full source file,
         selected source file, and index file
      -- All files contain only UTF-8 encoded characters, although they may
         contain non-ASCII characters such as Western European characters
      -- Punctuation in translations is ASCII punctuation
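
Some of the checks above can be approximated as follows. This Python sketch
is not the validation code LDC used; it assumes that each segment has
already been reduced to a (start, end, text) triple, and the Segment type
and function names are hypothetical.

```python
import unicodedata
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # segment start time
    end: float    # segment end time
    text: str     # transcript or translation text


def check_pair(source: List[Segment], translation: List[Segment]) -> List[str]:
    """Compare aligned source and translation segments and return a
    list of problem descriptions (empty if none are found)."""
    if len(source) != len(translation):
        return ["segment count mismatch"]
    problems = []
    for i, (src, tgt) in enumerate(zip(source, translation)):
        if (src.start, src.end) != (tgt.start, tgt.end):
            problems.append(f"segment {i}: timestamps differ")
        if src.text.strip() and not tgt.text.strip():
            problems.append(f"segment {i}: blank translation")
    return problems


def non_ascii_punctuation(text: str) -> List[str]:
    """Return punctuation characters in text that fall outside ASCII,
    using the Unicode general category (P*) to identify punctuation."""
    return [
        ch for ch in text
        if ord(ch) > 127 and unicodedata.category(ch).startswith("P")
    ]
```

Returning descriptions instead of raising on the first failure makes it
possible to report every problem in a file pair in a single pass.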

6.0 Acknowledgement

This work was supported in part by the Defense Advanced Research Projects
Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this
publication does not necessarily reflect the position or the policy of the
Government, and no official endorsement should be inferred.

7.0 Content Copyright

[To be supplied by Publications/IPR]

----
README Created 29 June, 2011 Gary Krug
       Updated 31 January, 2011 Zhiyi Song
       Updated 20 February, 2012 Stephanie Strassel
       Updated 30 April, 2012 Zhiyi Song