README FILE FOR LDC CATALOG ID: LDC2026T06
TITLE: LORELEI Multiway Translation Corpus
AUTHORS: Jennifer Tracey, Stephanie Strassel, Dave Graff, Jonathan
Wright, Song Chen, Dana Delgado, Michael Arrigo
1.0 Introduction
This corpus is a compilation of parallel text data, consisting of translations
from a fixed set of English texts into 24 languages that were developed by the
Linguistic Data Consortium for the DARPA LORELEI Program. The languages presented
here comprise the full set of Representative Language (RL) Packs created by the
LDC between 2016 and 2019 under the DARPA LORELEI Program, and are tabulated in
section 3 below.
The common set of 100,000 English words translated into every RL contained four
components: approximately 50% general English news documents and 25%
LORELEI-domain English news documents, with the remaining 25% consisting of a
phrasebook and an
elicitation corpus originally developed for the REFLEX (Research on English and
Foreign Language Exploitation) Program and subsequently updated for LORELEI
(Alvarez et al., 2006). The Elicitation Corpus is designed to elicit linguistic
structures. The phrasebook contains everyday colloquial phrases.
The English texts supplied to the 24 sets of translators fall into three
categories:
* Phrasebook: 1123 phrases drawn from (or typical of) conversational data
* Elicitation: 2600 phrases created to expose grammatical paradigms
* News: at least 190 stories from various text and broadcast news sources
The Phrasebook data set contains a total of 6401 English word tokens (1066
distinct word forms), with phrases ranging between 1 and 29 words each; the
majority of phrases (817 out of 1123) are between 3 and 7 words long.
The Elicitation data set contains a total of 13,896 English word tokens (623
distinct forms), including "placeholder" tokens for proper nouns or other
terms intended to be localized to each language (e.g. "currency_unit",
"male_name_1", "city_name_1", etc.); phrases range between 2 and 13 words
each, with the majority (2188 out of 2600) being between 3 and 7 words. As an
additional feature, most (1433) of the items include a description of context
to guide the translator in choosing particular inflections or grammatical
formations (e.g. phrase: "Who will not let me leave?", context: "Who = one
person; me = one man"). There are many sets of entries where the same phrase
is used but with varied contexts (some phrases are used with as many as 28
different contexts).
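As an illustration of how the placeholder tokens might be handled, the
following Python sketch finds placeholder-like tokens in a phrase and
substitutes localized values. The regular expression is an assumption based
only on the three examples quoted above (lowercase words joined by
underscores, with an optional numeric suffix), and the phrase and replacement
values are invented; this is not the procedure used to produce the localized
files in this corpus.

```python
import re

# Assumed placeholder shape, inferred from "currency_unit", "male_name_1" and
# "city_name_1": lowercase words joined by underscores, optional digit suffix.
PLACEHOLDER_RE = re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b")


def find_placeholders(phrase):
    """Return placeholder-like tokens in order of appearance."""
    return PLACEHOLDER_RE.findall(phrase)


def localize(phrase, values):
    """Replace placeholders found in `values`; leave unknown ones untouched."""
    return PLACEHOLDER_RE.sub(
        lambda m: values.get(m.group(0), m.group(0)), phrase
    )


# Invented phrase and values, for illustration only.
phrase = "male_name_1 paid ten currency_unit in city_name_1."
print(find_placeholders(phrase))
print(localize(phrase, {"male_name_1": "Ali", "city_name_1": "Cairo"}))
```

Placeholders without a supplied value are left as-is, which makes it easy to
spot any that were missed.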
The News data includes an overall inventory of 208 English news stories from
various broadcast and print sources. Among the 24 Language Packs, 15 used a
subset of 190 stories, and the other 9 used a slightly different subset of 198
stories; the two subsets have 178 stories in common - that is, 178 stories
have been translated into all 24 languages, another 12 stories have been
translated into 15 of the languages, and a separate group of 20 stories have
been translated into the other 9 languages. See news_lang.tab in the docs
directory for information on which stories are translated into which languages.
Altogether, the 208 stories yield 92,320 word tokens, and 12,544 distinct
forms (including upper-/lower-case distinctions). Each story was presented
to translators as a sequence of sentences, using a format that ensured
sentence-level alignment for every translation.
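The story/language coverage described above can be recomputed from
news_lang.tab. The following Python sketch assumes only the two-column,
tab-delimited layout described in section 2 (language, then news doc-id); it
also assumes the file has no header row, which this README states only for
the "*.tab" files under data/. The doc-ids in the sample are invented.

```python
from collections import defaultdict


def stories_by_language(rows):
    """Map each language code to the set of news doc-ids listed for it.

    `rows` is any iterable of lines with two tab-separated fields:
    language code, then doc-id.
    """
    by_lang = defaultdict(set)
    for line in rows:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        lang, doc_id = line.split("\t")[:2]
        by_lang[lang].add(doc_id)
    return dict(by_lang)


def common_stories(by_lang):
    """Return the doc-ids that are listed under every language."""
    sets = list(by_lang.values())
    return set.intersection(*sets) if sets else set()


# Invented sample rows; real doc-ids come from docs/news_lang.tab.
sample = [
    "aka\tDOC_001\n",
    "aka\tDOC_002\n",
    "hau\tDOC_002\n",
    "hau\tDOC_003\n",
]
by_lang = stories_by_language(sample)
print({lang: len(docs) for lang, docs in by_lang.items()})
print(sorted(common_stories(by_lang)))
```

To run this on the actual file, pass an open file handle in place of the
sample list, e.g. stories_by_language(open("docs/news_lang.tab",
encoding="utf-8")).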
Seven of the language packs (Arabic, Farsi, Mandarin, Hungarian, Russian,
Spanish, Vietnamese) employed crowd-sourced translation ("Mechanical Turk")
for the Phrasebook data and/or some or most of the news stories (between 40
and 185 stories per language); apart from these, all other data were handled
by professional translators. For the cases where crowd sourcing was used, we
generally have two or more alternative translations for each English file.
See sections 2 and 4 below for more details on crowd-sourced translations.
2.0 Corpus Structure
The corpus directory structure is as follows:
  ./tools -- contains a subdirectory "ltf2txt", which contains a perl script
             for converting "ltf.xml" files to raw text (see section 5)

  ./dtds  -- contains DTD specification files for the "ltf", "psm" and
             "cstrans_tab" XML file formats

  ./docs  -- contains this README file; an "elicitation_template.txt", which
             lists the 2600 phrases in the Elicitation data set, arranged as
             3-line paragraphs (segment-id, phrase, context); and a
             "news_lang.tab", which contains 2 columns (language in column 1,
             news doc-id in column 2), with one row for every language/story
             pair present in the package

  ./data  -- contains a subdirectory for each of 25 languages (English and the
             24 "representative" languages), using the 3-letter ISO 639-3 code
             for each language

  ./data/eng/
    elicitation/ -- contains one file, "elicitation_template.tab", with the
                    2600 elicitation phrases as a tab-delimited table of 3
                    columns/row: (segment-id, phrase, context if any); this
                    version contains placeholder strings (male_name_1, etc.)
    phrasebook/  -- contains one file, "phrasebook.tab", with 1123 phrases as
                    a tab-delimited table of two columns (segment-id, phrase)
    news/        -- contains two subdirectories, "ltf" and "psm", each having
                    208 XML files ({FILE_ID}.ltf.xml and {FILE_ID}.psm.xml,
                    respectively)

  ./data/{lng}/
    elicitation/ -- contains two files:
        elicitation.{lng}_done.tab    - translated elicitation text (2 columns)
        elicitation.eng_for_{lng}.tab - elicitation_template.tab modified with
                                        {lng}-tailored values for placeholders
                                        (3 columns: segment-id, phrase, context)
    phrasebook/  -- contains one file, "phrasebook.{lng}*.tab", with 1123
                    phrases; 2-column tab-delimited table (segment-id, phrase)
    news/        -- contains 2 or 3 subdirectories: "ltf" and "psm" are
                    always present, containing XML data for up to 198 stories
                    ({FILE_ID}.{lng}*.ltf.xml and {FILE_ID}.{lng}*.psm.xml,
                    respectively)
    cstrans_tab/ -- for the 7 languages with crowd-source data, the
                    "cstrans_tab" directory contains a "*.cstrans_tab.xml"
                    file for each affected English source file, including all
                    translation versions for each phrase/sentence, along with
                    additional information (see section 4.3 for details)
In the file-name patterns shown above for translated phrasebook and news
files, asterisks are used to indicate that the patterns are variable. In
particular, for files that originated from crowd-sourced translations, we
append an underscore and a single capital letter to the "{lng}" portion of the
file name to distinguish the multiple translation versions for a given story;
e.g. the "data/ara/news/ltf/" file inventory includes:
NW_AFP_ENG_0011_20030417.ara_A.ltf.xml -\
NW_AFP_ENG_0011_20030417.ara_B.ltf.xml - 3 versions of crowd translation
NW_AFP_ENG_0011_20030417.ara_C.ltf.xml -/
...
NW_AFP_ENG_0012_20030419.ara.ltf.xml -- a single professional translation
Meanwhile, the "data/ara/news/cstrans_tab/" directory includes (among its 186
files):
NW_AFP_ENG_0011_20030417.ara.cstrans_tab.xml
A similar pattern is used for the 5 cases where "phrasebook" translations were
crowd-sourced.
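The naming scheme above can be unpacked mechanically. The following Python
sketch is one possible way to split a news file name into its parts and to
group the crowd-sourced versions of each story; the regular expression is
inferred from the examples shown here and is an assumption, not a
specification of every file name in the corpus.

```python
import re

# {FILE_ID}[.{lng}[_{VERSION}]].{ltf|psm|cstrans_tab}.xml
# The language part is optional so that English source files also match.
NAME_RE = re.compile(
    r"^(?P<file_id>.+?)"
    r"(?:\.(?P<lang>[a-z]{3})(?:_(?P<version>[A-Z]))?)?"
    r"\.(?P<fmt>ltf|psm|cstrans_tab)\.xml$"
)


def parse_name(name):
    """Split a file name into file_id, lang, version and fmt."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name}")
    return match.groupdict()


def group_versions(names):
    """Map each file_id to its version letters (None = single translation)."""
    groups = {}
    for name in names:
        info = parse_name(name)
        groups.setdefault(info["file_id"], []).append(info["version"])
    return groups


names = [
    "NW_AFP_ENG_0011_20030417.ara_A.ltf.xml",
    "NW_AFP_ENG_0011_20030417.ara_B.ltf.xml",
    "NW_AFP_ENG_0011_20030417.ara_C.ltf.xml",
    "NW_AFP_ENG_0012_20030419.ara.ltf.xml",
]
print(group_versions(names))
```

A version of None marks a file with no version suffix, which in the listing
above corresponds to a single professional translation.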
Note that all the "*.tab" files under the data/ directory contain data rows
only, with no initial header row. The "ltf" and "psm" XML formats are
explained in section 4 below.
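Because the "*.tab" files have no header row, they can be read with plain
string splitting. The following Python sketch loads a 2- or 3-column table
keyed on segment-id and pairs English phrases with their translations. It
assumes that segment-ids are unique within a file and that the same
segment-id identifies the same phrase in the English and translated files;
the sample rows and segment-ids are invented.

```python
def read_tab(lines, ncols):
    """Read headerless tab-delimited rows into {segment_id: (col2, ...)}.

    Rows with fewer than `ncols` fields are padded with empty strings.
    """
    table = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        fields += [""] * (ncols - len(fields))
        table[fields[0]] = tuple(fields[1:ncols])
    return table


def align(source, target):
    """Pair source and target phrases that share a segment-id."""
    return [
        (seg_id, source[seg_id][0], target[seg_id][0])
        for seg_id in source
        if seg_id in target
    ]


# Invented rows standing in for phrasebook.tab and a translated phrasebook.
english = ["p1\tHello.\n", "p2\tThank you.\n", "p3\tGoodbye.\n"]
spanish = ["p1\tHola.\n", "p2\tGracias.\n"]
pairs = align(read_tab(english, 2), read_tab(spanish, 2))
print(pairs)
```

For the 3-column English elicitation files, calling read_tab(lines, 3) keeps
the context string as the second element of each tuple.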
3.0 Content Summary
The following table lists the 24 languages, together with language-specific
details about the data inventory. The phrasebook and elicitation corpus were
professionally translated into all languages, except where noted under
#Crowd_sourced in the table below:
Abbrv  Full_name    #News  #Crowd_sourced
-----------------------------------------------
aka    Akan         190    0
amh    Amharic      190    0
ara    Arabic       190    185 + phrasebook
ben    Bengali      196    0
cmn    Mandarin     190    183 + phrasebook
fas    Farsi        190    40 + phrasebook
hau    Hausa        198    0
hin    Hindi        190    0
hun    Hungarian    190    52
ind    Indonesian   190    0
rus    Russian      190    185
som    Somali       190    0
spa    Spanish      190    185 + phrasebook
swa    Swahili      190    0
tam    Tamil        198    0
tgl    Tagalog      198    0
tha    Thai         197    0
tur    Turkish      198    0
ukr    Ukrainian    190    0
uzb    Uzbek        198    0
vie    Vietnamese   190    40 + phrasebook
wol    Wolof        190    0
yor    Yoruba       198    0
zul    Zulu         190    0
Note that Thai and Bengali translators were supplied with the 198-story news
set, but one or two stories (respectively) remain untranslated.
Also note that among the languages with crowd-sourced translations of news
stories, the same set of 5 stories was always assigned to professional
translators (along with any additional stories where crowd sourcing failed to
produce a usable amount of translation).
4.0 Data Formats
4.1 PSM.xml -- Primary Source Markup Data
The "homogenized" XML format described above preserves the minimum set of tags
needed to represent the structure of the relevant text as seen by the human
web-page reader. When the text content of the XML file is extracted to create
the "rsd" format (which contains no markup at all), the markup structure is
preserved in a separate "primary source markup" (psm.xml) file, which
enumerates the structural tags in a uniform way, and indicates, by means of
character offsets into the rsd.txt file, the spans of text contained within
each structural markup element.
For example, in a discussion-forum or web-log page, there would be a division
of content into the discrete "posts" that make up the given thread, along with
"quote" regions and paragraph breaks within each post. After the HTML has
been reduced to uniform XML, and the tags and text of the latter format have
been separated, information about each structural tag is kept in a psm.xml
file, preserving the type of each relevant structural element, along with its
essential attributes ("post_author", "date_time", etc.), and the character
offsets of the text span comprising its content in the corresponding rsd.txt
file.
4.2 LTF.xml -- Logical Text Format Data
The "ltf.xml" data format is derived from rsd.txt, and contains a fully
segmented and tokenized version of the text content for a given web page.
Segments (sentences) and the tokens (words) are marked off by XML tags (SEG
and TOKEN), with "id" attributes (which are only unique within a given XML
file) and character offset attributes relative to the corresponding rsd.txt
file; TOKEN tags have additional attributes to describe the nature of the
given word token.
The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly by
suitable punctuation in the original source data. To the extent that sentence
boundaries cannot be accurately detected (due to variability or ambiguity in
the source data), the segmentation process will tend to err more often on the
side of missing actual sentence boundaries, and (we hope) less often on the
side of asserting false sentence breaks.
The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play particular
roles in web-based text (e.g. URLs, email addresses and hashtags). To the
extent that word boundaries are not explicitly marked in the source text, the
LTF tokenization is intended to divide the raw-text character stream into
units that correspond to "words" in the linguistic sense (i.e. basic units of
lexical meaning).
Software is included to convert ltf.xml files to "raw source data" plain text
files ("rsd.txt") -- see section 5 below. The character offsets used in LTF
are based on the "rsd.txt" files, which contain just the text that is visible
to a person reading the original source, with normalized white-space
characters (including line breaks), but without markup of any kind.
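As a rough illustration of reading the SEG and TOKEN structure, the following
Python sketch parses a small hand-written sample with the standard library.
The sample document and the offset attribute names ("start_char",
"end_char") are assumptions made for this example; the authoritative element
and attribute names are those in the LTF DTD under ./dtds.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, not taken from the corpus. Offsets are treated as
# inclusive character positions in the text "Hello world."
SAMPLE = """<DOC id="sample_doc">
<SEG id="segment-0" start_char="0" end_char="11">
<TOKEN id="token-0-0" start_char="0" end_char="4">Hello</TOKEN>
<TOKEN id="token-0-1" start_char="6" end_char="10">world</TOKEN>
<TOKEN id="token-0-2" start_char="11" end_char="11">.</TOKEN>
</SEG>
</DOC>"""


def read_segments(xml_text):
    """Return [(segment_id, [(token_id, start, end, text), ...]), ...]."""
    root = ET.fromstring(xml_text)
    result = []
    for seg in root.iter("SEG"):
        tokens = [
            (
                tok.get("id"),
                int(tok.get("start_char")),
                int(tok.get("end_char")),
                tok.text or "",
            )
            for tok in seg.iter("TOKEN")
        ]
        result.append((seg.get("id"), tokens))
    return result


for seg_id, tokens in read_segments(SAMPLE):
    print(seg_id, [text for _, _, _, text in tokens])
```

Using iter() rather than a fixed path keeps the sketch independent of how
deeply the SEG elements are nested in the real files.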
4.3 CSTRANS_TAB.xml -- Crowd-source Translation Tables
The "./data/{lng}/cstrans_tab/" directories contain one "*.cstrans_tab.xml"
file for each English source file that was submitted to translation via crowd
sourcing. Each file contains a DOC element (with "id" and "lang" attributes),
which in turn contains a "SEG" element for each "SEG" in the corresponding
English ltf.xml file. Each "SEG" element may either be an empty tag (if no
usable translations were submitted for the given segment), or contain one or
more "TR" elements, each of which is an alternative translation for the given
source segment. In either case, the "SEG" tag has an "id" attribute (unique
within the given xml file, matching the SEG "id" value in ltf.xml), and an
"ntrs" attribute (whose value is the number of "TR" elements present). For
example:
...
...
The attributes of the "TR" elements are as follows:
- translatorid -- an alphanumeric string unique to each contributor; note
that each translation "version" (_A, _B, etc.) is likely to contain segments
from different translators
- avg_gold_ter -- floating-point numeric or "Unk"; it represents the
"translation error rate" relative to a "gold-standard" manual translation
(lower value == better match)
- score -- floating-point numeric or "None"
- mt_ter -- always floating-point numeric; it represents the "machine
translation error rate" relative to a "google-translate" reference (lower
value == better match)
- nonwhitesp, odd_ch -- always integer numerics: the count of non-whitespace
characters in the string, and the count of characters that are "not in the
expected language" (this can include emoticons, non-printing characters, and
characters in foreign scripts)
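The following Python sketch shows one way to read a cstrans_tab file and pick
a single translation per segment. The element and attribute names follow the
description above; storing the translation as the text content of each "TR"
element is an assumption, the sample document is invented, and choosing the
lowest mt_ter is only an example criterion, not a recommended selection
method.

```python
import xml.etree.ElementTree as ET

# Invented sample following the structure described in this section.
SAMPLE = """<DOC id="EXAMPLE_DOC" lang="ara">
<SEG id="segment-0" ntrs="2">
<TR translatorid="t01" avg_gold_ter="Unk" score="None" mt_ter="0.62" nonwhitesp="12" odd_ch="0">first version</TR>
<TR translatorid="t02" avg_gold_ter="0.31" score="0.8" mt_ter="0.45" nonwhitesp="13" odd_ch="1">second version</TR>
</SEG>
<SEG id="segment-1" ntrs="0"/>
</DOC>"""


def to_float(value):
    """Return a float, or None for non-numeric markers ("Unk", "None")."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def read_cstrans(xml_text):
    """Return {segment_id: [translation records]} and check "ntrs"."""
    root = ET.fromstring(xml_text)
    segments = {}
    for seg in root.iter("SEG"):
        records = [
            {
                "translatorid": tr.get("translatorid"),
                "avg_gold_ter": to_float(tr.get("avg_gold_ter")),
                "score": to_float(tr.get("score")),
                "mt_ter": float(tr.get("mt_ter")),
                "text": (tr.text or "").strip(),
            }
            for tr in seg.findall("TR")
        ]
        if int(seg.get("ntrs")) != len(records):
            raise ValueError(f"ntrs mismatch in {seg.get('id')}")
        segments[seg.get("id")] = records
    return segments


def lowest_mt_ter(records):
    """Return the record with the lowest mt_ter, or None if there are none."""
    return min(records, key=lambda r: r["mt_ter"]) if records else None


table = read_cstrans(SAMPLE)
for seg_id, records in table.items():
    best = lowest_mt_ter(records)
    print(seg_id, best["text"] if best else "(no usable translation)")
```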
5.0 Software: ltf2txt
A data file in ltf.xml format (as described above) can be conditioned to
recreate exactly the "raw source data" text stream (the rsd.txt file) from
which the LTF was created. The tool described here can be used to apply that
conditioning to a directory containing ltf.xml data. The script validates
each output rsd.txt stream by comparing its MD5 checksum against the reference
MD5 checksum of the original rsd.txt file from which the LTF was created.
(This reference checksum is stored as an attribute of the "DOC" element in the
ltf.xml structure; there is also an attribute that stores the character count
of the original rsd.txt file.)
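The validation step can be sketched in Python as follows. This is an
illustration of the idea only, not a port of the Perl script: it assumes the
checksum was computed over the UTF-8 encoding of the text and that the stored
character count is a count of Unicode code points. The names of the "DOC"
attributes holding the reference values are not given in this README and
should be taken from the LTF DTD under ./dtds.

```python
import hashlib


def matches_reference(text, ref_md5, ref_char_count=None):
    """Check reconstructed text against a reference MD5 and character count.

    `ref_md5` is a hexadecimal digest string; `ref_char_count` is optional.
    """
    if ref_char_count is not None and len(text) != ref_char_count:
        return False
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest == ref_md5.lower()


# Self-contained demonstration: the "reference" values are computed here
# rather than read from an ltf.xml file.
text = "Hello world.\n"
reference = hashlib.md5(text.encode("utf-8")).hexdigest()
print(matches_reference(text, reference, len(text)))
print(matches_reference(text + " ", reference, len(text)))
```

Checking the character count first gives a cheap early failure before the
digest is computed.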
The script -- ltf2rsd.perl -- contains user documentation as part of the
script content; you can run "perldoc" to view the documentation as a typical
unix man page, or you can simply open the script in any text viewer to read
the documentation. Also, running the script without any command-line
arguments will cause it to display a one-line synopsis of its usage, and
then exit.
6.0 Differences Relative to Earlier Language Pack Releases
In the process of designing and preparing the present corpus, we have taken
steps to simplify the format and improve the usability of the Phrasebook and
Elicitation data. As part of that process, we also resolved some problems
in certain data files as originally released in a few of the Representative
Language Pack corpora.
Regarding both Elicitation and Phrasebook data, the original Language Pack
releases presented these in "ltf.xml" format (with associated "psm.xml"
files). Owing to the quantity of segments in these files, the absence of any
structure above the level of segments (e.g. paragraph or discussion-forum comment
boundaries), and the importance of comparing data across multiple languages,
we have converted these files to a simple, tab-delimited table format.
For Elicitation data, we provide both a "template" file (under "data/eng/")
and the version of that template adapted for translation into each language --
i.e.:
data/eng/elicitation/elicitation_template.tab
data/{lng}/elicitation/elicitation.eng_for_{lng}.tab
In the original Language Packs, the file containing the language-adapted
version of the English phrases was presented in ltf.xml format, and did not
include the contextual information provided to translators (some Packs didn't
include the unadapted English template file, so the contextual data were
entirely missing). By presenting the data in tabular format, we are now able
to include the context as a third column in each of the language-adapted
English files. (The translated elicitation files contain only two columns,
excluding the context data.)
The conversion of elicitation data from ltf.xml to tabular format also
uncovered some problems in four of the original Language Packs: in Swahili,
Tamil, Thai and Wolof, one or two segments were missing from the translated
ltf.xml data, and because these missing segments were somewhere in the middle
of the 2600-segment sequence, a processing error had caused all subsequent
segments to be misaligned relative to the English file. In two of the data
sets, the missing segments were present in the original files received from
translators (but had been excluded from the release due to minor formatting
problems); these are being included for the first time in the present release.
In the other two sets, the translated elicitation table file now has an empty
cell in column 2 for the affected segments, and all subsequent segments in the
file are properly aligned with the English data. The details can be
summarized as follows:
Language      Problem          Resolution
----------------------------------------------------------------
swa/Swahili   2 segs missing   empty translation cells in 2 rows
tam/Tamil     1 seg missing    empty translation cell in 1 row
tha/Thai      1 seg missing    translation found and included
wol/Wolof     2 segs missing   translations found and included
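Users who need to skip the untranslated segments can detect the empty cells
directly. The following Python sketch lists the segment-ids whose second
column is empty or absent in a translated elicitation table; the sample rows
and segment-ids are invented, and treating a row with no second field the
same as an empty cell is an assumption.

```python
def untranslated_ids(lines):
    """Return segment-ids whose translation cell (column 2) is empty."""
    missing = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        translation = fields[1] if len(fields) > 1 else ""
        if not translation.strip():
            missing.append(fields[0])
    return missing


# Invented rows in the 2-column layout (segment-id, translation).
rows = ["e1\tfirst phrase\n", "e2\t\n", "e3\n", "e4\tfourth phrase\n"]
print(untranslated_ids(rows))
```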
One additional problem affected the Wolof elicitation data: an issue involving
character encoding had caused a few hundred Wolof word tokens to contain "?"
(the ASCII question-mark character), where the correct orthography should have
had "ng"; these have all been corrected.
No such issues came up in the phrasebook or news data. News and "cstrans_tab"
XML files (with crowd-sourcing details) have all been copied without
modification from the original Packs into the current release.
7.0 Acknowledgements
This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions,
findings and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of DARPA.
8.0 Citations/References
Alison Alvarez, Lori Levin, Robert Frederking, Simon Fung, Donna Gates,
Jeff Good (2006). The MILE Corpus for Less Commonly Taught Languages. In
Proceedings of the Human Language Technology Conference of the NAACL,
Companion Volume: Short Papers. Association for Computational Linguistics,
pages 5-8.
Stephanie Strassel, Jennifer Tracey (2016). LORELEI Language Packs: Data,
Tools, and Resources for Technology Development in Low Resource Languages.
In: Proceedings of LREC 2016: 10th Edition of the Language Resources and
Evaluation Conference, Portorož, Slovenia, May 23-28.
Jennifer Tracey, Stephanie Strassel, Ann Bies, Zhiyi Song, Michael Arrigo,
Kira Griffitt, Dana Delgado, Dave Graff, Seth Kulick, Justin Mott and Neil
Kuster (2019). Corpus Building for Low Resource Languages in the DARPA
LORELEI Program. In: Proceedings of LoResMT 2019: 2nd Workshop on Technologies
for MT of Low Resource Languages (at MT Summit XVII), Dublin, Ireland, August 20.
Jennifer Tracey, Stephanie Strassel (2020). Basic Language Resources for 31
Languages (Plus English): The LORELEI Representative and Incident Language Packs.
In: Proceedings of the Language Resources and Evaluation Conference, LREC 2020,
Marseille, France, postponed from May 16-20.
9.0 Copyright
© 2002-2007, 2009-2010 Agence France Presse,
© 2000 American Broadcasting Company, © 2000 Cable News Network LP, LLLP,
© 2008 Central News Agency (Taiwan), © 1989 Dow Jones & Company, Inc.,
© 2005 Los Angeles Times - Washington Post News Service, Inc.,
© 2000 National Broadcasting Company, Inc., © 1999, 2005, 2006, 2010 New York Times,
© 2000 Public Radio International, © 2003, 2005-2008, 2010 The Associated Press,
© 2003, 2005-2008 Xinhua News Agency, © 2014 Trustees of the University of Pennsylvania
10.0 Contacts
Ann Bies -