README FILE FOR LDC CATALOG ID: LDC2020T24

TITLE: LORELEI Ukrainian Representative Language Pack

AUTHORS: Jennifer Tracey, Stephanie Strassel, Dave Graff, Jonathan
         Wright, Song Chen, Neville Ryant, Xiaoyi Ma, Seth Kulick,
         Dana Delgado, Michael Arrigo


1.0 Introduction

This corpus was developed by the Linguistic Data Consortium for the
DARPA LORELEI Program and consists of over 111 million words of
monolingual Ukrainian text, approximately 700,000 words of which are
translated into English.  Another 86,000 Ukrainian words are also
translated from English data, and an additional 174,000 words of found
parallel text and over 2,000,000 words of comparable text are
provided. Approximately 75,000 words are annotated for named entities,
and up to 50,000 words with several additional types of annotation
(full entity including nominals and pronouns, situation frame
annotation, and entity linking). Details of data volumes for each type
of annotation are provided in section 3 of this README.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource
languages in the context of emergent situations like natural disasters
or disease outbreaks. Linguistic resources for LORELEI include
Representative Language Packs for over two dozen low resource languages,
comprising data, annotations, basic natural language processing tools,
lexicons and grammatical resources. Representative languages are
selected to provide broad typological coverage, while Incident
Languages are selected to evaluate system performance on a language
whose identity is disclosed at the start of the evaluation, and for
which no training data has been provided.

This corpus comprises the complete set of monolingual and parallel
text, lexicon, annotations, and tools from the LORELEI Ukrainian
Representative Language Pack. Because the Ukrainian language pack
began as an Incident Language Pack, there are a few unusual features
in this corpus compared to other Representative Language Packs in
LORELEI:

1. The convention in LORELEI was to refer to Incident Languages by a
numeric id rather than the language name, so the language code portion
of the document ids uses IL4 instead of UKR.

2. Comparable text was included in Incident Language Packs when
sufficient existing parallel text could not be found, but is generally
not included in Representative Language Packs. Because we had already
created comparable text for Ukrainian, we included it in this
Representative Language Pack.

3. No standard LORELEI lexicon is provided in this Representative
Language Pack; however, a simple translation lexicon (bilingual
wordlist) and pointers to additional lexical and grammatical resources
such as are usually found in an Incident Language Pack are provided.

4. Two independent professional translations are provided for the set
of documents that would have been used as references for the
evaluation set if this were configured as an Incident Language
Pack. Other professionally translated documents that were added to
meet the target volume for a Representative Language have only one
translation.

For more information about LORELEI language resources, see
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2016-lorelei-language-packs.pdf.


2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized
below -- paths shown are relative to the base (root) directory of the
package:

./docs/README.txt  -- this file

./dtds/
./dtds/clusters.v1.0.dtd
./dtds/ltf.v1.5.dtd
./dtds/psm.v1.0.dtd
./dtds/sentence_alignment.v1.0.dtd
./dtds/laf.v1.2.dtd

./docs/  -- various tables and listings (see section 9 below)
./docs/annotation_guidelines/ -- guidelines for all annotation tasks included in this corpus
./docs/grammatical_sketch/ -- grammatical sketch of Ukrainian
./docs/categoryI_dictionary/ -- translation lexicon and pointers to online bilingual dictionary resources for Ukrainian-English
./docs/categoryII/ -- pointers to online resources for grammars, dictionaries, and gazetteers for Ukrainian, and a gazetteer of Ukraine from www.geonames.org

./tools/ -- see section 8 below for details about tools provided
./tools/ldclib/
./tools/ltf2txt/
./tools/sent_seg/
./tools/tokenization_parameters.v5.0.yaml
./tools/il4/

./data/monolingual_text/zipped/ -- zip-archive files containing
                                   monolingual "ltf" and "psm" data

./data/translation/
   comparable/{il4,eng,clusters} -- comparable text with document-level clustering
   found/{il4,eng,sentence_alignment} -- found parallel text
                                         with sentence alignments between the
                                         Ukrainian and English documents
   from_il4/{il4,eng}/     -- translations from Ukrainian to English
   from_eng/               -- translations from English to Ukrainian
    {elicitation,news,phrasebook}/    for each of three types of English data:
     {il4,eng}/                   for each language in each direction,
                                  "ltf" and "psm" directories contain
                                  corresponding data files

./data/annotation/ -- see section 5 below for details about annotation
./data/annotation/entity/
./data/annotation/situation_frame/
./data/annotation/twitter_tokenization/

2.2 File Name Conventions

There are 19 *.ltf.zip files in the monolingual_text/zipped directory,
together with the same number of *.psm.zip files.  Each {ltf,psm}.zip file
pair contains an equal number of corresponding data files.  The "file-ID"
portion of each zip file name corresponds to common substrings in the file
names of all the data files contained in that archive.  For example:

./data/monolingual_text/zipped/IL4_DF_G0040F.ltf.zip contains:
      ltf/IL4_DF_020072_20030212_G0040FIVA.ltf.xml
      ltf/IL4_DF_020072_20030213_G0040FL5C.ltf.xml
      ...

./data/monolingual_text/zipped/IL4_DF_G0040F.psm.zip contains:
      psm/IL4_DF_020072_20030212_G0040FIVA.psm.xml
      psm/IL4_DF_020072_20030213_G0040FL5C.psm.xml
      ...

The file names assigned to individual documents within the zip archive
files provide the following information about the document:

   Language  3-letter abbrev.
   Genre     2-letter abbrev.
   Source    6-digit numeric ID assigned to data provider
   Date      8-digit numeric: YYYYMMDD (year, month, day)
   Global-ID 9-digit alphanumeric assigned to this document

Those five fields are joined by underscore characters, yielding a
32-character file-ID; three portions of the document file-ID are used
to set the name of the zip file that holds the document: the Language
and Genre fields, and the first 6 characters of the Global-ID.

The 2-letter codes used for genre are as follows:

   DF -- discussion forum
   NW -- news
   RF -- reference (e.g. Wikipedia)
   SN -- social network (Twitter)
   WL -- web-log
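
As an illustration of the naming scheme above, the following Python
sketch splits a document file-ID into its five fields and derives the
stem of the zip archive name that would hold it.  The function names
and the GENRES table are ours and are not part of the release.

```python
# Illustrative sketch only; the helper names are hypothetical.
GENRES = {
    "DF": "discussion forum",
    "NW": "news",
    "RF": "reference",
    "SN": "social network",
    "WL": "web-log",
}

def parse_file_id(file_id):
    """Split a file-ID into its five underscore-joined fields."""
    lang, genre, source, date, global_id = file_id.split("_")
    return {
        "language": lang,
        "genre": genre,
        "source": source,
        "date": date,
        "global_id": global_id,
    }

def zip_stem(file_id):
    """Join Language, Genre, and the first 6 characters of the Global-ID."""
    f = parse_file_id(file_id)
    return "_".join([f["language"], f["genre"], f["global_id"][:6]])

fields = parse_file_id("IL4_DF_020072_20030212_G0040FIVA")
print(fields["date"], GENRES[fields["genre"]])
print(zip_stem("IL4_DF_020072_20030212_G0040FIVA"))  # IL4_DF_G0040F
```

The result matches the archive name shown in the example above
(IL4_DF_G0040F.ltf.zip and IL4_DF_G0040F.psm.zip).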


3.0 Content Summary

3.1 Monolingual Text

Genre   #Docs     #Tokens
NW      145,836   31,442,683
DF      113,661   80,782,078
WL           44       29,270
SN       14,025      193,436

Note that the SN (Twitter) data cannot be distributed directly by LDC, due to
the Twitter Terms of Use.  The file "docs/twitter_info.tab" (described in
Sections 8.2 and 9.0 below) provides the necessary information for users to
fetch the particular tweets directly from Twitter.

3.2 Parallel Text

Type        Genre   #Docs   #Segs     #Wrds
---
Comparable  DF        831   209,743   2,465,811
---
Found       NW        413     8,044     169,888
---
FromEng     NW        190     3,912      71,152
FromEng     EL          2     3,723      15,606
---
ToEng       DF        772    36,375     348,951
ToEng       NW      1,279    21,659     319,216
ToEng       WL         24       999      11,291
---

3.3 Annotation

Type            Genre   #Docs      #Segs    #Wrds
---
SituationFrame  DF         20        725     6,997
SituationFrame  NW         53      1,035    16,226
SituationFrame  SN      1,325      1,325    17,478
SituationFrame  WL         14        679     7,562
---
EntityFull      DF         19        573     5,516
EntityFull      NW         53        969    15,307
EntityFull      WL         10        314     3,783
---
EntitySimp      DF         43      1,687    16,196
EntitySimp      NW         84      1,589    24,983
EntitySimp      SN      1,717      1,717    22,502
EntitySimp      WL         15        771     8,542
---
EntityLinking   DF         20        725     6,997
EntityLinking   NW         53      1,035    16,226
EntityLinking   SN      1,325      1,325    17,478
EntityLinking   WL         14        679     7,562
---


4.0 Data Collection and Parallel Text Creation

Both monolingual text collection and parallel text creation involve a
combination of manual and automatic methods. These methods are
described in the sections below.

4.1 Monolingual Text Collection

Data is identified for collection by native speaker "data scouts," who
search the web for suitable sources, designating individual documents
that are in the target language and discuss the topics of interest to
the LORELEI program (humanitarian aid and disaster relief). Each
document selected for inclusion in the corpus is then harvested, along
with the entire website when suitable. Thus the monolingual text
collection contains some documents which have been manually selected
and/or reviewed and many others which have been automatically
harvested and were not subject to manual review.

4.2 Parallel Text Creation

Parallel text for LORELEI was created using three different methods,
and each LORELEI language may have parallel text from one or all of
these methods. In addition to translation from each of the LORELEI
languages to English, each language pack contains a "core" set of
English documents that were translated into each of the LORELEI
Representative Languages. These documents consist of news documents, a
phrasebook of conversational sentences, and an elicitation corpus of
sentences designed to elicit a variety of grammatical structures. All
translations are aligned at the sentence level. For professional and
crowdsourced translation, the segments align one-to-one between the
source and target language (i.e. segment 1 in the source aligns with
segment 1 in the translation). For found parallel text, automatic
alignment is performed and a separate alignment file provides
information about how the segments in the source and translation are
aligned.

Professionally translated data generally has one translation for each
source document (see item 4 in section 1.0 for the exception in this
corpus), while crowdsourced translations have up to four translations
for each source document, designated by A, B, C, or D appended to the
file name on the multiple translation versions.

4.3 Comparable Text Creation

LDC used the results from two clustering techniques:

    (1) Kutuzov et al. (https://arxiv.org/abs/1604.05372) for
            multilingual document clustering on English and Ukrainian.

    (2) Cosine similarity for monolingual document clustering on
            English that was later augmented with Ukrainian documents.

For both approaches, clustering was run on the tokenized text of the
documents as found in the LTF versions.  The documents were divided
into different sets, where each set includes all documents with dates
that span two weeks (the weeks do not overlap).  The final comparable
text clusters consist of English and Ukrainian documents that were clustered
from both approaches that fall within the same time period.

Note: Some documents may appear in multiple clusters.

The cluster files have names patterned as follows:

    YYYY-MM-DD_YYYY-MM-DD.clusters.xml

where the dates represent the beginning and end dates of the time span
during which the data files in that cluster were authored. The xml
structure in each cluster file consists of one or more "cluster"
elements, each of which contains some quantity of "doc" elements from
each language.
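
The time span encoded in a cluster file name can be recovered with a
few lines of Python.  This is a minimal sketch; the function name and
the file name used in the example are invented for illustration and
do not refer to an actual file in the release.

```python
from datetime import date

SUFFIX = ".clusters.xml"

def cluster_span(file_name):
    """Return (begin, end) dates from YYYY-MM-DD_YYYY-MM-DD.clusters.xml."""
    if not file_name.endswith(SUFFIX):
        raise ValueError("not a cluster file name: %r" % file_name)
    stem = file_name[: -len(SUFFIX)]
    begin, end = stem.split("_")
    return date.fromisoformat(begin), date.fromisoformat(end)

# Hypothetical file name, invented for this example
begin, end = cluster_span("2014-03-01_2014-03-14.clusters.xml")
print(begin, end, (end - begin).days + 1)
```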


5.0 Annotation

Four types of annotation are present in this corpus. Simple Named
Entity tags names of persons, organizations, geopolitical entities,
and locations (including facilities), while Full Entity also tags
nominal and pronominal mentions of entities. Entity Discovery and
Linking provides cross-document coreference of named entities via
linking to an external knowledge base (the knowledge base used for
LORELEI is released separately as LDC2020T10). Situation Frame annotation labels
the presence of needs and issues related to emergent incidents such as
natural disasters (e.g. food need, civil unrest), along with
information such as location, urgency, and entities involved in
resolving the needs. Details about each of these
annotation tasks can be found in docs/annotation_guidelines/.


6.0 Data Processing and Character Normalization for LORELEI

Most of the content has been harvested from various web sources using
an automated system that is driven by manual scouting for relevant
material.  Some content may have been harvested manually, or by means
of ad-hoc scripted methods for sources with unusual attributes.

All harvested content was initially converted from its original HTML
form into a relatively uniform XML format; this stage of conversion
eliminated irrelevant content (menus, ads, headers, footers, etc.),
and placed the content of interest into a simplified, consistent
markup structure.

The "homogenized" XML format then served as input for the creation of
a reference "raw source data" (rsd) plain text form of the web page
content; at this stage, the text was also conditioned to normalize
white-space characters, and to apply transliteration and/or other
character normalization, as appropriate to the given language.


7.0 Overview of XML Data Structures

7.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set
of tags needed to represent the structure of the relevant text as seen
by the human web-page reader.  When the text content of the XML file
is extracted to create the "rsd" format (which contains no markup at
all), the markup structure is preserved in a separate "primary source
markup" (psm.xml) file, which enumerates the structural tags in a
uniform way, and indicates, by means of character offsets into the
rsd.txt file, the spans of text contained within each structural
markup element.

For example, in a discussion-forum or web-log page, there would be a
division of content into the discrete "posts" that make up the given
thread, along with "quote" regions and paragraph breaks within each
post.  After the HTML has been reduced to uniform XML, and the tags
and text of the latter format have been separated, information about
each structural tag is kept in a psm.xml file, preserving the type of
each relevant structural element, along with its essential attributes
("post_author", "date_time", etc.), and the character offsets of the
text span comprising its content in the corresponding rsd.txt file.

7.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a
fully segmented and tokenized version of the text content for a given
web page.  Segments (sentences) and the tokens (words) are marked off
by XML tags (SEG and TOKEN), with "id" attributes (which are only
unique within a given XML file) and character offset attributes
relative to the corresponding rsd.txt file; TOKEN tags have additional
attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly
by suitable punctuation in the original source data.  To the extent
that sentence boundaries cannot be accurately detected (due to
variability or ambiguity in the source data), the segmentation process
will tend to err more often on the side of missing actual sentence
boundaries, and (we hope) less often on the side of asserting false
sentence breaks.

The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play
particular roles in web-based text (e.g. URLs, email addresses and
hashtags).  To the extent that word boundaries are not explicitly
marked in the source text, the LTF tokenization is intended to divide
the raw-text character stream into units that correspond to "words" in
the linguistic sense (i.e. basic units of lexical meaning).
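
The following Python sketch shows one way to read segments and tokens
from LTF data with the standard library.  The miniature document is
invented for illustration; the SEG, TOKEN, ORIGINAL_TEXT and DOC
element names come from this README, while the root element name and
the "start_char" / "end_char" attribute names are assumptions that
should be checked against ./dtds/ltf.v1.5.dtd.  End offsets are
treated as inclusive.

```python
import xml.etree.ElementTree as ET

# Invented sample; attribute names other than "id" are assumptions.
SAMPLE = """<LCTL_TEXT>
 <DOC id="EXAMPLE_DOC">
  <TEXT>
   <SEG id="segment-0" start_char="0" end_char="11">
    <ORIGINAL_TEXT>Hello world.</ORIGINAL_TEXT>
    <TOKEN id="token-0-0" start_char="0" end_char="4">Hello</TOKEN>
    <TOKEN id="token-0-1" start_char="6" end_char="10">world</TOKEN>
    <TOKEN id="token-0-2" start_char="11" end_char="11">.</TOKEN>
   </SEG>
  </TEXT>
 </DOC>
</LCTL_TEXT>"""

def read_segments(ltf_xml):
    """Return a list of (segment id, [(token id, start, end, text), ...])."""
    root = ET.fromstring(ltf_xml)
    segments = []
    for seg in root.iter("SEG"):
        tokens = [
            (tok.get("id"), int(tok.get("start_char")),
             int(tok.get("end_char")), tok.text)
            for tok in seg.iter("TOKEN")
        ]
        segments.append((seg.get("id"), tokens))
    return segments

for seg_id, tokens in read_segments(SAMPLE):
    print(seg_id, [text for _, _, _, text in tokens])
```

Because the "id" values are only unique within one file, any index
built across files would need to pair them with the document ID.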

7.3 LAF.xml -- Logical Annotation Format Data

The "laf.xml" data format provides a generic structure for presenting
annotations on the text content of a given ltf.xml file; see the associated
DTD file in the "dtds" directory.  Note that each type of annotation (simple
named entity, full entity, semantic structure, NP chunking) uses the basic XML
elements of LAF in different ways.

NB: For Twitter data, the LDC interprets the Twitter Terms of Use to
mean that no original text content from tweets may be redistributed as
part of an LDC corpus.  Therefore, the EXTENT elements of
*_SN_000370_*.laf.xml files are presented here with underscore
characters ('_') in place of all non-white-space characters in
annotated strings.  In order to get the actual text content for these
strings, users must download and process each tweet into plain-text
format (using the software provided in the "tools" directory or
equivalent), and use the character-offset information in the EXTENT
tag to acquire the annotated string.  A correct result from this
process can only be assured if the user's plain-text file for the
given tweet has an MD5 signature that matches the one given in the
corresponding ltf.xml file. In order to ensure that users can match
the tokenization present in the annotated version of any Twitter data,
a version of the ltf.xml files for annotated tweets with underscore
characters ('_') in place of all non-white-space characters is
provided in the data/annotation/twitter_tokenization/ directory.
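
Once a tweet has been downloaded and conditioned, the annotated string
for a masked extent can be recovered from the character offsets.  The
Python sketch below assumes that offsets count Unicode characters from
zero and that the end offset is inclusive (consistent with the sample
in section 7.5, where a 10-character mention has extents 22-31).  The
function name and the sample text are invented for illustration.

```python
def unmask_extent(rsd_text, start_char, end_char, masked):
    """Return the real string for an underscore-masked annotation extent."""
    real = rsd_text[start_char:end_char + 1]  # end offset taken as inclusive
    if len(real) != len(masked):
        raise ValueError("offsets fall outside the text")
    for r, m in zip(real, masked):
        # White space is preserved in the masked string; all other
        # characters were replaced by '_'.
        if m.isspace() != r.isspace():
            raise ValueError("text does not line up with the masked extent")
    return real

# Invented example text standing in for a user-downloaded tweet
tweet = "Flooding reported in New Town today"
print(unmask_extent(tweet, 21, 28, "___ ____"))  # New Town
```

The white-space check is only a sanity test; the MD5 comparison
described above remains the authoritative way to confirm that the
user's text matches the version that was annotated.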

7.4 Situation Frame Annotation Tables

Situation frame annotation consists of three parts, each presented as a
separate tab-delimited file: entities, needs, and issues. The details of each
table are described below.

Entities, mentions, need frames, and issue frames all have IDs that follow a
standard schema consisting of a prefix designating the type of ID ('Ent' for
entities, 'Men' for mentions, and 'Frame' for both need and issue frames), an
alphanumeric string identifying the annotation "kit", and a numeric string
uniquely identifying the specific entity, mention, or frame within the
document.

7.4.1 Mentions

The grouping of entity mentions into "selectable entities" for situation frame
annotation is provided in the mentions/ subdirectory. The table has 8 columns
with the following headers and descriptions:

column 1: doc_id -- doc ID of source file for the annotation
column 2: entity_id -- unique identifier for each grouped entity
column 3: mention_id -- unique identifier for each entity mention
column 4: entity_type -- one of PER, ORG, GPE, LOC
column 5: mention_status -- 'representative' or 'extra';
          representative mentions are the ones which have been chosen by the
          annotator as the representative name for that entity. Each entity
          has exactly one representative mention.
column 6: start_char -- character offset for the start of the mention
column 7: end_char -- character offset for the end of the mention
column 8: mention_text -- mention string

7.4.2 Needs

Annotation of need frames is provided in the needs/ subdirectory. Each row in
the table represents a need frame in the annotated document. The table has 13
columns with the following headers and descriptions:

column 1: user_id -- user ID of the annotator
column 2: doc_id -- doc ID of source file for the annotation
column 3: frame_id -- unique identifier for each frame
column 4: frame_type -- 'need'
column 5: need_type -- exactly one of 'evac' (evacuation), 'food' (food
          supply), 'search' (search/rescue), 'utils' (utilities, energy, or
          sanitation), 'infra' (infrastructure), 'med' (medical assistance),
          'shelter' (shelter), or 'water' (water supply)
column 6: place_id -- entity ID of the LOC or GPE entity identified as the
          place associated with the need frame; only one place value per
          need frame, must match one of the entity IDs in the corresponding
          ent_output.tsv or be 'none' (indicating no place was named)
column 7: proxy_status -- 'True' or 'False'
column 8: need_status -- 'current', 'future' (future only), or 'past' (past only)
column 9: urgency_status -- 'True' (urgent) or 'False' (not urgent)
column 10: resolution_status -- 'sufficient' or 'insufficient' (insufficient /
           unknown sufficiency)
column 11: reported_by -- entity ID of one or more entities reporting
           the need; multiple values are comma-separated, must match entity IDs
           in the corresponding ent_output.tsv or be 'none'
column 12: resolved_by -- entity ID of one or more entities resolving
           the need; multiple values are comma-separated, must match entity IDs
           in the corresponding ent_output.tsv or be 'none'
column 13: description -- string of text entered by the annotator as
           memory aid during annotation, no requirements for content or language,
           may be 'none'
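
A need table can be loaded with Python's csv module.  The sketch below
assumes that the first line of each file carries the 13 column headers
listed above; the sample row, including the form of its IDs, is
invented for illustration.

```python
import csv
import io

# Invented one-row table using the column headers listed above
SAMPLE = (
    "user_id\tdoc_id\tframe_id\tframe_type\tneed_type\tplace_id\t"
    "proxy_status\tneed_status\turgency_status\tresolution_status\t"
    "reported_by\tresolved_by\tdescription\n"
    "u1\tDOC1\tFrame-k1-1\tneed\twater\tEnt-k1-3\tFalse\tcurrent\tTrue\t"
    "insufficient\tEnt-k1-4,Ent-k1-5\tnone\tnone\n"
)

def id_list(value):
    """Turn a comma-separated ID field into a list; 'none' gives []."""
    return [] if value == "none" else value.split(",")

def read_needs(handle):
    """Read need frames from an open tab-delimited file."""
    frames = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["urgent"] = row["urgency_status"] == "True"
        row["reported_by"] = id_list(row["reported_by"])
        row["resolved_by"] = id_list(row["resolved_by"])
        frames.append(row)
    return frames

for frame in read_needs(io.StringIO(SAMPLE)):
    print(frame["frame_id"], frame["need_type"], frame["reported_by"])
```

For a real file, pass a handle opened with encoding="utf-8" and
newline="" in place of the io.StringIO object.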

7.4.3 Issues

Annotation of issue frames is provided in the issues/ subdirectory.  Each row
in the table represents an issue frame in the annotated document. The table has
9 columns with the following headers and descriptions:

column 1: user_id -- user ID of the annotator
column 2: doc_id -- doc ID of source file for the annotation
column 3: frame_id -- unique identifier for each frame
column 4: frame_type -- 'issue'
column 5: issue_type -- exactly one of 'regimechange' (regime change),
          'crimeviolence' (civil unrest or widespread crime), or 'terrorism'
          (terrorism or other extreme violence)
column 6: place_id -- entity ID of the LOC or GPE entity identified as
          the place associated with the issue frame; only one place value per
          issue frame, must match one of the entity IDs in the corresponding
          ent_output.tsv or be 'none'
column 7: proxy_status -- 'True' or 'False'
column 8: issue_status -- 'current' or 'not_current'
column 9: description -- string of text entered by the annotator as
          memory aid during annotation, no requirements for content or
          language, may be 'none'

7.5 EDL Table

The "./data/annotation/entity/" directory contains the file "il4_edl.tab", which
has an initial "header" line of column names followed by data rows with 8
columns per row.  The following shows the column headings and a sample value
for each column:

column 1: system_run_id   LDC
column 2: mention_id      Men-IL4_DF_020072_20150611_G0040F6BR-7
column 3: mention_text    Василькові
column 4: extents         IL4_DF_020072_20150611_G0040F6BR:22-31
column 5: kb_id           690405
column 6: entity_type     GPE
column 7: mention_type    NAM
column 8: confidence      1.0

When column 5 is fully numeric, it refers to a numbered entity in the
LORELEI Entity Detection and Linking Knowledge Base (distributed separately 
as LDC2020T10).  Note that a given mention may be ambiguous as to the 
particular KB element it represents; in this case, two or more numeric KB_ID
values will appear in column 5, separated by the vertical-bar character (|).

When column 5 consists of "NIL" plus digits, it refers to an entity that is
not present in the Knowledge Base, but this label is used consistently for all
mentions of the particular entity.
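
The column conventions above can be applied as in the following
Python sketch, which parses one data row, splits the extents field
into document ID and offsets, and separates multiple KB IDs.  The
function and key names are ours; the row is assembled from the sample
values listed above.

```python
def parse_edl_row(line):
    """Parse one tab-delimited data row of il4_edl.tab into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 8:
        raise ValueError("expected 8 columns, got %d" % len(cols))
    run_id, mention_id, text, extents, kb_id, etype, mtype, conf = cols
    # Extents have the form DOC_ID:START-END
    doc_id, span = extents.rsplit(":", 1)
    start, end = (int(x) for x in span.split("-"))
    # Ambiguous mentions carry several KB IDs separated by '|'
    kb_ids = kb_id.split("|")
    return {
        "mention_id": mention_id,
        "mention_text": text,
        "doc_id": doc_id,
        "start": start,
        "end": end,
        "kb_ids": kb_ids,
        # Fully numeric IDs point into the KB; NIL-prefixed IDs do not
        "in_kb": all(k.isdigit() for k in kb_ids),
        "entity_type": etype,
        "mention_type": mtype,
        "confidence": float(conf),
    }

# Row assembled from the sample values listed above
ROW = "\t".join([
    "LDC", "Men-IL4_DF_020072_20150611_G0040F6BR-7", "Василькові",
    "IL4_DF_020072_20150611_G0040F6BR:22-31", "690405", "GPE", "NAM", "1.0",
])
rec = parse_edl_row(ROW)
print(rec["doc_id"], rec["start"], rec["end"], rec["in_kb"])
```

Note that in the sample row the mention text is 10 characters long and
the extents are 22-31, so the end offset is inclusive.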


8.0 Software tools included in this release

8.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned
to recreate exactly the "raw source data" text stream (the rsd.txt
file) from which the LTF was created.  The tools described here can be
used to apply that conditioning, either to a directory or to a zip
archive file containing ltf.xml data.  In either case, the scripts
validate each output rsd.txt stream by comparing its MD5 checksum
against the reference MD5 checksum of the original rsd.txt file from
which the LTF was created.  (This reference checksum is stored as an
attribute of the "DOC" element in the ltf.xml structure; there is also
an attribute that stores the character count of the original rsd.txt
file.)
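
The validation step can be sketched in Python as follows.  This is not
the code used by the Perl scripts; it assumes the reference checksum is
the hexadecimal MD5 of the UTF-8 encoded text and that the character
count is a count of Unicode characters.  The function name and the
sample text are invented for illustration.

```python
import hashlib

def rsd_matches(rsd_text, ref_md5, ref_char_count=None):
    """Check a recreated rsd.txt stream against its reference values."""
    if ref_char_count is not None and len(rsd_text) != ref_char_count:
        return False
    # Assumption: the checksum was computed over UTF-8 encoded bytes
    digest = hashlib.md5(rsd_text.encode("utf-8")).hexdigest()
    return digest == ref_md5.lower()

# Invented text; the reference values are computed here rather than
# taken from a real ltf.xml file.
text = "Hello world.\n"
ref = hashlib.md5(text.encode("utf-8")).hexdigest()
print(rsd_matches(text, ref, len(text)))
print(rsd_matches(text + " ", ref))
```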

Each script contains user documentation as part of the script content;
you can run "perldoc" to view the documentation as a typical unix man
page, or you can simply view the script content directly by whatever
means to read the documentation.  Also, running either script without
any command-line arguments will cause it to display a one-line
synopsis of its usage, and then exit.

   ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)

   ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

8.2 ldclib -- general text conditioning, twitter harvesting

The "bin/" subdirectory of this package contains three executable scripts
(written in Ruby):

   create_rsd.rb -- convert general xml or plain-text formats to "raw source
                    data" (rsd.txt), by removing markup tags and applying
                    sentence segmentation

   token_parse.rb -- convert rsd.txt format into ltf.xml

   get_tweet_by_id.rb -- download and condition Twitter data

Due to the Twitter Terms of Use, the text content of individual tweets
cannot be redistributed by the LDC.  As a result, users must download
the tweet contents directly from Twitter and condition/normalize the
text in a manner equivalent to what was done by the LDC, in order to
reproduce the Ukrainian raw text that was used by LDC for annotation (to
be released separately).  The twitter-processing software provided in
the tools/ directory enables users to perform this normalization and
ensure that the user's version of the tweet matches the version used
by LDC, by verifying that the md5sum of the user-downloaded and
processed tweet matches the md5sum provided in the twitter_info.tab
file. Users must have a developer account with Twitter in order to
download tweets, and the tool does not replace or circumvent the
Twitter API for downloading tweets.

The ./docs/twitter_info.tab file provides the twitter download id for each
tweet, along with the LORELEI file name assigned to that tweet and the
md5sum of the processed text from the tweet.

The file "README.md" in this directory provides details on how to install and
use the source code in this directory in order to condition text data that the
user downloads directly from Twitter and produce both the normalized raw text
and the segmented, tokenized LTF.xml output.

All LDC-developed supporting files (models, configuration files, library
modules, etc.) are included, either in the "lib" subdirectory (next to
"bin"), or else in the parent ("tools") directory.

Please refer to the README.md file that accompanies this software package.

8.3 sent_seg -- apply sentence segmentation to raw text

The Python tools in this directory are used as part of the conditioning done
by "create_rsd.rb" in the "ldclib" package.  Please refer to the README.rst
file included with the package.

8.4 ne-tagger -- Named-Entity tagger for Ukrainian

Please refer to the ./tools/il4/ne-tagger/README.rst file for information about
usage and performance.


9.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release)
contains five files documenting various characteristics of the source
data:

char_tally.IL4.tab - contains tab-separated columns: doc uid, number of
non-whitespace characters, number of non-whitespace characters in the
expected script, and number of anomalous (non-printing) characters for
each document in the release

source_codes.txt - contains tab-separated columns: genre, source code,
source name, and base url for each source in the release

twitter_info.tab - contains tab-separated columns: doc uid, tweet id,
normalized md5 of the tweet text, and tweet author id for all tweets in
the release

urls.tab - contains tab-separated columns: doc uid and url.

il4_partitions.tab - contains information about the set partitions for
the Incident Language Pack configuration of the data in this
corpus. This file is a tab-delimited flat table of two columns with
147993 rows.  The first row is a header with column labels
("partition" and "file_id"); the remaining rows show the intended use
of the associated document in an evaluation. Although there are
Twitter documents listed in the MT partitions, due to Twitter's terms
of use, no translations of tweets are included in this corpus. The
partition labels are:

  set0        "pre-incident" data for use as system training material
  setE        "post-incident" data to be processed by systems for scorable output
  setE,MT     post-incident data for which English translations are provided
  setE,MT,el  translated post-incident data with Entity Linking and Situation Frame annotation
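
A tally of documents per partition label can be produced with a short
Python sketch such as the one below.  It relies only on the two column
labels named above; the sample rows and file IDs are invented, and the
helper names are ours.

```python
import csv
import io
from collections import Counter

# Invented rows using the documented header and partition labels
SAMPLE = (
    "partition\tfile_id\n"
    "set0\tDOC_A\n"
    "setE\tDOC_B\n"
    "setE,MT\tDOC_C\n"
    "setE,MT,el\tDOC_D\n"
)

def partition_counts(handle):
    """Count documents per partition label in an open partitions table."""
    counts = Counter()
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        counts[row["partition"]] += 1
    return counts

def has_translation(label):
    """True when the comma-separated label includes the MT component."""
    return "MT" in label.split(",")

counts = partition_counts(io.StringIO(SAMPLE))
for label in sorted(counts):
    print(label, counts[label], has_translation(label))
```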

In addition, the grammatical sketch and annotation guidelines
described in earlier sections of this README are found in this
directory.


10.0 Known Issues

10.1 Double-encoding of XML character-entity references

A number of ltf.xml files in the "./data/translation/comparable" and
"./data/translation/from_il4" sets contain strings like "&amp;gt;", "&amp;lt;",
etc. in the text strings of "<ORIGINAL_TEXT>" and "<TOKEN>" elements.  This
means that when a suitable XML parser is used to extract the raw text content,
the output will include tokens like "&gt;", "&lt;", etc. (instead of the
expected ">", "<", etc.).  This occurs in 137 "comparable" files (129 eng/, 8
il4/), and in 42 "from_il4" files (28 eng/, 14 il4/).
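
One possible workaround, sketched here in Python, is to apply a second
round of entity decoding to the strings returned by the XML parser.
This is a suggestion rather than a procedure used by LDC, and the
function name is ours.

```python
import html

def undo_double_encoding(text):
    """Decode character-entity references left over after XML parsing."""
    return html.unescape(text)

# Text as it would come out of an XML parser for an affected file
print(undo_double_encoding("a &gt; b &amp; c &lt; d"))  # a > b & c < d
```

Two cautions apply.  The decoded string is shorter than the stored
one, so any use of character offsets should be done before decoding.
Also, html.unescape decodes every reference it recognizes, so it is
safest to apply it only to the affected files.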

10.2 Some ./data/translation/from_il4/ source files also present in comparable/ and found/

The "./data/translation/from_il4" set (professional translation) contains some of the
same source documents in IL4/Ukrainian that also yielded "comparable" and
"found" translation data.  This means that the same IL4 source files are
present in two separate directories.  The following file-IDs in "from_il4/il4"
are also present (with identical content) in the other paths indicated below:

  IL4_DF_020072_20100805_G0040G2R8      comparable/il4
  IL4_DF_020072_20100908_G0040HNOL      comparable/il4
  IL4_DF_020072_20110212_G0040FS4H      comparable/il4
  IL4_DF_020073_20100806_G0040HAJV      comparable/il4
  IL4_DF_020073_20100907_G0040HZYT      comparable/il4
  IL4_DF_020073_20100909_G0040HDD5      comparable/il4
  IL4_DF_020073_20101010_G0040HALK      comparable/il4
  IL4_DF_020073_20101015_G0040GGUJ      comparable/il4
  IL4_DF_020073_20101103_G0040HAY7      comparable/il4
  IL4_DF_020073_20110120_G0040GHTU      comparable/il4
  IL4_DF_020073_20110419_G0040HC66      comparable/il4
  IL4_DF_020073_20110518_G0040GGHU      comparable/il4
  IL4_NW_020120_20101101_H0040L4TG      found/il4
  IL4_NW_020120_20120704_H0040L4RN      found/il4
  IL4_NW_020122_20141130_H0040L4W3      found/il4
  IL4_NW_020135_20110929_H0040LEX2      found/il4
  IL4_NW_020143_20110530_H0040LFSV      found/il4
  IL4_NW_020143_20111229_H0040LFS9      found/il4
  IL4_NW_020143_20120227_H0040LFRY      found/il4
  IL4_NW_020143_20120710_H0040LFRA      found/il4
  IL4_NW_020143_20130213_H0040LFQC      found/il4
  IL4_NW_020143_20130420_H0040LFPU      found/il4
  IL4_NW_020143_20131204_H0040LFNC      found/il4


11.0 Acknowledgements

The authors would like to acknowledge the following contributors to
this corpus: Brian Gainor, Ann Bies, Justin Mott, Neil Kuster, Chris
Caruso, University of Maryland Applied Research Laboratory for
Intelligence and Security (ARLIS), formerly UMD Center for Advanced
Study of Language (CASL), and our team of Ukrainian annotators.

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract
No. HR0011-15-C-0123. Any opinions, findings and conclusions or
recommendations expressed in this material are those of the author(s)
and do not necessarily reflect the views of DARPA.


12.0 Copyright

Portions © 2002-2007, 2009-2010 Agence France Presse, © 2014-2017 Agroindustrial Association of Ukraine, © 2000 American Broadcasting Company, © 2014, 2016 BBC, © 2000 Cable News Network. LP, LLLP, © 2014-2017 CASE Ukraine, © 2008 Central News Agency (Taiwan), © 2016 Channel 5, © 2010-2016 CSLR, © 2012 Daily Lviv, © 2015 Depo.ua, © 1989 Dow Jones & Company, Inc., © 2012 Ecology of Life, © 2016 Espreso.tv, © 2015-2016 euronews, © 2014-2016 European Truth-eurointegration.com.ua, © 2015-2016 expres.online, © 2015-2016 FACTS.ICTV, © 2013, 2016 Gazeta.ua, © 2016 High Castle Publishing House LLC, © 2016 Hromadske Radio, © 2016 ІНА Ukrainian News, © 2014 Information Agency LIGABiznesInform, © 2009–2017 Institute of World Policy, © 2012-2013 iPress.ua, © 2016 JSC Lux Television and Radio Company-Zaxid.net, © 2010-2011, 2016 Keprate Partners, © 2015-2016, Korrespondent.net, © 2016 LB.ua, © 2015-2016 LLC UBT, © 2005 Los Angeles Times-Washington Post News Service, Inc., © 2016  MEDIA-DK PUBLISHING HOUSE LLC, © 2010 Mediastar, © 2016 Mirror of the week-Ukraine, © 2016 MyInforms.com, © 2000 National Broadcasting Company, Inc., © 2015 National Information Systems LLC, © 2016 NavkoloNas.com, ©  1999, 2005, 2006, 2010 New York Times, © 2016  PJSC Lux Television and Radio Company-Radio Maximum, © 2015 Polskie Radio S.A., © 2000 Public Radio International, © 2016 RFE/RL, © 2016 segodnya.ua, © 2010-2017 SFTC Ukrinterenergo, © 2003, 2005-2008, 2010 The Associated Press, © 2016-2017 The National Radio Company of Ukraine, ©  2015-2016 TSN.ua, © 2016 uapress, © 2010-2015 Ukraine Municipal Local Economic Development Project - Federation of Canadian Municipalities, © 2016 Ukrainian National News-news agency, © 2010, 2014-2016 Ukrainian Truth, © 2016 Ukrainians in Portugal-Union of Ukrainians in Portugal, © 2015-2016 Ukrinform, © 2010-2014 UNIAN.NET, © 2016 Week.ua, © 2016 Western Information Corporation, © 2003, 2005-2008 Xinhua News Agency, © 2016 ZNAJ.UA, © 2020 Trustees of the 
University of Pennsylvania


13.0 Contacts

Jennifer Tracey <garjen@ldc.upenn.edu> - LORELEI Project Manager
Stephanie Strassel <strassel@ldc.upenn.edu> - LORELEI PI