README FILE FOR LDC CATALOG ID: LDC2020T22

TITLE: LORELEI Tigrinya Incident Language Pack

AUTHORS: Jennifer Tracey, Dave Graff, Stephanie Strassel, Michael Arrigo,
         Jonathan Wright, Ann Bies

1.0 Introduction

This corpus contains all the text data, annotations and supplemental
resources for the Tigrinya language that were produced for use in the
DARPA LORELEI / LoReHLT 2017 Evaluation, which was conducted by NIST in
August of that year. Detailed information about the corpus content is
provided in section 3 for each of the partitions ("sets") in the corpus.
Combining all sets, the corpus contains approximately 4.5 million words
of monolingual text in Tigrinya, 25,000 words of monolingual text in
English, 235,000 words of parallel and comparable Tigrinya-English text,
and 50,000 words of data annotated for Entity Discovery and Linking and
Situation Frames.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource
languages in the context of emergent situations like natural disasters
or disease outbreaks. Linguistic resources for LORELEI include
Representative Language Packs for more than two dozen low resource
languages, comprising data, annotations, basic natural language
processing tools, lexicons and grammatical resources. Representative
languages are selected to provide broad typological coverage, while
Incident Languages are selected to evaluate system performance on a
language whose identity is disclosed only at the start of the
evaluation, and for which no training data has been provided. This
package comprises all of the resources and test set references for
Tigrinya, which was one of the Program's Incident Languages.

The evaluation protocol is based on a scenario in which some unforeseen
event (the "incident") triggers a need for humanitarian and logistical
support in a region where the predominant language (the "incident
language") is one that has received little or no attention as yet in NLP
research. The objective for evaluation participants is to provide NLP
solutions, including information extraction and machine translation,
based only on limited resources and with very little time for
development. For more information about LORELEI language resources, see
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2016-lorelei-language-packs.pdf.

Each incident language pack has one or more focal incidents (a natural
disaster or other event which might trigger humanitarian needs).
To support the evaluation scenario, the evaluation package contents are
divided into the following subsets:

set0 : "pre-incident" text data and reference resources for the
       language, including monolingual text, dictionaries, grammars, and
       parallel or comparable text (in English and the incident
       language); monolingual and parallel data in this set includes
       documents published prior to the beginning of the earliest focal
       incident and/or reference materials for which publication date is
       not relevant, such as religious materials

setE : "post-incident" text data that forms the basis for scoring NLP
       system performance (using the scoring protocol and software
       developed by NIST); setE consists of monolingual text, along with
       reference translations and annotations

setS : "post-incident" text data in English, including information that
       pertains to the incident itself; this was made available to
       systems after the initial set of scorable outputs had been
       submitted

set1 : supplemental "post-incident" text data, made available after the
       initial set of scorable outputs had been submitted

set2 : a larger set of supplemental "post-incident" text data, made
       available after the second set of scorable outputs had been
       submitted

In addition to these standard LORELEI sets, Tigrinya also features a
small set of data not found in other Incident Language Packs: the Named
Entity and Parallel Text data from the 2007 DARPA REFLEX Language Pack
for Tigrinya are included in their original form in a separate
directory.

Each subset is presented as a directory within the data folder at the
top level of the release package. Tools for data processing are provided
as part of set0 only, but are applicable to all sets.

2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized below
-- paths shown are relative to the base (root) directory of the package:

./docs/README.txt -- this file

./data/set0/
./data/set0/tools/ -- software for data file format conversion
./data/set0/data/  -- monolingual and parallel text directories
./data/set0/dtds/  -- DTDs for all .xml data formats
./data/set0/docs/  -- lexical and grammatical resources, information
                      about various set0 components and properties

./data/set1/data/ -- monolingual text
./data/set1/docs/ -- information about various set1 components and
                     properties

./data/set2/data/ -- monolingual text
./data/set2/docs/ -- information about various set2 components and
                     properties

./data/setS/data/ -- monolingual text
./data/setS/docs/ -- information about various setS components and
                     properties

./data/setE/data/monolingual_text/ -- monolingual text directory
./data/setE/data/annotation/
    il5_edl.tab      -- table of Entity-Detection-Linking annotations
    situation_frame/ -- subdirectories for entity mentions, needs, and
                        issues tables
./data/setE/data/translation/
    eng/
        ltf/ -- ltf.xml files (*.eng_A and *.eng_B versions for each
                Tigrinya doc)
        psm/ -- psm.xml files
    il5/
        ltf/ -- ltf.xml files
        psm/ -- psm.xml files
./data/setE/docs/ -- information about various setE components and
                     properties

./data/REFLEX_Tigrinya/
./data/REFLEX_Tigrinya/data/Named_Entity_Annotations/
    -- original REFLEX named entity annotations, presented in Train and
       Eval subdirectories containing ltf.xml source files and laf.xml
       annotation files; the train/test split was carried over from the
       REFLEX corpus, and thus the "eval" partition had no special
       status within the LORELEI evaluation (these files were not part
       of the LORELEI test set).
./data/REFLEX_Tigrinya/data/Parallel_Text/
    -- directories containing translation to and from English, presented
       in Train and Eval subdirectories; as with the Named Entity
       annotation, the "eval" partition from REFLEX has no special
       status in LORELEI. The training partition contains a
       Special_Corpora directory with translations of a phrasebook and
       elicitation corpus, which are English documents designed to
       elicit conversational sentences (phrasebook) and various
       grammatical and morphological features (elicitation corpus).
       These are similar, though not necessarily identical, to the
       phrasebook and elicitation corpus that appear in LORELEI
       Representative Language Packs.
./data/REFLEX_Tigrinya/dtds/
    -- DTDs for the ltf and laf xml files that appear in this section of
       the corpus. Note that the LTF and LAF formats used in 2007 are
       different from those used in current LORELEI collections. The
       DTDs included with recent LORELEI data releases happen to be
       backwards compatible with REFLEX 2007 xml files, but users will
       notice differences in various element attributes, relative to LTF
       data that results from current processing for LORELEI.
./data/REFLEX_Tigrinya/docs/
    -- format description document and annotation guidelines for the
       REFLEX data

2.2 File Name Conventions

All monolingual text documents are presented as distinct files with
unique file names. For convenience, each file name provides a consistent
set of information about the content of the file via a set of
fixed-width fields, as follows:

 - Language (3 letters)
 - Genre (2 letters)
 - Source (6-digit numeric)
 - Date (8-digit numeric)
 - Unique Index Number (9 alpha-numeric characters)

The language field for all Tigrinya documents uses "IL5" instead of the
ISO code for the language, because the practice in LORELEI was to refer
to incident languages by numeric identifiers, preserving the secrecy of
the language name until the start of the evaluation.

The date field for news reports represents the date of original
publication for the report. Where possible, discussion forum material
uses the date when a given discussion thread was initiated. When date
information is not available or meaningful for a given document, the
date field reflects (roughly) the time at which the content was
initially collected by the LDC, and may be left "incomplete" by setting
the "day" field (last two digits) to zero (e.g. "20140900").

Files containing translations from a source language have the source
language identified in the "Language Code" field of the file name, and
the translation language as a 3-letter extension that immediately
follows the main part of the file name. Pairs of corresponding files in
"found" translation may have distinct identifier strings (one with IL5
in the initial file name field, and one with ENG in that field), if they
were harvested independently of each other and were later found to
contain parallel content. Alternately, some sources of found translation
data present their own source and translated text as a single unit, in
which case the corresponding pair of files will have a single identifier
string, and the English member of the pair will have ".eng" appended. In
the former case, the alignment data specifies how the IL5 and ENG files
are paired.
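For scripting purposes, these fields can be pulled apart mechanically.
Below is a minimal Python sketch (illustrative only, not part of the
release tools), assuming the underscore-delimited layout of the file
names in this corpus; the uid in the commented usage line is
hypothetical:

  # parse_uid.py -- illustrative only, not part of the release tools
  import re

  # The five fixed-width fields of Section 2.2, assumed here to be
  # underscore-delimited. Note that the language field may contain a
  # digit (e.g. "IL5").
  FIELDS = re.compile(
      r"""(?P<language>[A-Za-z0-9]{3})_
          (?P<genre>[A-Za-z]{2})_
          (?P<source>[0-9]{6})_
          (?P<date>[0-9]{8})_
          (?P<index>[A-Za-z0-9]{9})""",
      re.VERBOSE)

  def parse_uid(name):
      """Return the file-name fields as a dict, or None if no match."""
      m = FIELDS.match(name)
      if m is None:
          return None
      info = m.groupdict()
      # A "day" of 00 marks an incomplete, collection-time date.
      info["date_incomplete"] = info["date"].endswith("00")
      return info

  # Hypothetical uid, for illustration only:
  # parse_uid("IL5_NW_000123_20140900_A1B2C3D4E")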
2.3 Genres

Five genres are represented in this data set, as follows:

 NW - news and general text harvested from news sites
 SN - "social network" data (i.e. Twitter)
 WL - weblog and newsgroup data
 DF - discussion forum data
 RF - data from "reference" materials, including religious text,
      government/NGO information sites, etc.

Note that the SN (Twitter) data cannot be distributed directly by LDC,
due to the Twitter Terms of Use. The file "docs/twitter_info.tab"
(described in Section 6.0 below) provides the necessary information for
users to fetch the particular tweets directly from Twitter.

3.0 Content Summary

3.1 Set 0

3.1.1 Monolingual text

Document and token counts of monolingual text by genre:

  Genre   N_Docs   N_Tokens
  NW       4,949  2,330,470
  SN      16,083    249,496
  WL       2,311    838,385

3.1.2 Parallel and comparable text

Parallel text document and token counts by genre (counts based on
Tigrinya documents):

  Genre   N_Docs   N_Tokens
  NW          40     12,314
  RF         108     39,965

Comparable text document and token counts by genre (counts based on
Tigrinya documents):

  Genre   N_Docs   N_Tokens
  NW         251    117,510
  WL         301     65,940

All parallel text is aligned at the sentence level, while comparable
text is aligned into clusters of documents based on topic similarity.
Parallel and comparable text for Tigrinya and English can be found in
set0/data/translation/, which contains the following structure of
subdirectories:

  found/
      sentence_alignment/
      eng/{ltf,psm}/
      il5/{ltf,psm}/
  comparable/
      clusters/
      eng/{ltf,psm}/
      il5/{ltf,psm}/

The "found" data set consists of files from web data sources that had
parallel text content in Tigrinya and English. Each "leaf" directory in
the tree (*/ltf, */psm, sentence_alignment) contains a matched set of
data files. Parallel file pairs were identified and harvested
automatically, processed into ltf.xml format, and then aligned at the
level of "segments" (putative sentences). The alignment files
(*.align.xml) contain one or more "alignment" elements, in which one or
more "source" (English) segments are associated with one or more
"translation" (Tigrinya) segments. Not all segments in a given (Tigrinya
or English) data file are necessarily accounted for in a given set of
alignments.

The sentence alignment files contain references to the source document
and the translation document (both files can be found in their
respective directories), and multiple "alignment" elements, each of
which contains one source element and one translation element. The
"segments" attribute of the source and translation elements contains
space-delimited segment ids referring to SEG IDs in the corresponding
ltf files; a sketch of reading these files appears below.

NB: We refer to English as the "source" purely as a matter of
convenience and consistency across language packs; we do not have
confirmable evidence as to the true original language of a given data
file. In fact, for some web data sources, it may be the case that
documents were translated from some third language into both English and
Tigrinya.
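The following is a minimal Python sketch (illustrative only) of reading
the sentence alignment files described above; the element and attribute
names follow the prose description in this section and should be
verified against the DTDs in set0/dtds/:

  # read_alignments.py -- illustrative only
  import xml.etree.ElementTree as ET

  def read_alignments(path):
      """Yield (source_seg_ids, translation_seg_ids) per alignment."""
      root = ET.parse(path).getroot()
      for alignment in root.iter("alignment"):
          src = alignment.find("source")
          trg = alignment.find("translation")
          # The "segments" attribute holds space-delimited SEG IDs that
          # point into the corresponding ltf.xml files.
          yield src.get("segments").split(), trg.get("segments").split()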
The "comparable" data set is a more loosely structured inventory of data
files in which particular topics appear to be present in documents in
both languages during roughly the same period of time. LDC used the
results from two clustering techniques:

 (1) Kutuzov et al. (https://arxiv.org/abs/1604.05372) for multilingual
     document clustering on English and Tigrinya.

 (2) Cosine similarity for monolingual document clustering on English,
     later augmented with Tigrinya documents.

Both approaches were run on the tokenized text in the documents' ltf.xml
files. The documents were divided into different sets, where each set
includes all documents with dates that span two weeks (the weeks do not
overlap). The final comparable text clusters consist of English and IL5
documents that were clustered by both approaches and fall within the
same time period. Note that some documents appear in multiple clusters.

The cluster files have names patterned as follows:

  GN_clusters_YYYY-MM-DD_YYYY-MM-DD.xml

where "GN" is either "dfwl" or "nw" (lower-case), representing the genre
of the cluster (discussion forum/weblog or newswire), and the two date
fields are the beginning and end of the time span during which the data
files in that cluster were authored. The xml structure in each cluster
file consists of one or more "cluster" elements, each of which contains
some quantity of "doc" elements from each language.

3.1.3 Lexical and grammatical resources

The docs/ directory contains two subdirectories:

categoryI_dictionary/

This directory contains the file IL5_dictionary.txt, which is a parallel
English-Tigrinya wordlist compiled by LDC, and a file called
IL5_CategoryI_dictionaryinfo.pdf, which provides pointers to additional
bilingual dictionaries available online.

categoryII/

LORELEI Incident Language packs were required to contain (pointers to)
at least 5 of the following 8 "category II" resources:

 -- bilingual IL-non-English dictionary
 -- monolingual IL dictionary
 -- bilingual grammar (reference grammar of the IL in English)
 -- monolingual grammar in the IL
 -- monolingual primer (grammar in the IL of the type used by school
    children)
 -- bilingual gazetteer
 -- monolingual gazetteer in the IL
 -- monolingual gazetteer in English covering the incident region

The categoryII directory contains a pdf file (CategoryII_list.pdf) with
additional information and URLs for the resources identified. The
bilingual_gazetteer.txt is from Geonames (www.geonames.org) and is a
gazetteer for the country of Ethiopia. The parallel_grammar.pdf is a
grammatical sketch of Tigrinya originally created by LDC in 2007 for
another program and included here as a "found" resource.

3.2 Set 1

All data in this set is monolingual text in Tigrinya dated on or after
the incident that serves as the focus of the evaluation. It may contain
some information about the incident, but also contains documents whose
content is not relevant to the incident in any way.

  Genre   N_Docs   N_Tokens
  NW         445    160,294
  RF           4        885
  SN       2,939     47,769
  WL         454     60,659

3.3 Set 2

All data in this set is monolingual text in Tigrinya dated on or after
the incident that serves as the focus of the evaluation. It may contain
some information about the incident, but also contains documents whose
content is not relevant to the incident in any way.

  Genre   N_Docs   N_Tokens
  NW         899    364,031
  RF           6      1,393
  SN       5,877     95,214
  WL         904    122,647

3.4 Set S

All data in this set is monolingual text in English dated on or after
the incident that serves as the focus of the evaluation. It may contain
some information about the incident, but also contains documents whose
content is not relevant to the incident in any way.

  Genre   N_Docs   N_Tokens
  NW          23     26,445

3.5 Set E

3.5.1 Monolingual Text

This data set provides monolingual source data for the LORELEI 2017
Evaluation Test Set in Tigrinya. All data in this set is dated on or
after the incident that serves as the focus of the evaluation.
  Genre   N_Docs   N_Tokens
  DF           1        568
  NW         278     99,374
  SN       2,508     50,130
  WL         204     50,044
  total    2,991    200,116

Annotations obey the "full-token rule", meaning that all reference
annotation extents coincide with token boundaries as provided by the
automatic tokenization process. It was therefore important for
participants in the evaluation to be able to match the LDC's
tokenization for Twitter documents that they retrieved directly from the
Twitter API. For this reason, in setE only, the monolingual_text
directory contains "scrubbed" ltf for Twitter documents. These ltf
documents contain none of the actual tweet content, but instead contain
a series of underscores and whitespace which allow users to match the
tokenization of the tweet via the character offsets provided in the ltf
file.

3.5.2 Translation

Human reference translations were provided for a subset of the data in
the test set. Each Tigrinya document was translated into English by two
independent translators, and both translations are presented (with "A"
or "B" appended to the filename).

  Genre   N_Docs   N_Tokens
  DF           1        568
  NW          96     25,762
  WL         124     24,663
  total      221     50,993

The translation/ directory under setE/data/ contains source and
reference translation files, as follows:

  il5/{ltf,psm}/ -- contain 221 ltf/psm pairs
  eng/{ltf,psm}/ -- contain 442 ltf/psm pairs: two reference
                    translations, having "eng_A" and "eng_B" in their
                    respective file names, for each source file

3.5.3 Annotation

Entity Detection and Linking and Situation Frame annotations were
applied to a subset of the data in the translation set, in order to
identify "entities", "needs" and "issues" to be detected by systems for
scoring purposes:

  Genre   N_Docs   N_Tokens
  DF           1        568
  NW          84     22,546
  SN         490      9,968
  WL         102     16,704
  total      677     49,786

Some of the files that received annotation did not yield annotatable
content for one or more annotation types. The next table shows the
number of files containing reference annotations of each type for each
genre:

          Number of Files containing:
  Genre    Ents   Needs   Issues
  ------------------------------
  DF          1       1        0
  NW         83      56       43
  SN        470     112      176
  WL        106      45       62

The annotation/ directory under setE/data/ contains a tab-delimited file
"il5_edl.tab" containing the entity linking annotation, and a set of
directories containing situation frame annotation, as follows:

  situation_frame/ -- contains subdirectories for each type:
      issues/
      mentions/
      needs/

Situation Frame annotation is designed to extract basic information
about where needs (such as a need for food) and relevant issues (such as
civil unrest) exist; the information is designed to be of the type that
would be useful for planning a disaster response effort. For more
detailed information about situation frame annotation, see
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/smerp2017.pdf.
Guidelines for both the EDL and Situation Frame tasks are included in
the docs/ directory of set0.

3.6 REFLEX Data

3.6.1 Parallel Text

Parallel Text is partitioned into 90% Train and 10% Eval:

  Eval/
  Train/

The Eval directories contain translations in LTF format:

  Eval/Translations/From_English/Tigrinya/ltf/
  Eval/Translations/From_English/English/ltf/
  Eval/Translations/To_English/Tigrinya/ltf/
  Eval/Translations/To_English/English/ltf/

Each of the Train directories is further partitioned into Translations
and Special_Corpora:
  Train/Translations/Tigrinya/
  Train/Translations/English/
  Train/Special_Corpora/Tigrinya/
  Train/Special_Corpora/English/

The document and word counts are as follows (all word counts based on
Tigrinya):

  Genre     Partition            N_Docs   N_Tokens
  NW        Eval-To_English          14     14,171
  NW        Train-To_English         79    124,973
  NW        Eval-From_English        14      5,478
  NW        Train-From_English      127     50,173
  Special   Elicitation               1     21,776
  Special   Phrasebook                1      8,333

3.6.2 Named Entity Annotation

The Named_Entity_Annotation directory is partitioned into 90% Train and
10% Eval. The token IDs in the annotation (LAF) files point to the token
IDs used in the text (LTF) files in the same directory. That is, each of
the directories (Train and Eval) contains LAF and LTF files, e.g.

  ABC_TIR_20001128.1830.1262.ltf.xml
  ABC_TIR_20001128.1830.1262.laf.xml

The document and word counts are as follows:

  Genre   Partition   N_Docs   N_Tokens
  NW      Eval            15      8,194
  NW      Train          125     87,178

4.0 Data Formats

The data formats described below are common across all sets.

4.1 PSM - Primary Structural Markup

When original data has structural markup interleaved with the language
content, we apply a filtering process that, in effect, separates the
markup and language content into distinct files. The language content
(with white-space normalization) goes into an RSD file (see below), and
the relevant markup content goes into a corresponding PSM file, which is
a simple XML stream comprising tags with attributes, and no other text
content of its own. (Configuring the filter for a given data source
involves determining which content and markup are "relevant"; the filter
eliminates other content and markup as irrelevant, such as ads,
navigation menus, etc.)

Each PSM file has a single "psm" tag as its root element, and contains
one or more "string" tags. Each "string" refers to some span of text in
the corresponding RSD file, using "begin_offset" and "char_length"
attributes, and assigns a label to it, using a "type" attribute. (Note
that offsets and lengths are expressed as Unicode CHARACTER counts, not
byte counts.) The "type" attribute tells what sort of markup tag was
used in the original data to contain the given string (e.g. "p",
"quote", etc.); when sentence segmentation can be done as part of the
filtering step, a "string" tag with type="seg" is used to label the span
of each detected sentence.

Some structural tags in original data contain attributes that may be
relevant to language research; for example, in a file that contains a
thread from a discussion forum, it's useful to keep track of the dates
and authors of posts within the thread. For these cases, the "string"
element can contain one or more "attribute" elements, to preserve the
name and value of the given attribute -- e.g. (values illustrative):

  <string type="post" begin_offset="532" char_length="286">
    <attribute name="id" value="p2"/>
    <attribute name="post_author" value="some_user"/>
    <attribute name="post_date" value="2014-09-12"/>
  </string>

As shown in this example, the "attribute" tag is also used, where
appropriate, to assign an ID value (unique within the file) to each
string of a given type; this is also used with the "seg"-type strings to
assign IDs to detected sentences.

PSM files appear in the data/monolingual_text/ and data/translation/
directories of each set.
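Because PSM offsets are plain character indexes into the RSD text,
recovering labeled spans is straightforward. A minimal Python sketch
(illustrative only; note that the rsd file must be read as text, since
offsets count Unicode characters, not bytes, and the file names in the
commented usage are hypothetical):

  # psm_spans.py -- illustrative only
  import xml.etree.ElementTree as ET

  def labeled_spans(psm_path, rsd_path):
      """Yield (type, text) pairs for the "string" spans of a PSM file."""
      # newline="" prevents newline translation from shifting offsets
      with open(rsd_path, encoding="utf-8", newline="") as f:
          rsd = f.read()
      for string in ET.parse(psm_path).getroot().iter("string"):
          begin = int(string.get("begin_offset"))
          length = int(string.get("char_length"))
          yield string.get("type"), rsd[begin:begin + length]

  # e.g., to print each detected sentence of a document:
  # for typ, text in labeled_spans("doc.psm.xml", "doc.rsd.txt"):
  #     if typ == "seg":
  #         print(text)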
4.2 LTF - LORELEI Text Format

LTF was originally developed for language packs produced in the REFLEX
Program ("LCTL Text Format"). This XML format uses structural tags "SEG"
and "TOKEN" to mark the sentence segmentation and word tokenization of
the source data. The full original text of each sentence (SEG) is
contained in an "ORIGINAL_TEXT" tag, and each individual word and
punctuation string is contained, in order of occurrence, in a sequence
of "TOKEN" elements, along with various attributes for each token. Both
SEG and TOKEN attributes include character offsets relative to the
beginning of the raw source data ("RSD" file format, described below),
with the offset of the first character being 0.

LTF files appear in the data/monolingual_text/ and data/translation/
(where applicable) directories of each set.

4.3 EDL (Entity Detection and Linking)

The file "il5_edl.tab" contains all EDL annotations for the IL5 EDL
subset. The table contains eight columns, as follows:

  column 1: system_run_id -- "LDC"
  column 2: mention_id
  column 3: mention_text
  column 4: extents
  column 5: kb_id -- numeric ID or "NIL"+numeric; may contain multiple
            KB links separated by | ("pipe" symbol)
  column 6: entity_type
  column 7: mention_type
  column 8: confidence

When column 5 is fully numeric, it is a citation to a numbered entity in
the LORELEI Entity Detection and Linking Knowledge Base (distributed
separately as LDC2020T10); when it consists of "NIL" plus digits, it
refers to an entity that is not present in the Knowledge Base, but this
label is used consistently for all mentions of the particular entity.

Note that for any annotated Twitter documents, text extents have been
replaced by underscore ("_") characters to comply with the prohibition
against distributing the text of tweets directly. Character offsets can
be used to align the annotations with the tweets once the user has
downloaded them using Twitter's API.
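A minimal Python sketch of reading il5_edl.tab (illustrative only; it
assumes the table has no header row -- adjust if it does):

  # read_edl.py -- illustrative only
  import csv

  COLUMNS = ["system_run_id", "mention_id", "mention_text", "extents",
             "kb_id", "entity_type", "mention_type", "confidence"]

  def read_edl(path):
      with open(path, encoding="utf-8", newline="") as f:
          reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
          for row in reader:
              rec = dict(zip(COLUMNS, row))
              # A mention may carry several KB links, "|"-separated;
              # "NIL"-prefixed ids denote entities absent from the KB.
              rec["kb_links"] = rec["kb_id"].split("|")
              yield rec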
4.4 Situation Frame

Situation frame annotation consists of three parts, each presented as a
separate tab-delimited file: entities, needs, and issues. The details of
each table are described below.

Entities, mentions, need frames, and issue frames all have IDs that
follow a standard schema consisting of a prefix designating the type of
ID ('Ent' for entities, 'Men' for mentions, and 'Frame' for both need
and issue frames), an alphanumeric string identifying the annotation
"kit", and a numeric string uniquely identifying the specific entity,
mention, or frame within the document.

4.4.1 Entities

The grouping of entity mentions into "selectable entities" for situation
frame annotation is provided in the mentions/ subdirectory. The table
has 8 columns with the following headers and descriptions:

  column 1: doc_id -- doc ID of source file for the annotation
  column 2: entity_id -- unique identifier for each grouped entity
  column 3: mention_id -- unique identifier for each entity mention
  column 4: entity_type -- one of PER, ORG, GPE, LOC
  column 5: mention_status -- 'representative' or 'extra';
            representative mentions are the ones which have been chosen
            by the annotator as the representative name for that entity.
            Each entity has exactly one representative mention.
  column 6: start_char -- character offset for the start of the mention
  column 7: end_char -- character offset for the end of the mention
  column 8: mention_text -- mention string

Again, note that for any annotated Twitter documents, text extents have
been replaced by underscore ("_") characters to comply with the
prohibition against distributing the text of tweets directly.

4.4.2 Needs

Annotation of need frames is provided in the needs/ subdirectory. Each
row in the table represents a need frame in the annotated document. The
table has 13 columns with the following headers and descriptions:

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of source file for the annotation
  column 3: frame_id -- unique identifier for each frame
  column 4: frame_type -- 'need'
  column 5: need_type -- exactly one of 'evac' (evacuation), 'food'
            (food supply), 'search' (search/rescue), 'utils' (utilities,
            energy, or sanitation), 'infra' (infrastructure), 'med'
            (medical assistance), 'shelter' (shelter), or 'water' (water
            supply)
  column 6: place_id -- entity ID of the LOC or GPE entity identified as
            the place associated with the need frame; only one place
            value per need frame, must match one of the entity IDs in
            the corresponding ent_output.tsv or be 'none' (indicating no
            place was named)
  column 7: proxy_status -- 'True' or 'False'
  column 8: need_status -- 'current', 'future' (future only), or 'past'
            (past only)
  column 9: urgency_status -- 'True' (urgent) or 'False' (not urgent)
  column 10: resolution_status -- 'sufficient' or 'insufficient'
             (insufficient / unknown sufficiency)
  column 11: reported_by -- entity ID of one or more entities reporting
             the need; multiple values are comma-separated, must match
             entity IDs in the corresponding ent_output.tsv or be 'none'
  column 12: resolved_by -- entity ID of one or more entities resolving
             the need; multiple values are comma-separated, must match
             entity IDs in the corresponding ent_output.tsv or be 'none'
  column 13: description -- string of text entered by the annotator as a
             memory aid during annotation; no requirements for content
             or language; may be 'none'

4.4.3 Issues

Annotation of issue frames is provided in the issues/ subdirectory. Each
row in the table represents an issue frame in the annotated document.
The table has 9 columns with the following headers and descriptions:

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of source file for the annotation
  column 3: frame_id -- unique identifier for each frame
  column 4: frame_type -- 'issue'
  column 5: issue_type -- exactly one of 'regimechange' (regime change),
            'crimeviolence' (civil unrest or widespread crime), or
            'terrorism' (terrorism or other extreme violence)
  column 6: place_id -- entity ID of the LOC or GPE entity identified as
            the place associated with the issue frame; only one place
            value per issue frame, must match one of the entity IDs in
            the corresponding ent_output.tsv or be 'none'
  column 7: proxy_status -- 'True' or 'False'
  column 8: issue_status -- 'current' or 'not_current'
  column 9: description -- string of text entered by the annotator as a
            memory aid during annotation; no requirements for content or
            language; may be 'none'

5.0 Software tools included in this release

All software tools are provided in the tools/ directory of Set 0.

5.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned to
recreate exactly the "raw source data" text stream (the rsd.txt file)
from which the LTF was created. The tools described here can be used to
apply that conditioning, either to a directory or to a zip archive file
containing ltf.xml data. In either case, the scripts validate each
output rsd.txt stream by comparing its MD5 checksum against the
reference MD5 checksum of the original rsd.txt file from which the LTF
was created. (This reference checksum is stored as an attribute of the
"DOC" element in the ltf.xml structure; there is also an attribute that
stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content;
you can run "perldoc" to view the documentation as a typical unix man
page, or you can simply view the script content directly by whatever
means to read the documentation. Also, running either script without any
command-line arguments will cause it to display a one-line synopsis of
its usage, and then exit.

  ltf2rsd.perl    -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives
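For users who prefer not to run the Perl scripts, the following rough
Python analogue illustrates the same reconstruct-and-verify logic. It is
a sketch only, and the released ltf2rsd.perl is authoritative: the DOC
attribute names used here (raw_text_char_length, raw_text_md5) should be
checked against the DTDs in set0/dtds/, and the sketch assumes that
inter-segment gaps in the original rsd.txt were newline characters.

  # rsd_check.py -- illustrative only; ltf2rsd.perl is authoritative
  import hashlib
  import xml.etree.ElementTree as ET

  def rebuild_rsd(ltf_path):
      """Rebuild the rsd text from an ltf.xml file; return (text, ok)."""
      root = ET.parse(ltf_path).getroot()
      doc = root if root.tag == "DOC" else root.find(".//DOC")
      n = int(doc.get("raw_text_char_length"))
      buf = ["\n"] * n                    # assume gaps were newlines
      for seg in doc.iter("SEG"):
          start = int(seg.get("start_char"))
          text = seg.find("ORIGINAL_TEXT").text or ""
          buf[start:start + len(text)] = list(text)
      rsd = "".join(buf)
      ok = (hashlib.md5(rsd.encode("utf-8")).hexdigest()
            == doc.get("raw_text_md5"))
      return rsd, ok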
5.2 "twitter-processing" (source code written in Ruby)

Due to the Twitter Terms of Use, the text content of individual tweets
cannot be redistributed by the LDC. As a result, users must download the
tweet contents directly from Twitter and condition/normalize the text in
a manner equivalent to what was done by the LDC, in order to reproduce
the Tigrinya raw text that was used by LDC for annotation. The
twitter-processing software provided in the tools/ directory enables
users to perform this normalization and to ensure that the user's
version of the tweet matches the version used by LDC, by verifying that
the md5sum of the user-downloaded and processed tweet matches the md5sum
provided in the twitter_info.tab file (a minimal sketch of this check
appears after section 5.3 below). Users must have a developer account
with Twitter in order to download tweets, and the tool does not replace
or circumvent the Twitter API for downloading tweets.

The twitter_info.tab file provides the twitter download id for each
tweet, along with the LORELEI file name assigned to that tweet and the
md5sum of the processed text from the tweet. The file "README.md" in the
tools/twitter-processing/ directory provides details on how to install
and use the source code in this directory in order to condition text
data that the user downloads directly from Twitter, and to produce both
the normalized raw text and the segmented, tokenized ltf.xml output.

5.3 Encoding

The common framework for text processing in LORELEI includes a
"normalization" step, which allows for rectifying variations in
orthography and/or punctuation that may occur with some frequency in a
particular language. For overall simplicity and consistency in
processing across all languages, this normalization step is always
invoked; in languages that require no special normalization, this step
leaves the data unchanged.
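The md5 verification step described in section 5.2 can also be
reproduced outside the Ruby tool once the normalized text is in hand
(the normalization itself should still be done with the released tool).
A minimal Python sketch, assuming twitter_info.tab has no header row and
that the md5 is computed over the UTF-8 bytes of the normalized text:

  # check_tweets.py -- illustrative only
  import csv
  import hashlib

  def load_twitter_info(path):
      """Map doc uid -> (tweet_id, expected_md5, author_id)."""
      with open(path, encoding="utf-8", newline="") as f:
          reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
          return {row[0]: (row[1], row[2], row[3]) for row in reader}

  def tweet_matches(normalized_text, expected_md5):
      digest = hashlib.md5(normalized_text.encode("utf-8")).hexdigest()
      return digest == expected_md5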
6.0 Documentation included in this release

Each set has its own docs directory, but the types of files found there
are consistent across the sets (except REFLEX_Tigrinya), as described
below.

IL5_IncidentDescription.pdf and IL5_IncidentDescription_Appendix.pdf:
provide a description of, and additional links and information about,
the incidents that were the focus of the evaluation data set. Found in
set0/docs/ only.

SimpleNamedEntityGuidelines_IL5_V1.2.pdf,
EntityLinkingGuidelines_V1.2.1.pdf and SituationFrameGuidelines_V3.0.pdf:
guidelines for entity annotation, entity linking, and situation frame
annotation. Found in set0/docs/ only.

twitter_info.tab: contains tab-separated columns: doc uid, tweet id,
normalized md5 of the tweet text, and tweet author id for all tweets in
the release. Found in all sets except setS and REFLEX, which contain no
Twitter data.

source_codes.tab: contains tab-separated columns: genre, source code,
source name, and base url for each source in the release. Found in all
sets except REFLEX.

urls.tab: contains tab-separated columns: doc uid and url. Note that the
url column is empty for documents from older releases for which the url
is not available; these documents are included here so that the uid
column can serve as a document list for the package. Found in all sets
except REFLEX.

annotated_filelist_EDL.txt, annotated_filelist_MT.txt,
annotated_filelist_SF.tab: lists of all files annotated for the EDL
task, all files with human reference translations, and all files
annotated for the Situation Frame task. Found in setE only.

domain_filelist.tab: lists all documents for which human reference
translations and/or annotations were produced and provides a domain
judgement: eval_incident (document contains information about the
incidents that were the focus of the evaluation), indomain (document is
relevant to the overall LORELEI domain of humanitarian assistance and
disaster relief and related situations, but not specifically the
incident of focus), or nondomain (document is of unspecified topic, not
related to the LORELEI domain or incidents). Found in setE/docs/ only.

filelist.txt: lists the doc id for all documents in setE. Found in
setE/docs/ only.

LCTL_Formats-v2.5.pdf, TimeAnnotationGuidelinesV1.0.pdf,
SimpleNamedEntityGuidelinesV6.5.pdf: guidelines and format descriptions
that pertain to the REFLEX data. Note that the format description may
contain information about formats for data sets that are not included in
this corpus.

7.0 Acknowledgement

The authors would like to acknowledge the following contributors to this
corpus: Song Chen, Dana Delgado, Neville Ryant, Brian Gainor, Neil
Kuster, the University of Maryland Applied Research Laboratory for
Intelligence and Security (ARLIS), formerly the UMD Center for Advanced
Study of Language (CASL), and our team of Tigrinya annotators.

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123.
Any opinions, findings and conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily reflect
the views of DARPA.

8.0 Copyright

Portions © 2004-2006 Adal.161.com, © 2015 Addis Standard Magazine,
© 2003 Agence France Presse, © 2015-2016 Agroindustrial Association of
Ukraine, © 2015-2016 Al Jazeera Media Network, © 2016 AllAfrica, © 2000
American Broadcasting Company, © 2005-2006, 2009-2016 Asmarino
Independent, © 2009-2010 Assenna.com, © 2016 Associated Newspapers
Limited, © 2003-2006 Awate.com, © 2016 BBC, © 2000 Cable News Network,
LP, LLP, © 2016 Cable News Network. Turner Broadcasting System, Inc.,
© 2017 democrasia.org, © 2006 Dow Jones & Company, Inc., © 2004
Gabeel.net, © 2016 Geeskaafrika, © 2015 Guardian News & Media Limited or
its affiliated companies, © 2006-2007 Haddas Ertra, © 2016 IPI
International Peace Institute, © 2011, 2013-2015 Lac Viet Computing
Corporation, © 2000 National Broadcasting Company, Inc., © 2004-2006
Nharnet.com, © 2000 Public Radio International, © 2015-2017 Radio Erena,
© 2016 Reuters, © 2003 The Associated Press, © 2016-2017 The Migrant
Project, © 2015 The New Humanitarian, © 2017 The Voice of the Tigray,
© 2016 The Washington Post, © 2016 Tigraionline.com, © 2015 United
Nations Office for the Coordination of Humanitarian Affairs, © 2017
Watch Tower Bible and Tract Society of Pennsylvania, © 2004-2006
www.degebat.com, © 2004-2006 www.hornofafrica.de, © 2003 Xinhua News
Agency, © 2017, 2020 Trustees of the University of Pennsylvania

9.0 Contacts

If you have questions about this data release, please contact the
following personnel at LDC:
  Stephanie Strassel - LORELEI PI
  Jennifer Tracey    - LORELEI Project Manager
  Jonathan Wright    - LORELEI Technical Lead