README FILE FOR LDC CATALOG ID: LDC2018T11

TITLE: LORELEI Somali Representative Language Pack - Monolingual and
       Parallel Text

AUTHORS: Jennifer Tracey, Dave Graff, Stephanie Strassel, Xiaoyi Ma,
         Jonathan Wright


1.0 Introduction

LORELEI Somali Representative Language Pack - Monolingual and Parallel
Text was developed by the Linguistic Data Consortium (LDC) for the DARPA
LORELEI Program and consists of approximately 13 million words of
monolingual Somali text, approximately 800,000 of which are translated
into English. An additional 100,000 words of English text are translated
into Somali.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource
languages in the context of emergent situations like natural disasters
or disease outbreaks. Linguistic resources for LORELEI include
Representative Language Packs for over two dozen low resource languages,
comprising data, annotations, basic natural language processing tools,
lexicons and grammatical resources. Representative languages are
selected to provide broad typological coverage, while Incident Languages
are selected to evaluate system performance on a language whose identity
is disclosed only at the start of the evaluation and for which no
training data has been provided.

This corpus comprises the complete set of monolingual text and parallel
text from the LORELEI Somali Representative Language Pack. The other
components of the Somali Representative Language Pack appear in a
separate corpus.

For more information about LORELEI language resources, see
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2016-lorelei-language-packs.pdf


2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized
below -- paths shown are relative to the base (root) directory of the
package:

  ./README.txt -- this file

  ./dtds/
  ./dtds/ltf.v1.5.dtd
  ./dtds/psm.v1.0.dtd

  ./docs/ -- documentation files (see Section 8.0)

  ./tools/
  ./tools/ltf2txt            -- software for extracting raw text from
                                ltf.xml data files
  ./tools/twitter-processing -- software for conditioning Twitter text
                                data

  ./data/monolingual_text/zipped/ -- zip-archive files containing
                                     monolingual "ltf" and "psm" data

  ./data/translation/
      from_som/{som,eng}/ -- translations from Somali to English
      from_eng/{som,eng}/ -- translations from English to Somali

  For each language in each direction, "ltf" and "psm" directories
  contain the corresponding data files.

2.2 File Name Conventions

There are 106 *.ltf.zip files in the monolingual_text/zipped directory,
together with the same number of *.psm.zip files. Each {ltf,psm}.zip
file pair contains an equal number of corresponding data files. The
"file-ID" portion of each zip file name corresponds to common substrings
in the file names of all the data files contained in that archive. For
example:

  ./data/monolingual_text/zipped/SOM_DF_G00201.ltf.zip contains:
      ltf/SOM_DF_001600_20070125_G00201AZ3.ltf.xml
      ltf/SOM_DF_001600_20071210_G00201AX2.ltf.xml
      ...

  ./data/monolingual_text/zipped/SOM_WL_G00239.psm.zip contains:
      psm/SOM_WL_002629_20060929_G00239UBA.psm.xml
      psm/SOM_WL_002629_20060929_G00239UBC.psm.xml
      ...

The file names assigned to individual documents within the zip archive
files provide the following information about the document:

  Language   3-letter abbreviation
  Genre      2-letter abbreviation
  Source     6-digit numeric ID assigned to the data provider
  Date       8-digit numeric: YYYYMMDD (year, month, day)
  Global-ID  9-character alphanumeric ID assigned to this document

Those five fields are joined by underscore characters, yielding a
32-character file-ID. Three portions of the document file-ID are used to
set the name of the zip file that holds the document: the Language and
Genre fields, and the first 6 characters of the Global-ID (as
illustrated in the sketch below).

The 2-letter codes used for genre are as follows:

  DF -- discussion forum
  NW -- news
  RF -- reference (e.g. Wikipedia)
  SN -- social network (Twitter)
  WL -- web-log
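For illustration, the following Python sketch (not part of the release
software) splits a document file-ID into its five fields and derives the
name of the zip archive expected to hold that document. The field widths
follow the conventions above; the helper names are purely illustrative.

  #!/usr/bin/env python3
  # Illustrative sketch only: split a LORELEI document file-ID into its
  # five underscore-joined fields and derive the name of the zip archive
  # that should contain the document, per the conventions above.

  import re

  FILE_ID = re.compile(r'([A-Z]{3})_([A-Z]{2})_(\d{6})_(\d{8})_(\w{9})$')

  def parse_file_id(file_id):
      """Return a dict of the five fields of a 32-character file-ID."""
      m = FILE_ID.match(file_id)
      if m is None:
          raise ValueError('unexpected file-ID format: ' + file_id)
      lang, genre, source, date, global_id = m.groups()
      return {'language': lang, 'genre': genre, 'source': source,
              'date': date, 'global_id': global_id}

  def zip_name(file_id, kind='ltf'):
      """Name of the {ltf,psm}.zip archive expected to hold the document."""
      f = parse_file_id(file_id)
      return '{}_{}_{}.{}.zip'.format(f['language'], f['genre'],
                                      f['global_id'][:6], kind)

  # Example, using a document name from the listing above:
  #   zip_name('SOM_DF_001600_20070125_G00201AZ3')        -> 'SOM_DF_G00201.ltf.zip'
  #   zip_name('SOM_DF_001600_20070125_G00201AZ3', 'psm') -> 'SOM_DF_G00201.psm.zip'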
3.0 Content Summary

3.1 Monolingual Text

  Genre    #Docs    #Tokens
  NW       29728    6285434
  DF        7791    2112681
  WL       16468    4969323
  RF           4       4663
  SN        7493     107871
  Total    61484   13479972

Note that the SN (Twitter) data cannot be distributed directly by LDC,
due to the Twitter Terms of Use. The file "docs/twitter_info.tab"
(described in Section 7.0 below) provides the information needed for
users to fetch the particular tweets directly from Twitter.

3.2 Parallel Text

  Type     Genre    #Docs    #Segs    #Tokens
  FromEng  EL           2     3723      18549
  FromEng  NW         190     3913      87438
  ToEng    DF         977    11727     217766
  ToEng    NW        2449    19756     448075
  ToEng    WL         374    10219     158497
  Total              3992    49338     930325

4.0 Data Collection and Parallel Text Creation

Both monolingual text collection and parallel text creation involve a
combination of manual and automatic methods. These methods are described
in the sections below.

4.1 Monolingual Text Collection

Data is identified for collection by native speaker "data scouts," who
search the web for suitable sources, designating individual documents
that are in the target language and discuss topics of interest to the
LORELEI program (humanitarian aid and disaster relief). Each document
selected for inclusion in the corpus is then harvested, along with the
entire website when suitable. Thus the monolingual text collection
contains some documents which have been manually selected and/or
reviewed, and many others which have been automatically harvested and
were not subject to manual review.

4.2 Parallel Text Creation

Parallel text for LORELEI was created using three different methods --
professional translation, crowdsourced translation, and harvesting of
found (pre-existing) parallel text -- and each LORELEI language may have
parallel text from one or all of these methods. In addition to
translation from each of the LORELEI languages to English, each language
pack contains a "core" set of English documents that were translated
into each of the LORELEI Representative Languages. These documents
consist of news documents, a phrasebook of conversational sentences, and
an elicitation corpus of sentences designed to elicit a variety of
grammatical structures.

All translations are aligned at the sentence level. For professional and
crowdsourced translation, the segments align one-to-one between the
source and target language (i.e. segment 1 in the English aligns with
segment 1 in the source language); a sketch of this pairing appears at
the end of this section. For found parallel text, automatic alignment is
performed and a separate alignment file provides information about how
the segments in the source and translation are aligned. Professionally
translated data has one translation for each source document, while
crowdsourced translations have up to four translations for each source
document, designated by A, B, C, or D appended to the file names of the
multiple translation versions.
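As an illustration of the one-to-one alignment described above, the
Python sketch below (not one of the release tools) pairs the segments of
a source ltf.xml file with those of its translation by position. The
element names (SEG, TOKEN) follow the LTF format described in Section
6.2; treating each TOKEN's text content as the word string, and the
example file paths, are assumptions of this sketch.

  #!/usr/bin/env python3
  # Illustrative sketch only: pair source and translation segments by
  # position, relying on the one-to-one segment alignment that holds for
  # professional and crowdsourced translations (NOT for found parallel
  # text, which uses a separate alignment file).

  import xml.etree.ElementTree as ET

  def segments(ltf_path):
      """Return a list of (seg_id, text) pairs from an ltf.xml file.
      Assumes each TOKEN element's text content is the token string."""
      root = ET.parse(ltf_path).getroot()
      segs = []
      for seg in root.iter('SEG'):
          tokens = [tok.text for tok in seg.iter('TOKEN') if tok.text]
          segs.append((seg.get('id'), ' '.join(tokens)))
      return segs

  def paired_segments(src_ltf, eng_ltf):
      """Zip source-language and English segments in document order."""
      return list(zip(segments(src_ltf), segments(eng_ltf)))

  # Hypothetical file names for one Somali-to-English pair:
  #   paired_segments(
  #       'data/translation/from_som/som/ltf/SOM_NW_001234_20160101_G00001ABC.ltf.xml',
  #       'data/translation/from_som/eng/ltf/SOM_NW_001234_20160101_G00001ABC.ltf.xml')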
5.0 Data Processing and Character Normalization for LORELEI

Most of the content has been harvested from various web sources using an
automated system that is driven by manual scouting for relevant
material. Some content may have been harvested manually, or by means of
ad-hoc scripted methods for sources with unusual attributes.

All harvested content was initially converted from its original HTML
form into a relatively uniform XML format; this stage of conversion
eliminated irrelevant content (menus, ads, headers, footers, etc.) and
placed the content of interest into a simplified, consistent markup
structure. The "homogenized" XML format then served as input for the
creation of a reference "raw source data" (rsd) plain text form of the
web page content; at this stage, the text was also conditioned to
normalize white-space characters and to apply transliteration and/or
other character normalization, as appropriate to the given language.

6.0 Overview of XML Data Structures

6.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set
of tags needed to represent the structure of the relevant text as seen
by the human web-page reader. When the text content of the XML file is
extracted to create the "rsd" format (which contains no markup at all),
the markup structure is preserved in a separate "primary source markup"
(psm.xml) file, which enumerates the structural tags in a uniform way
and indicates, by means of character offsets into the rsd.txt file, the
spans of text contained within each structural markup element.

For example, in a discussion-forum or web-log page, there would be a
division of content into the discrete "posts" that make up the given
thread, along with "quote" regions and paragraph breaks within each
post. After the HTML has been reduced to uniform XML, and the tags and
text of the latter format have been separated, information about each
structural tag is kept in a psm.xml file, preserving the type of each
relevant structural element, along with its essential attributes
("post_author", "date_time", etc.) and the character offsets of the text
span comprising its content in the corresponding rsd.txt file.

6.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt and contains a fully
segmented and tokenized version of the text content for a given web
page. Segments (sentences) and tokens (words) are marked off by XML tags
(SEG and TOKEN), with "id" attributes (which are unique only within a
given XML file) and character offset attributes relative to the
corresponding rsd.txt file; TOKEN tags have additional attributes to
describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly by
suitable punctuation in the original source data. To the extent that
sentence boundaries cannot be accurately detected (due to variability or
ambiguity in the source data), the segmentation process will tend to err
more often on the side of missing actual sentence boundaries, and (we
hope) less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play
particular roles in web-based text (e.g. URLs, email addresses and
hashtags). To the extent that word boundaries are not explicitly marked
in the source text, the LTF tokenization is intended to divide the
raw-text character stream into units that correspond to "words" in the
linguistic sense (i.e. basic units of lexical meaning).
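For illustration, the Python sketch below (not part of the release
tools) recovers the text of each segment directly from the corresponding
rsd.txt file via the character offsets described above. The offset
attribute names used here ("start_char"/"end_char") and their treatment
as inclusive offsets are assumptions of this sketch; consult
dtds/ltf.v1.5.dtd for the authoritative attribute inventory.

  #!/usr/bin/env python3
  # Illustrative sketch only: slice the rsd.txt character stream by the
  # offsets recorded on each SEG element of the corresponding ltf.xml
  # file.  Attribute names ("start_char", "end_char") and inclusive end
  # offsets are assumptions here -- check dtds/ltf.v1.5.dtd before
  # relying on them.

  import xml.etree.ElementTree as ET

  def seg_spans(ltf_path, rsd_path):
      """Yield (seg_id, span_text) pairs for every SEG in the ltf.xml file."""
      with open(rsd_path, encoding='utf-8') as f:
          rsd = f.read()
      root = ET.parse(ltf_path).getroot()
      for seg in root.iter('SEG'):
          start = int(seg.get('start_char'))
          end = int(seg.get('end_char'))
          yield seg.get('id'), rsd[start:end + 1]

  # Hypothetical usage (the rsd.txt file is what tools/ltf2txt reconstructs):
  #   for seg_id, text in seg_spans('SOM_NW_001234_20160101_G00001ABC.ltf.xml',
  #                                 'SOM_NW_001234_20160101_G00001ABC.rsd.txt'):
  #       print(seg_id, text)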
7.0 Software tools included in this release

7.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned to
recreate exactly the "raw source data" text stream (the rsd.txt file)
from which the LTF was created. The tools described here can be used to
apply that conditioning, either to a directory or to a zip archive file
containing ltf.xml data. In either case, the scripts validate each
output rsd.txt stream by comparing its MD5 checksum against the
reference MD5 checksum of the original rsd.txt file from which the LTF
was created. (This reference checksum is stored as an attribute of the
"DOC" element in the ltf.xml structure; there is also an attribute that
stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content;
you can run "perldoc" to view the documentation as a typical unix man
page, or you can simply read the documentation by viewing the script
content directly. Running either script without any command-line
arguments will cause it to display a one-line synopsis of its usage and
then exit.

  ltf2rsd.perl    -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

7.2 "twitter-processing" (source code written in Ruby)

Due to the Twitter Terms of Use, the text content of individual tweets
cannot be redistributed by the LDC. As a result, users must download the
tweet contents directly from Twitter and condition/normalize the text in
a manner equivalent to what was done by the LDC, in order to reproduce
the Somali raw text that was used by LDC for annotation (to be released
separately). The twitter-processing software provided in the tools/
directory enables users to perform this normalization and to ensure that
the user's version of each tweet matches the version used by LDC, by
verifying that the md5sum of the user-downloaded and processed tweet
matches the md5sum provided in the twitter_info.tab file. Users must
have a developer account with Twitter in order to download tweets, and
the tool does not replace or circumvent the Twitter API for downloading
tweets.

The twitter_info.tab file provides the Twitter download id for each
tweet, along with the LORELEI file name assigned to that tweet and the
md5sum of the processed text of the tweet. The file "README.md" in the
tools/twitter-processing directory provides details on how to install
and use the source code in that directory in order to condition text
data that the user downloads directly from Twitter and to produce both
the normalized raw text and the segmented, tokenized ltf.xml output.

8.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release)
contains four files (a fifth file, described in Section 9.2 below, lists
documents that retain an older sentence segmentation):

  char_tally.{lng}.tab - tab-separated columns: doc uid, number of
      non-whitespace characters, number of non-whitespace characters in
      the expected script, and number of anomalous (non-printing)
      characters, for each document in the release

  source_codes.txt - tab-separated columns: genre, source code, source
      name, and base url, for each source in the release

  twitter_info.tab - tab-separated columns: doc uid, tweet id,
      normalized md5 of the tweet text, and tweet author id, for all
      tweets in the release

  urls.tab - tab-separated columns: doc uid and url. Note that the url
      column is empty for documents from older releases for which the
      url is not available; such documents are included here so that the
      uid column can serve as a document list for the package.
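As an illustration of how docs/twitter_info.tab can be used together
with the output of the tools/twitter-processing code, the Python sketch
below (not part of the release tools) loads the table and compares the
MD5 of locally normalized tweet text against the reference value. The
normalization itself must still be performed with the provided Ruby
tool; treating the checksum as the MD5 of the UTF-8 bytes of the
normalized text is an assumption of this sketch.

  #!/usr/bin/env python3
  # Illustrative sketch only: look up each tweet's reference md5 in
  # docs/twitter_info.tab (tab-separated: doc uid, tweet id, normalized
  # md5, tweet author id) and compare it against the md5 of tweet text
  # that the user has downloaded and normalized with the provided
  # twitter-processing tool.  UTF-8 encoding of the normalized text is
  # an assumption of this sketch.

  import csv
  import hashlib

  def load_twitter_info(path='docs/twitter_info.tab'):
      """Map doc uid -> (tweet id, reference md5, tweet author id)."""
      info = {}
      with open(path, encoding='utf-8', newline='') as f:
          for row in csv.reader(f, delimiter='\t'):
              if len(row) >= 4:
                  doc_uid, tweet_id, ref_md5, author_id = row[:4]
                  info[doc_uid] = (tweet_id, ref_md5, author_id)
      return info

  def matches_reference(doc_uid, normalized_text, info):
      """True if the normalized tweet text reproduces LDC's version."""
      digest = hashlib.md5(normalized_text.encode('utf-8')).hexdigest()
      return digest == info[doc_uid][1]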
9.0 KNOWN ISSUES

9.1 Double-escaping of character entity references in some translation data

In the "from_som" translation set, one Somali/English ltf.xml file pair
in the "WL" genre contains tokens like "&amp;amp;" or "&amp;lt;". As a
result, the raw text extracted from this file will contain tokens like
"&amp;" or "&lt;" (i.e. XML character entities that normally would not
be seen in raw, unstructured text). This traces back to one Somali
source file that contained some residual HTML markup, and it should pose
no problem when using the WL data.

9.2 Inconsistent sentence boundary detection in some monolingual text data

Late in the course of data collection for this language, a flaw was
discovered in the process that applied automatic sentence segmentation:
it caused false sentence breaks to be inserted around strings that
formed the content of anchor tags in the original (as harvested) HTML.
In general, the problem affects blog sources (WL) the most and news
agency sources (NW) the least, owing to the relative likelihood that
content authors will treat some portion of a sentence as the content of
an anchor tag.

This flaw in the segmentation code has been fixed, and most of the data
in this release has been processed into ltf.xml format using the newer
version of sentence segmentation. (NB: the new version, being automatic,
is still not perfect, and may lead to a slightly higher miss rate for
"true" sentence boundaries, but on balance the overall sentence
segmentation should be better than with the earlier version of the
process, especially in the WL genre.)

This fix of the sentence segmenter did not occur until after files had
been selected and sent out for translation, so the English translation
files (and various forms of annotation: named entity, etc.) are based on
the earlier (faulty) version of segmentation. In order to preserve the
alignment between English translations, other annotations, and the
source-language data, the newer segmentation has NOT been applied to
this subset of the data.

An additional file in the "docs" directory lists the file-IDs of the
files where the older segmentation logic has been retained (one file-ID
per line):

  docs/odd_sentence_seg_fileids.txt

The files listed here are the ones where the newer segmentation logic
would have produced a different outcome, but the newer logic has not
been applied, because doing so would disrupt the alignment of the
corresponding translation.
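For users who need to treat the affected documents differently, the
short Python sketch below (not part of the release tools) loads the
file-ID list above into a set so that files retaining the older
segmentation can be flagged; the helper names are illustrative only.

  #!/usr/bin/env python3
  # Illustrative sketch only: flag documents that retain the older
  # (pre-fix) sentence segmentation, using the one-file-ID-per-line list
  # in docs/odd_sentence_seg_fileids.txt.

  import os

  def legacy_seg_ids(path='docs/odd_sentence_seg_fileids.txt'):
      """Return the set of file-IDs listed in the file (one per line)."""
      with open(path, encoding='utf-8') as f:
          return {line.strip() for line in f if line.strip()}

  def uses_legacy_segmentation(ltf_filename, legacy_ids):
      """True if this ltf.xml file kept the earlier segmentation."""
      file_id = os.path.basename(ltf_filename).split('.')[0]
      return file_id in legacy_ids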
10.0 Acknowledgements

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123.
Any opinions, findings and conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily reflect
the views of DARPA.

11.0 Copyright

Portions © 2012-2016 aqbaar.com, © 2014-2015 BaligubadleMedia.com,
© 2016 BBC, © 2015-2016 BooramaOnline.Com, © 2015 Cadalool, © 2015-2016
Calayaale, © 2015-2016 Dayaxside, © 2015-2016 Gorgor Online, © 2014-2016
HAATUF.NET, © 2012-2016 Harowo.com, © 2011-2016 Ilays, © 2014-2016
Jowhar.com, www.jowharnews.so, © 2015-2016 Kismaayo, © 2015
Kismaayonews.com, © 2015-2016 Kulmiye News Network, © 2010-2016
afweyne.com, © 2015-2016 Maankaab, © 2016 Mareeg.com, © 2013-2016
Markacadeey, © 2015 Ogadenworld, © 2011-2016 Radio Ergo, © 2014-2016
Radio Muqdisho, © 2008-2016 SBC, © 2015-2016 Simbanews, © 2003, 2015
Singapore Press Holdings Ltd. Co. Regn., © 2008-2016 SomaliTalk.com,
© 2015-2016 starfmke.com, © 2014-2016 Waaheen Media Group, © 2015
Warbaahinta Warar.net, © 2016 Warkii.com, © 2010, 2012-2016 Warfaafiye,
© 2011-2013, 2016 www.hiiraan.com, © 2014-2016 Yoobsannews.com,
© 2016 Trustees of the University of Pennsylvania

12.0 CONTACTS

  Jennifer Tracey    - LORELEI Project Manager
  Stephanie Strassel - LORELEI PI