README FILE FOR LDC CATALOG ID: LDC20__T__

TITLE: LORELEI Somali Representative Language Pack

AUTHORS: Jennifer Tracey, Stephanie Strassel, Dave Graff, Jonathan Wright,
         Song Chen, Neville Ryant, Seth Kulick, Kira Griffitt, Dana Delgado,
         Michael Arrigo

1.0 Introduction

This corpus was developed by the Linguistic Data Consortium for the DARPA
LORELEI Program and consists of over 13 million words of monolingual text in
Somali, over 800,000 words of which have been translated into English. It
also includes about 106,000 Somali words translated from English text.
Nearly 73,000 words are annotated for simple named entities, nearly 23,000
words are annotated for full entity (including nominals and pronouns), and
over 10,000 words are covered by noun phrase chunking. Details about the
volume of data for each annotation type are listed in section 3.3 below.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource languages
in the context of emergent situations like natural disasters or disease
outbreaks. Linguistic resources for LORELEI include Representative Language
Packs for over 2 dozen low resource languages, comprising data, annotations,
basic natural language processing tools, lexicons and grammatical resources.
Representative languages are selected to provide broad typological coverage,
while Incident Languages are selected to evaluate system performance on a
language whose identity is disclosed at the start of the evaluation, and for
which no training data has been provided.

This corpus provides the complete set of monolingual and parallel text,
morphological analysis lexicon, annotations, and tools comprising the
LORELEI Somali Representative Language Pack. The present release supersedes
and replaces the previously published corpus LDC2018T11 - LORELEI Somali
Representative Language Pack - Monolingual and Parallel Text; the main
difference relative to that earlier release is the addition of annotation
data.
For more information about LORELEI language resources, see:
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2020-lorelei-language-packs.pdf

2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized below --
paths shown are relative to the base (root) directory of the package:

  ./README.txt  -- this file

  ./dtds/
  ./dtds/laf.v1.2.dtd
  ./dtds/llf.v1.6.dtd
  ./dtds/ltf.v1.5.dtd
  ./dtds/psm.v1.0.dtd

  ./docs/  -- various tables and listings (see section 9 below)
  ./docs/annotation_guidelines/  -- guidelines for all annotation tasks
                                    included in this corpus
  ./docs/grammatical_sketch/  -- grammatical sketch of Somali

  ./tools/  -- see section 8 below for details about tools provided
  ./tools/ldclib/
  ./tools/ltf2txt/
  ./tools/sent_seg/
  ./tools/som/ne_tagger/
  ./tools/tokenization_parameters.v5.0.yaml

  ./data/monolingual_text/zipped/  -- zip-archive files containing
                                      monolingual "ltf" and "psm" data

  ./data/translation/
      from_som/{som,eng}/  -- translations from Somali to English
      from_eng/            -- translations from English to Somali
          {elicitation,news,phrasebook}/  -- for each of three types of
                                             English data:
              {som,eng}/  -- for each language in each directory, "ltf" and
                             "psm" directories contain corresponding data
                             files

  ./data/annotation/  -- see section 5 below for details about annotation
  ./data/annotation/entity/{simple,full}/
  ./data/annotation/np_chunking/
  ./data/annotation/twitter_tokenization/

  ./data/lexicon/

2.2 File Name Conventions

The file names assigned to individual documents in this corpus provide the
following information about the document:

  Language   3-letter abbrev.
  Genre      2-letter abbrev.
  Source     6-digit numeric ID assigned to data provider
  Date       8-digit numeric (YYYYMMDD: year, month, day)
  Global-ID  9-character alphanumeric ID assigned to this document

Those five fields are joined by underscore characters, yielding a
32-character file-ID; three portions of the document file-ID are used to set
the name of the zip file that holds the document: the Language and Genre
fields, and the first 6 characters of the Global-ID.

The 2-letter codes used for genre are as follows:

  DF -- discussion forum
  NW -- news
  RF -- reference (e.g. Wikipedia)
  SN -- social network (Twitter)
  WL -- web-log
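As a minimal illustration of this naming convention, the following Python
sketch (not part of the released tools) splits a document file-ID into its
five fields and derives the zip-archive name from the Language and Genre
fields plus the first six characters of the Global-ID. The example file-ID
and the exact ".ltf.zip" suffix are invented for illustration only; the
actual file names under data/monolingual_text/zipped/ are authoritative.

  def parse_file_id(file_id):
      # Split a LORELEI document file-ID into its five fields.
      lang, genre, source, date, global_id = file_id.split("_")
      return {
          "language":  lang,       # 3-letter language abbreviation
          "genre":     genre,      # 2-letter genre code (DF, NW, RF, SN, WL)
          "source":    source,     # 6-digit numeric source ID
          "date":      date,       # 8-digit YYYYMMDD
          "global_id": global_id,  # 9-character alphanumeric document ID
      }

  def zip_name(file_id):
      # Zip archives are named from Language, Genre, and the first six
      # characters of the Global-ID; the suffix here is an assumption.
      f = parse_file_id(file_id)
      return "{}_{}_{}.ltf.zip".format(f["language"], f["genre"],
                                       f["global_id"][:6])

  # Example with an invented file-ID:
  #   zip_name("SOM_NW_000123_20160101_A1B2C3D4E") -> "SOM_NW_A1B2C3.ltf.zip"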
3.0 Content Summary

3.1 Monolingual Text

  Genre    #Docs     #Words
  DF       7,791   2,112,681
  NW      29,728   6,285,434
  RF           4       4,663
  SN       7,493     107,871
  WL      16,468   4,969,323

Note that the SN (Twitter) data cannot be distributed directly by LDC, due
to the Twitter Terms of Use. The file "docs/twitter_info.tab" (described in
Section 8.2 below) provides the necessary information for users to fetch the
particular tweets directly from Twitter. LTF files for all other genres are
stored in ./data/monolingual_text/zipped/.

3.2 Parallel Text

  Type     Genre   #Docs    #Words
  -------
  FromEng  EL          2    18,549
  FromEng  NW        190    87,438
  -------
  ToEng    DF        977   217,766
  ToEng    NW      2,449   448,075
  ToEng    WL        374   158,497
  -------

3.3 Annotation

  AnnotType    Genre   #Docs   #Words
  -----------
  EntityFull   DF         24    4,411
  EntityFull   NW         57   11,220
  EntityFull   SN         99    1,462
  EntityFull   WL         15    5,799
  -----------
  EntitySimp   DF         96   16,344
  EntitySimp   NW        216   35,346
  EntitySimp   SN        325    4,742
  EntitySimp   WL         49   16,449
  -----------
  NPChunking   DF         12    1,869
  NPChunking   NW         28    5,756
  NPChunking   SN         80    1,169
  NPChunking   WL          4    1,955
  -----------

4.0 Data Collection and Parallel Text Creation

Both monolingual text collection and parallel text creation involve a
combination of manual and automatic methods. These methods are described in
the sections below.

4.1 Monolingual Text Collection

Data is identified for collection by native speaker "data scouts," who
search the web for suitable sources, designating individual documents that
are in the target language and discuss the topics of interest to the
LORELEI program (humanitarian aid and disaster relief). Each document
selected for inclusion in the corpus is then harvested, along with the
entire website when suitable. Thus the monolingual text collection contains
some documents which have been manually selected and/or reviewed and many
others which have been automatically harvested and were not subject to
manual review.

4.2 Parallel Text Creation

Parallel text for LORELEI was created using three different methods --
professional translation, crowdsourced translation, and collection of found
(pre-existing) parallel text -- and each LORELEI language may have parallel
text from one or more of these methods. In addition to translation from
each of the LORELEI languages to English, each language pack contains a
"core" set of English documents that were translated into each of the
LORELEI Representative Languages. These documents consist of news
documents, a phrasebook of conversational sentences, and an elicitation
corpus of sentences designed to elicit a variety of grammatical structures.

All translations are aligned at the sentence level. For professional and
crowdsourced translation, the segments align one-to-one between the source
and target language (i.e. segment 1 in the English aligns with segment 1 in
the source language). For found parallel text, automatic alignment is
performed and a separate alignment file provides information about how the
segments in the source and translation are aligned. Professionally
translated data has one translation for each source document, while
crowdsourced translations have up to four translations for each source
document, designated by A, B, C, or D appended to the file name on the
multiple translation versions.

5.0 Annotation

Three types of annotation are present in this corpus:

  - Simple Named Entity tags names of persons, organizations, geopolitical
    entities, and locations (including facilities).
  - Full Entity also tags nominal and pronominal mentions of entities.
  - Noun Phrase Chunking identifies the positions and extents of noun
    phrases.

Details about each of these annotation tasks can be found in
docs/annotation_guidelines/.

SPECIAL NOTE ABOUT ANNOTATIONS ON TWITTER DATA:

The LDC cannot redistribute text data from Twitter, and this includes files
containing annotation. Where LAF XML and annotation table files have strings
of text from other sources, annotations of Twitter data instead have strings
with underscores ("_") replacing all non-white-space characters. Software is
included in this release that enables users to download a given list of
Tweets (assuming the Tweets are still available online), and apply the same
conditioning and reformatting that was done by LDC prior to annotation --
see section 8.2 below (ldclib) for more details on the software.

In order to confirm that your own download and conditioning yields results
that match those of the LDC, we provide a set of LTF XML files (one for each
annotated Tweet), in which the text content has been modified by replacing
each non-white-space character with an underscore ("_"), so that character
offsets are preserved for word tokens and spans of annotations. These
"placeholder" LTF XML files are in data/annotation/twitter_tokenization/.
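For reference, the underscore substitution just described can be reproduced
with a single regular expression. The following minimal Python sketch (not
part of the released tools) replaces every non-white-space character while
leaving white space, and therefore all character offsets, untouched:

  import re

  def scrub(text):
      # Replace each non-white-space character with "_", preserving all
      # white space (including line breaks) so that the character offsets
      # of word tokens and annotation spans are unchanged.
      return re.sub(r"\S", "_", text)

  # e.g. scrub("ab cd\nef") == "__ __\n__"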
6.0 Data Processing and Character Normalization for LORELEI

Most of the content has been harvested from various web sources using an
automated system that is driven by manual scouting for relevant material.
Some content may have been harvested manually, or by means of ad-hoc
scripted methods for sources with unusual attributes.

All harvested content was initially converted from its original HTML form
into a relatively uniform XML format; this stage of conversion eliminated
irrelevant content (menus, ads, headers, footers, etc.), and placed the
content of interest into a simplified, consistent markup structure. The
"homogenized" XML format then served as input for the creation of a
reference "raw source data" (rsd) plain text form of the web page content;
at this stage, the text was also conditioned to normalize white-space
characters, and to apply transliteration and/or other character
normalization, as appropriate to the given language.

7.0 Overview of XML Data Structures

7.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of
tags needed to represent the structure of the relevant text as seen by the
human web-page reader. When the text content of the XML file is extracted
to create the "rsd" format (which contains no markup at all), the markup
structure is preserved in a separate "primary source markup" (psm.xml)
file, which enumerates the structural tags in a uniform way, and indicates,
by means of character offsets into the rsd.txt file, the spans of text
contained within each structural markup element.

For example, in a discussion-forum or web-log page, there would be a
division of content into the discrete "posts" that make up the given
thread, along with "quote" regions and paragraph breaks within each post.
After the HTML has been reduced to uniform XML, and the tags and text of
the latter format have been separated, information about each structural
tag is kept in a psm.xml file, preserving the type of each relevant
structural element, along with its essential attributes ("post_author",
"date_time", etc.), and the character offsets of the text span comprising
its content in the corresponding rsd.txt file.

7.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully
segmented and tokenized version of the text content for a given web page.
Segments (sentences) and the tokens (words) are marked off by XML tags (SEG
and TOKEN), with "id" attributes (which are only unique within a given XML
file) and character offset attributes relative to the corresponding rsd.txt
file; TOKEN tags have additional attributes to describe the nature of the
given word token.

The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly by
suitable punctuation in the original source data. To the extent that
sentence boundaries cannot be accurately detected (due to variability or
ambiguity in the source data), the segmentation process will tend to err
more often on the side of missing actual sentence boundaries, and (we hope)
less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play
particular roles in web-based text (e.g. URLs, email addresses and
hashtags). To the extent that word boundaries are not explicitly marked in
the source text, the LTF tokenization is intended to divide the raw-text
character stream into units that correspond to "words" in the linguistic
sense (i.e. basic units of lexical meaning).

Software is included to convert ltf.xml files to "raw source data" plain
text files ("rsd.txt") -- see section 8.1 below. The character offsets used
in LTF and LAF xml, and in other types of annotation data, are based on the
"rsd.txt" files, which contain just the text that is visible to a person
reading the original source, with normalized white-space characters
(including line breaks), but without markup of any kind.
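As a minimal illustration of how the SEG and TOKEN structure can be
consumed, the following Python sketch (standard library only, not part of
the released tools) walks an ltf.xml file and collects each segment's
tokens with their character offsets. The attribute names "start_char" and
"end_char" are assumptions based on the description above; dtds/ltf.v1.5.dtd
is the authoritative reference for element and attribute names.

  import xml.etree.ElementTree as ET

  def segments(ltf_path):
      # Yield (segment id, [(token text, start offset, end offset), ...])
      # for each SEG element; offsets refer to the corresponding rsd.txt.
      root = ET.parse(ltf_path).getroot()
      for seg in root.iter("SEG"):
          tokens = [(tok.text, tok.get("start_char"), tok.get("end_char"))
                    for tok in seg.iter("TOKEN")]
          yield seg.get("id"), tokens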
7.3 LAF.xml -- Logical Annotation Format Data

The "laf.xml" data format provides a generic structure for presenting
annotations on the text content of a given ltf.xml file; see the associated
DTD file in the "dtds" directory. Note that each type of annotation (simple
named entity, full entity, NP-chunking) uses the basic XML elements of LAF
in different ways.

7.4 Morphological Analysis Table

The file data/lexicon/som_morph_analysis.v1.tab contains 12 columns, as
follows:

  column  1:  lemid   -- numeric lemma identifier
  column  2:  wrdid   -- numeric word-form identifier
  column  3:  jhuid   -- numeric analysis identifier (unique to each row)
  column  4:  pos     -- "macro" part-of-speech label (e.g. "VERB")
  column  5:  cit     -- citation form of the lemma
  column  6:  orth    -- orthography of the word-form
  column  7:  morph   -- detailed POS labeling with segmentation
  column  8:  segs    -- segmented orthography of the word-form
  column  9:  tier    -- "tier#" classifier
  column 10:  seq     -- "ranking" for this analysis
  column 11:  hgloss  -- "human readable" gloss
  column 12:  mgloss  -- "machine readable" gloss

The morphological analyses for the lexicon entries contained in this
release are automatically generated hypotheses based on a combination of
multiple morphological models and lexical resources, including curated
prototypical and/or irregular forms. In most cases they have not been
manually edited or corrected. Thus while they have potential value for both
table-lookup-based morphological analysis and morphological system
training, they should not be considered as a comprehensive fully verified
ground truth.

The analyses contained in this release were generated prior to 2018.
Updated and more comprehensive data releases (including for many additional
languages) and documentation regarding the Leipzig-based annotation
conventions used in this release may be obtained at
http://www.unimorph.org, the home page of the Johns Hopkins University
Unimorph project.

8.0 Software tools included in this release

8.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned to
recreate exactly the "raw source data" text stream (the rsd.txt file) from
which the LTF was created. The tools described here can be used to apply
that conditioning, either to a directory or to a zip archive file
containing ltf.xml data. In either case, the scripts validate each output
rsd.txt stream by comparing its MD5 checksum against the reference MD5
checksum of the original rsd.txt file from which the LTF was created. (This
reference checksum is stored as an attribute of the "DOC" element in the
ltf.xml structure; there is also an attribute that stores the character
count of the original rsd.txt file.)
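That checksum validation can be approximated as follows. This is a minimal
Python sketch only; the released ltf2rsd.perl and ltfzip2rsd.perl scripts
are the reference implementation, and the DOC attribute name "raw_text_md5"
is an assumption to be checked against dtds/ltf.v1.5.dtd.

  import hashlib
  import xml.etree.ElementTree as ET

  def rsd_matches_ltf(rsd_path, ltf_path):
      # Compare the MD5 of an rsd.txt file against the reference checksum
      # stored on the DOC element of the corresponding ltf.xml file.
      with open(rsd_path, "rb") as f:
          actual = hashlib.md5(f.read()).hexdigest()
      root = ET.parse(ltf_path).getroot()
      doc = root if root.tag == "DOC" else root.find(".//DOC")
      return doc is not None and doc.get("raw_text_md5") == actual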
Each script contains user documentation as part of the script content; you
can run "perldoc" to view the documentation as a typical unix man page, or
you can simply view the script content directly to read the documentation.
Also, running either script without any command-line arguments will cause
it to display a one-line synopsis of its usage, and then exit.

  ltf2rsd.perl     -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl  -- extract and convert ltf.xml files from zip archives

Special note about Twitter data: as explained in section 5 above, this
corpus includes "scrubbed" versions of LTF XML files for individual Tweets,
where the original text characters (except for spaces) are replaced by
underscores (in data/annotation/twitter_tokenization/), in order to comply
with Twitter Terms of Use. Running "ltf2rsd.perl" directly on these
"scrubbed" files will yield warnings about MD5 mismatches, which is to be
expected, because the MD5 value stored in each Twitter LTF XML file is
based on the original text. After using the "ldclib" software (described in
the next section) to download and condition Twitter data, the resulting LTF
XML files should have both the original text and the matching MD5 values;
that process also creates the corresponding rsd.txt files.

8.2 ldclib -- general text conditioning, twitter harvesting

The "bin/" subdirectory of this package contains three executable scripts
(written in Ruby):

  create_rsd.rb       -- convert general xml or plain-text formats to "raw
                         source data" (rsd.txt), by removing markup tags
                         and applying sentence segmentation
  token_parse.rb      -- convert rsd.txt format into ltf.xml
  get_tweet_by_id.rb  -- download and condition Twitter data

Due to the Twitter Terms of Use, the text content of individual tweets
cannot be redistributed by the LDC. As a result, users must download the
tweet contents directly from Twitter. The twitter-processing software
provided in the tools/ directory enables users to perform the same
normalization applied by LDC and ensure that the user's version of the
tweet matches the version used by LDC, by verifying that the md5sum of the
user-downloaded and processed tweet matches the md5sum provided in the
twitter_info.tab file (a minimal illustration of this check is given at the
end of section 8 below). Users must have a developer account with Twitter
in order to download tweets, and the tool does not replace or circumvent
the Twitter API for downloading tweets.

The ./docs/twitter_info.tab file provides the twitter download id for each
tweet, along with the LORELEI file name assigned to that tweet and the
md5sum of the processed text from the tweet.

The file "README.md" in this directory provides details on how to install
and use the source code in this directory in order to condition text data
that the user downloads directly from Twitter and produce both the
normalized raw text and the segmented, tokenized LTF.xml output. All
LDC-developed supporting files (models, configuration files, library
modules, etc.) are included, either in the "lib" subdirectory (next to
"bin"), or else in the parent ("tools") directory. Please refer to the
README.md file that accompanies this software package.

8.3 sent_seg -- apply sentence segmentation to raw text

The Python tools in this directory are used as part of the conditioning
done by "create_rsd.rb" in the "ldclib" package. Please refer to the
README.rst file included with the package.

8.4 ne_tagger -- Named-Entity tagger for Somali

Please refer to the tools/som/ne_tagger/README.rst file for information
about usage and performance.
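The verification step mentioned in section 8.2 can be illustrated with the
minimal Python sketch below (the ldclib package remains the authoritative
implementation). The column order follows the description of
docs/twitter_info.tab in section 9.0; whether the reference checksum is
computed over the UTF-8 encoding of the conditioned text is an assumption.

  import csv
  import hashlib

  def load_twitter_info(path="docs/twitter_info.tab"):
      # Map doc uid -> reference md5 of the normalized tweet text.
      with open(path, encoding="utf-8") as f:
          return {row[0]: row[2] for row in csv.reader(f, delimiter="\t")}

  def tweet_matches(doc_uid, conditioned_text, info):
      # Compare a locally conditioned tweet against LDC's reference md5.
      actual = hashlib.md5(conditioned_text.encode("utf-8")).hexdigest()
      return info.get(doc_uid) == actual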
9.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release) contains
six files documenting various characteristics of the source data:

  source_codes.txt - contains tab-separated columns: genre, source code,
      source name, and base url for each source in the release

  twitter_info.tab - contains tab-separated columns: doc uid, tweet id,
      normalized md5 of the tweet text, and tweet author id for all tweets
      in the release

  urls.tab - contains tab-separated columns: doc uid and url. Note that the
      url column is empty for documents from older releases for which the
      url is not available; they are included here so that the uid column
      can serve as a document list for the package.

  char_tally.SOM.tab - contains tab-separated columns: doc uid, number of
      non-whitespace characters, number of non-whitespace characters in the
      expected script, and number of anomalous (non-printing) characters
      for each document in the release

  odd_sentence_seg_fileids.txt - lists the file-IDs of files where older
      segmentation logic was used to process the data (see section 10.1
      below for details)

  annotation_lacks_translation.tab - lists file-IDs and annotation type(s)
      for any files that were annotated but not part of the translation set

In addition, the grammatical sketch and annotation guidelines contents
described in earlier sections of this README are found in this directory.

10.0 Known Issues

10.1 Differences in sentence segmentation logic for some data files

Late in the course of data collection for this language, a flaw was
discovered in the process that applied automatic sentence segmentation,
which caused false sentence breaks to be inserted around strings that
formed the content of anchor tags in the original (as harvested) HTML. In
general, the problem affects blog sources (WL) the most, and news agency
sources (NW) the least, owing to the relative likelihood that content
authors will make an effort to treat some portion of a sentence as the
content of an anchor tag.

This flaw in the segmentation code was fixed, and most of the data in this
release has been processed into ltf.xml format using the newer version of
sentence segmentation. (NB: The new version, being automatic, is still not
perfect, and may lead to a slightly higher miss-rate for "true" sentence
boundaries, but on balance, the overall sentence segmentation should be
better than with the earlier version of the process, especially in the WL
genre.)

Unfortunately, this fix of the sentence segmenter didn't occur until after
files had been selected and sent out for translation, so the English
translation files, and various forms of annotation (full entity, simple
named entity, etc.), have been based on the previous version of
segmentation. In order to preserve the alignment between English
translations, other annotations, and the source-language data, the newer
segmentation has NOT been applied to this subset of the data. There is a
file in the "docs" directory that lists the file-IDs of the files where the
older segmentation logic has been retained (one file-ID per line):

  docs/odd_sentence_seg_fileids.txt

The files listed there are the ones where the newer segmentation logic
would have produced a different outcome, but the newer logic has not been
applied, because doing so would disrupt the alignment of the corresponding
translation.
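One practical use of docs/odd_sentence_seg_fileids.txt is to separate
documents that retain the older segmentation from those processed with the
newer segmenter. A minimal Python sketch (not part of the released tools)
is shown below; the variable all_file_ids is hypothetical.

  def load_odd_seg_ids(path="docs/odd_sentence_seg_fileids.txt"):
      # Return the set of file-IDs that retain the older segmentation.
      with open(path, encoding="utf-8") as f:
          return {line.strip() for line in f if line.strip()}

  # Usage with a hypothetical list of document file-IDs:
  #   odd = load_odd_seg_ids()
  #   newer_seg = [fid for fid in all_file_ids if fid not in odd]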
11.0 Acknowledgements

The authors would like to acknowledge the following contributors to this
corpus: Brian Gainor, Ann Bies, Justin Mott, Neil Kuster, the University of
Maryland Applied Research Laboratory for Intelligence and Security (ARLIS),
formerly the UMD Center for Advanced Study of Language (CASL), and our team
of Somali annotators.

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions,
findings and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of DARPA.

12.0 Copyright

Portions © 2002-2007, 2009-2010 Agence France Presse, © 2000 American
Broadcasting Company, © 2012-2016 Aaqbaar Online, © 2014-2015
BaligubadleMedia.com, © 2016 BBC, © 2000 Cable News Network, LP, LLLP,
© 2015 Cadalool, © 2008 Central News Agency (Taiwan), © 2015-2016
Dayaxside, © 1989 Dow Jones & Company, Inc., © 2014-2016 HAATUF.NET,
© 2014-2016 Jowhar somali news leader, © 2015 Kismaayonews.com, © 2005 Los
Angeles Times - Washington Post News Service, Inc., © 2016 Mareeg Media,
© 2013-2016 Markacadeey, © 2000 National Broadcasting Company, Inc.,
© 2003, 2015 New Press Media Co., Ltd., © 1999, 2005-2006, 2010 New York
Times, © 2015 Ogadenworld, © 2000 Public Radio International, © 2011-2016
Radio Ergo, © 2015-2016 Radio Kulmiye, © 2014-2016 Radio Muqdisho,
© 2008-2016 SBC, © 2015-2016 Radio Simba News, © 2008-2016 SomaliTalk.com,
© 2003, 2005-2008, 2010 The Associated Press, © 2014-2016 Waaheen Media
Group, © 2022 WardheerNews, © 2010, 2012-2016 Warfaafiye, © 2011-2013, 2016
www.hiiraan.com, © 2012-2016 www.rssing.com, © 2014-2016 Yoobsan News,
© 2003, 2005-2008 Xinhua News Agency, © 2016, 2018, 2022 Trustees of the
University of Pennsylvania

13.0 Contacts

Stephanie Strassel - LORELEI PI