README FILE FOR LDC CATALOG ID: LDC2018T04

TITLE: LORELEI Amharic Representative Language Pack - Monolingual and
       Parallel Text

AUTHORS: Jennifer Tracey, Dave Graff, Stephanie Strassel, Xiaoyi Ma,
         Jonathan Wright

1.0 Introduction

LORELEI Amharic Representative Language Pack, Monolingual and Parallel Text
was developed by the Linguistic Data Consortium for the DARPA LORELEI Program
and consists of approximately 25 million words of monolingual Amharic text,
approximately 600,000 of which are translated into English. Approximately
80,000 words were also translated from English into Amharic.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource languages
in the context of emergent situations like natural disasters or disease
outbreaks. Linguistic resources for LORELEI include Representative Language
Packs for over two dozen low resource languages, comprising data, annotations,
basic natural language processing tools, lexicons and grammatical resources.
Representative languages are selected to provide broad typological coverage,
while Incident Languages are selected to evaluate system performance on a
language whose identity is disclosed at the start of the evaluation, and for
which no training data has been provided.

This corpus comprises the complete set of monolingual text and parallel text
from the LORELEI Amharic Representative Language Pack. The other components
of the Amharic Representative Language Pack appear in a separate corpus. For
more information about LORELEI language resources, see
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2016-lorelei-language-packs.pdf.

2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized below --
paths shown are relative to the base (root) directory of the package:

  ./docs/README.txt -- this file
  ./dtds/
  ./dtds/ltf.v1.5.dtd
  ./dtds/psm.v1.0.dtd
  ./tools/
  ./tools/ltf2txt            -- software for extracting raw text from ltf.xml
                                data files
  ./tools/twitter-processing -- software for conditioning Twitter text data
  ./data/monolingual_text/zipped/ -- zip-archive files containing monolingual
                                     "ltf" and "psm" data
  ./data/translation/
      found/{amh,eng,sentence_alignment} -- found parallel text with sentence
                                            alignments between the Amharic
                                            and English documents
      from_amh/{amh,eng}/                -- translations from Amharic to English
      from_eng/{amh,eng}/                -- translations from English to Amharic

For each language in each direction, "ltf" and "psm" directories contain the
corresponding data files.

2.2 File Name Conventions

There are 71 *.ltf.zip files in the monolingual_text/zipped directory,
together with the same number of *.psm.zip files. Each {ltf,psm}.zip file
pair contains an equal number of corresponding data files. The "file-ID"
portion of each zip file name corresponds to common substrings in the file
names of all the data files contained in that archive. For example:

  ./data/monolingual_text/zipped/AMH_DF_G00201.ltf.zip contains:
    ltf/AMH_DF_001617_20150317_G00201JZZ.ltf.xml
    ltf/AMH_DF_001617_20150317_G00201K00.ltf.xml
    ...

  ./data/monolingual_text/zipped/AMH_WL_G0022D.psm.zip contains:
    psm/AMH_WL_001131_20160312_G0022DDM0.psm.xml
    psm/AMH_WL_001142_20151113_G0022DDM5.psm.xml
    ...

The file names assigned to individual documents within the zip archive files
provide the following information about the document:

  Language   3-letter abbrev.
  Genre      2-letter abbrev.
  Source     6-digit numeric ID assigned to data provider
  Date       8-digit numeric: YYYYMMDD (year, month, day)
  Global-ID  9-character alphanumeric ID assigned to this document

Those five fields are joined by underscore characters, yielding a
32-character file-ID; three portions of the document file-ID are used to set
the name of the zip file that holds the document: the Language and Genre
fields, and the first 6 characters of the Global-ID.

The 2-letter codes used for genre are as follows:

  DF -- discussion forum
  NW -- news
  RF -- reference (e.g. Wikipedia)
  SN -- social network (Twitter)
  WL -- web-log
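As an illustration of the naming scheme, the following Python sketch parses a
document file name into its five fields and derives the name of the zip
archive that holds it. The script and function names are illustrative only;
they are not part of this package.

  # file_id_to_zip.py -- minimal sketch of the file naming conventions above
  def parse_file_id(file_name):
      """Split e.g. 'AMH_DF_001617_20150317_G00201JZZ.ltf.xml' into fields."""
      base = file_name.split('.', 1)[0]  # drop the '.ltf.xml' / '.psm.xml' part
      language, genre, source, date, global_id = base.split('_')
      return {'language': language, 'genre': genre, 'source': source,
              'date': date, 'global_id': global_id}

  def zip_name_for(file_name, kind='ltf'):
      """Return the zip archive expected to contain the given document."""
      f = parse_file_id(file_name)
      return '{}_{}_{}.{}.zip'.format(f['language'], f['genre'],
                                      f['global_id'][:6], kind)

  if __name__ == '__main__':
      doc = 'AMH_DF_001617_20150317_G00201JZZ.ltf.xml'
      print(zip_name_for(doc))         # AMH_DF_G00201.ltf.zip
      print(zip_name_for(doc, 'psm'))  # AMH_DF_G00201.psm.zip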
3.0 Content Summary

3.1 Monolingual Text

  Genre    #Docs    #Tokens
  DF         614     267392
  NW        8768    4581208
  RF           6       5555
  SN       11589     216905
  WL       37343   20050285
  Total    58320   25121345

Note that the SN (Twitter) data cannot be distributed directly by LDC, due to
the Twitter Terms of Use. The file "docs/twitter_info.tab" (described in
Section 7.0 below) provides the necessary information for users to fetch the
particular tweets directly from Twitter.

3.2 Parallel Text

  Type     Genre   #Docs   #Segs   #Words
  Found    NW         23     392    11906
  FromEng  EL          2    3723    14226
  FromEng  NW        190    3913    64698
  ToEng    DF         25    1647    19662
  ToEng    NW        744   19549   240007
  ToEng    WL        394   31660   335759
  Total             1378   60884   686258

4.0 Data Collection and Parallel Text Creation

Both monolingual text collection and parallel text creation involve a
combination of manual and automatic methods. These methods are described in
the sections below.

4.1 Monolingual Text Collection

Data is identified for collection by native speaker "data scouts," who search
the web for suitable sources, designating individual documents that are in
the target language and discuss the topics of interest to the LORELEI program
(humanitarian aid and disaster relief). Each document selected for inclusion
in the corpus is then harvested, along with the entire website when suitable.
Thus the monolingual text collection contains some documents which have been
manually selected and/or reviewed and many others which have been
automatically harvested and were not subject to manual review.

4.2 Parallel Text Creation

Parallel text for LORELEI was created using three different methods, and each
LORELEI language may have parallel text from one or all of these methods. In
addition to translation from each of the LORELEI languages to English, each
language pack contains a "core" set of English documents that were translated
into each of the LORELEI Representative Languages. These documents consist of
news documents, a phrasebook of conversational sentences, and an elicitation
corpus of sentences designed to elicit a variety of grammatical structures.

All translations are aligned at the sentence level. For professional and
crowdsourced translation, the segments align one-to-one between the source
and target language (i.e. segment 1 in the English aligns with segment 1 in
the source language). For found parallel text, automatic alignment is
performed and a separate alignment file provides information about how the
segments in the source and translation are aligned. Professionally translated
data has one translation for each source document, while crowdsourced
translations have up to four translations for each source document,
designated by A, B, C, or D appended to the file names of the multiple
translation versions.
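Because segments align one-to-one for professional and crowdsourced
translations, a source document and its translation can be paired segment by
segment directly from their ltf.xml files. The Python sketch below is one way
to do this; the assumption that each SEG element carries its text in an
ORIGINAL_TEXT child reflects typical LDC LTF layouts and should be verified
against ./dtds/ltf.v1.5.dtd. Found parallel text is not one-to-one and
requires the sentence_alignment files instead.

  # pair_segments.py -- minimal sketch for 1:1 aligned translation documents
  import xml.etree.ElementTree as ET

  def read_segments(ltf_path):
      """Return (seg_id, text) pairs in document order from an ltf.xml file."""
      root = ET.parse(ltf_path).getroot()
      segs = []
      for seg in root.iter('SEG'):
          orig = seg.find('ORIGINAL_TEXT')  # assumed child element name
          segs.append((seg.get('id'), orig.text if orig is not None else ''))
      return segs

  def pair_translation(source_ltf, translation_ltf):
      """Pair source and translation segments by position (1:1 alignment)."""
      src, tgt = read_segments(source_ltf), read_segments(translation_ltf)
      assert len(src) == len(tgt), 'expected a one-to-one segment alignment'
      return list(zip(src, tgt))

  # usage (paths are placeholders):
  #   pair_translation('from_amh/amh/ltf/<doc>.ltf.xml',
  #                    'from_amh/eng/ltf/<doc>.ltf.xml')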
5.0 Data Processing and Character Normalization for LORELEI

Most of the content has been harvested from various web sources using an
automated system that is driven by manual scouting for relevant material.
Some content may have been harvested manually, or by means of ad-hoc scripted
methods for sources with unusual attributes. All harvested content was
initially converted from its original HTML form into a relatively uniform XML
format; this stage of conversion eliminated irrelevant content (menus, ads,
headers, footers, etc.), and placed the content of interest into a
simplified, consistent markup structure.

The "homogenized" XML format then served as input for the creation of a
reference "raw source data" (rsd) plain text form of the web page content; at
this stage, the text was also conditioned to normalize white-space
characters, and to apply transliteration and/or other character
normalization, as appropriate to the given language.

Web-harvested text written in Ethiopic script tends to show free variation in
the use of end-of-sentence punctuation, between a single punctuation mark
(U+1362 ። "Ethiopic full stop") and two conjoined punctuation marks that
yield the same appearance (U+1361 ፡ "Ethiopic wordspace", used twice in
succession). In order to simplify automatic sentence segmentation, all
occurrences of the latter pattern have been replaced by the former, as part
of the normal processing of HTML content into raw source data.

6.0 Overview of XML Data Structures

6.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of
tags needed to represent the structure of the relevant text as seen by the
human web-page reader. When the text content of the XML file is extracted to
create the "rsd" format (which contains no markup at all), the markup
structure is preserved in a separate "primary source markup" (psm.xml) file,
which enumerates the structural tags in a uniform way, and indicates, by
means of character offsets into the rsd.txt file, the spans of text contained
within each structural markup element.

For example, in a discussion-forum or web-log page, there would be a division
of content into the discrete "posts" that make up the given thread, along
with "quote" regions and paragraph breaks within each post. After the HTML
has been reduced to uniform XML, and the tags and text of the latter format
have been separated, information about each structural tag is kept in a
psm.xml file, preserving the type of each relevant structural element, along
with its essential attributes ("post_author", "date_time", etc.), and the
character offsets of the text span comprising its content in the
corresponding rsd.txt file.

6.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully
segmented and tokenized version of the text content for a given web page.
Segments (sentences) and tokens (words) are marked off by XML tags (SEG and
TOKEN), with "id" attributes (which are only unique within a given XML file)
and character offset attributes relative to the corresponding rsd.txt file;
TOKEN tags have additional attributes to describe the nature of the given
word token.

The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly by
suitable punctuation in the original source data. To the extent that sentence
boundaries cannot be accurately detected (due to variability or ambiguity in
the source data), the segmentation process will tend to err more often on the
side of missing actual sentence boundaries, and (we hope) less often on the
side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play particular
roles in web-based text (e.g. URLs, email addresses and hashtags). To the
extent that word boundaries are not explicitly marked in the source text, the
LTF tokenization is intended to divide the raw-text character stream into
units that correspond to "words" in the linguistic sense (i.e. basic units of
lexical meaning).
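Since both psm.xml and ltf.xml locate their contents by character offsets
into the corresponding rsd.txt file, the text of any segment can be recovered
by slicing the rsd.txt character stream. The Python sketch below does this
for the SEG elements of one document; the start_char/end_char attribute names
and the inclusive end offset are assumptions based on typical LDC LTF layouts
and should be checked against ./dtds/ltf.v1.5.dtd. The ltf2txt tools
described in Section 7.1 perform the full, checksum-verified reconstruction.

  # seg_offsets.py -- minimal sketch; attribute names and offset conventions
  # are assumptions to be verified against ./dtds/ltf.v1.5.dtd
  import xml.etree.ElementTree as ET

  def seg_spans(ltf_path, rsd_path):
      """Yield (seg_id, text) by slicing rsd.txt with each SEG's offsets."""
      with open(rsd_path, encoding='utf-8') as f:
          rsd = f.read()
      root = ET.parse(ltf_path).getroot()
      for seg in root.iter('SEG'):
          start = int(seg.get('start_char'))
          end = int(seg.get('end_char'))
          # assumes an inclusive end offset; use rsd[start:end] if exclusive
          yield seg.get('id'), rsd[start:end + 1]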
7.0 Software tools included in this release

7.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned to
recreate exactly the "raw source data" text stream (the rsd.txt file) from
which the LTF was created. The tools described here can be used to apply that
conditioning, either to a directory or to a zip archive file containing
ltf.xml data. In either case, the scripts validate each output rsd.txt stream
by comparing its MD5 checksum against the reference MD5 checksum of the
original rsd.txt file from which the LTF was created. (This reference
checksum is stored as an attribute of the "DOC" element in the ltf.xml
structure; there is also an attribute that stores the character count of the
original rsd.txt file.)

Each script contains user documentation as part of the script content; you
can run "perldoc" to view the documentation as a typical unix man page, or
you can simply read the documentation by viewing the script content directly.
Also, running either script without any command-line arguments will cause it
to display a one-line synopsis of its usage, and then exit.

  ltf2rsd.perl    -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

7.2 "twitter-processing" (source code written in Ruby)

Due to the Twitter Terms of Use, the text content of individual tweets cannot
be redistributed by the LDC. As a result, users must download the tweet
contents directly from Twitter and condition/normalize the text in a manner
equivalent to what was done by the LDC, in order to reproduce the Amharic raw
text that was used by LDC for annotation (to be released separately).

The twitter-processing software provided in the tools/ directory enables
users to perform this normalization and ensure that their version of each
tweet matches the version used by LDC, by verifying that the md5sum of the
user-downloaded and processed tweet matches the md5sum provided in the
twitter_info.tab file. Users must have a developer account with Twitter in
order to download tweets, and the tool does not replace or circumvent the
Twitter API for downloading tweets. The twitter_info.tab file provides the
twitter download id for each tweet, along with the LORELEI file name assigned
to that tweet and the md5sum of the processed text from the tweet.

The file "README.md" in the tools/twitter-processing/ directory provides
details on how to install and use the source code in this directory in order
to condition text data that the user downloads directly from Twitter and
produce both the normalized raw text and the segmented, tokenized LTF.xml
output.
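Once a tweet has been downloaded and conditioned with the twitter-processing
tool, the resulting normalized text can be checked against the reference
checksum in docs/twitter_info.tab. The Python sketch below shows that
comparison; the column order follows the description of twitter_info.tab in
Section 8.0, the file paths are placeholders, and the checksum is assumed to
be computed over the processed raw-text file produced by the tool.

  # check_tweet_md5.py -- minimal sketch; paths and the exact checksum input
  # are assumptions, so treat the provided Ruby tool as authoritative
  import csv
  import hashlib

  def load_reference_md5s(twitter_info_path):
      """Map doc uid -> normalized md5 (columns as described in Section 8.0)."""
      ref = {}
      with open(twitter_info_path, encoding='utf-8') as f:
          for row in csv.reader(f, delimiter='\t'):
              doc_uid, tweet_id, md5, author_id = row[:4]
              ref[doc_uid] = md5
      return ref

  def matches_reference(doc_uid, processed_text_path, ref):
      """Compare the md5 of the locally processed tweet text to LDC's value."""
      with open(processed_text_path, 'rb') as f:
          local_md5 = hashlib.md5(f.read()).hexdigest()
      return local_md5 == ref.get(doc_uid)

  # usage (placeholder paths):
  #   ref = load_reference_md5s('docs/twitter_info.tab')
  #   matches_reference('<doc uid>', 'processed/<doc>.rsd.txt', ref)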
8.0 Documentation included in this release

In addition to this README and the file described in Section 9.0, the ./docs
folder (relative to the root directory of this release) contains four files:

  char_tally.{lng}.tab -- tab-separated columns: doc uid, number of
      non-whitespace characters, number of non-whitespace characters in the
      expected script, and number of anomalous (non-printing) characters for
      each document in the release

  source_codes.txt -- tab-separated columns: genre, source code, source name,
      and base url for each source in the release

  twitter_info.tab -- tab-separated columns: doc uid, tweet id, normalized md5
      of the tweet text, and tweet author id for all tweets in the release

  urls.tab -- tab-separated columns: doc uid and url. Note that the url column
      is empty for documents from older releases for which the url is not
      available; those documents are included here so that the uid column can
      serve as a document list for the package.

9.0 KNOWN ISSUES

Late in the course of data collection for this language, a flaw was
discovered in the process that applied automatic sentence segmentation, which
caused false sentence breaks to be inserted around strings that formed the
content of anchor tags in the original (as harvested) HTML. In general, the
problem affects blog sources (WL) the most, and news agency sources (NW) the
least, owing to the relative likelihood that content authors will make an
effort to treat some portion of a sentence as the content of an anchor tag.

This flaw in the segmentation code has been fixed, and most of the data in
this release has been processed into ltf.xml format using the newer version
of sentence segmentation. (NB: The new version, being automatic, is still not
perfect, and may lead to a slightly higher miss-rate for "true" sentence
boundaries, but on balance, the overall sentence segmentation should be
better than with the earlier version of the process, especially in the WL
genre.)

The segmenter fix was not made until after files had been selected and sent
out for translation, so the English translation files (and various forms of
annotation: named entity, etc.) are based on the earlier (faulty) version of
segmentation. In order to preserve the alignment between English
translations, other annotations, and the source-language data, the newer
segmentation has NOT been applied to this subset of the data.

There is an additional file in the "docs" directory that lists the file-IDs
of the files where the older segmentation logic has been retained (one
file-ID per line):

  docs/odd_sentence_seg_fileids.txt

The files listed here are the ones where the newer segmentation logic would
have produced a different outcome, but the newer logic has not been applied,
because doing so would disrupt the alignment of the corresponding translation.
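The documentation files can be used for simple bookkeeping over the package;
for example, urls.tab supplies the full document list, and
odd_sentence_seg_fileids.txt identifies the documents that keep the older
segmentation. The Python sketch below counts both; it assumes neither file
has a header row and that the paths are given relative to the package root.

  # old_segmentation_docs.py -- minimal sketch over the docs/ files
  def read_first_column(path):
      """Read the first tab-separated column of a file as a list of ids."""
      with open(path, encoding='utf-8') as f:
          return [line.split('\t')[0].strip() for line in f if line.strip()]

  all_docs = set(read_first_column('docs/urls.tab'))
  old_seg  = set(read_first_column('docs/odd_sentence_seg_fileids.txt'))

  print('documents listed in urls.tab:', len(all_docs))
  print('documents retaining the older segmentation:', len(old_seg))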
10.0 Acknowledgements

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions,
findings and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of DARPA.

11.0 Copyright

Portions © 2015 Abbay Media 'The Ethiopian Information Bank', © 2011-2016
Addis Admass, © 2016 Addis Dimts Radio, © 2015-2016 Addis Media, © 2015
Awramba Times, © 2015-2016 ECADF Ethiopian News, © 2015 Ethio Andinet, © 2016
Ethiopian Broadcasting Corporation, © 2016 Ethiopian Press Agency, © 2015
Ethiopia Prosperous, © 2015-2016 Ethiopian Reporter, © 2014, 2016 Ethiopian
Satellite Television (ESAT), © 2016 Fana Broadcasting Corporate (FBC),
© 2013-2016 Golgul, © 2012, 2014-2016 Government of Ethiopia, © 2014, 2016
Gudnew, © 2016 Henock Yared, © 2014 HornAffairs, © 2016 Kal Tube, © 2016
Mahibere Kidusan, © 2016 Mereja.com, © 2014-2016 Sendek NewsPaper, © 2015
Sheger FM 102.1, © 2015 Sodere, © 2016 The Ethiopian News Agency, © 2007 The
Ethiopianer, © 2015-2016 Tigray, © 2016 Walta Information Center, © 2016
Wazema Radio, © 2016 ZAMI, © 2013, 2015-2016 Zehabesha, © 2016 Trustees of
the University of Pennsylvania

12.0 CONTACTS

  Jennifer Tracey    - LORELEI Project Manager
  Stephanie Strassel - LORELEI PI