README FILE FOR LDC CATALOG ID: LDC2024T03

TITLE: LoReHLT Hausa Representative Language Pack

AUTHORS: Jennifer Tracey, Stephanie Strassel, Dave Graff, Jonathan Wright, Song Chen, Neville Ryant, Seth Kulick, Kira Griffitt, Dana Delgado, Michael Arrigo

1.0 Introduction

This corpus provides the complete set of monolingual and parallel text, lexicon, annotations, and tools comprising the LoReHLT Hausa Representative Language Pack. It was developed by the Linguistic Data Consortium, and consists of over 4.4 million words of monolingual text in Hausa, about 900,000 words of which have been translated into English. It also includes about 86,000 Hausa words translated from English text. Over 96,000 words received simple named entity annotation, and about 13,700 words received full entity annotation (including nominals and pronouns); varying subsets also underwent noun-phrase chunking, morphology/POS labeling, and simple semantic annotation. Details about the volume of data for each annotation type are listed in section 3.3 below.

LoReHLT (Low Resource Human Language Technology) was a companion project of the DARPA LORELEI Program (Low Resource Languages for Emergent Incidents), which was concerned with building Human Language Technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. The present package is the result of a pilot effort preceding the main LORELEI collection project; as such, it has a lot in common with the overall structure of other LORELEI language packs, but also some notable differences (mainly involving file name patterns and the types of annotation done).

Linguistic resources for LORELEI include Representative Language Packs for over 2 dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources.
Representative languages are selected to provide broad typological coverage, while Incident Languages are selected to evaluate system performance on a language whose identity is disclosed at the start of the evaluation, and for which no training data has been provided.

For more information about LORELEI language resources, see:

  https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2020-lorelei-language-packs.pdf

2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  ./dtds/
  ./dtds/laf.v1.2.dtd
  ./dtds/llf.v1.6.dtd
  ./dtds/ltf.v1.5.dtd
  ./dtds/psm.v1.0.dtd

  ./docs/ -- contains this README, plus various tables and listings (see section 9 below)
  ./docs/annotation_guidelines/ -- guidelines for all annotation tasks included in this corpus
  ./docs/grammatical_sketch/ -- grammatical sketch of Hausa

  ./tools/ -- see section 8 below for details about tools provided
  ./tools/ltf2txt/
  ./tools/ne_tagger/
  ./tools/sentence_segmenter/
  ./tools/tokenizer_analyzer/
  ./tools/twitter_processing/

  ./data/monolingual_text/zipped/ -- zip archives of ltf and psm files

  ./data/translation/
      from_hau/{hau,eng}/ -- translations from Hausa to English
      from_eng/ -- translations from English to Hausa
          {elicitation,news,phrasebook}/ for each of three types of English data:
              {hau,eng}/
      for each language in each directory, "ltf" and "psm" directories contain corresponding data files

  ./data/annotation/ -- see section 5 below for details about annotation
  ./data/annotation/entity/{simple,full}/
  ./data/annotation/np_chunking/
  ./data/annotation/sem_annotation/
  ./data/annotation/twitter_tokenization/

  ./data/audio/ -- 20 mp4 files containing audio tracks from YouTube videos

  ./data/lexicon/ -- llf.xml lexicon and table of morphological analyses

2.2 File Name Conventions

The file names assigned to individual documents in this corpus provide the following information about the
document -- note that the LoReHLT naming conventions differ from those of the later LORELEI packages:

  Genre     2-letter abbrev.
  Source    3-letter label assigned to data provider
  Language  3-letter abbrev.
  Index#    6-digit numeric assigned to this document
  Date      8-digit numeric: YYYYMMDD (year, month, day)

Those five fields are joined by underscore characters, yielding a 26-character file-ID.

The 2-letter codes used for genre are as follows:

  NW -- news
  RF -- reference (e.g. Wikipedia)
  SN -- social network (Twitter text, YouTube audio)
  WL -- web-log

In the "./data/monolingual_text/zipped/" directory, all documents in a given genre have been placed together into a single zip archive, as follows:

  NW_ALL_HAU.{ltf,psm}.zip
  RF_WKP_HAU.{ltf,psm}.zip (all "reference" docs are from Wikipedia)
  WL_ALL_HAU.{ltf,psm}.zip

The number of data files per zip archive ranges from 698 to 11,676.

3.0 Content Summary

3.1 Monolingual Text

  Genre   #Docs     #Words
  NW     11,676  3,314,573
  RF      1,281     86,617
  WL        698  1,009,093
  SN      2,032     24,866

Note that the SN (Twitter) data cannot be distributed directly by LDC, due to the Twitter Terms of Use. The file "docs/twitter_info.tab" (described in Section 8.2 below) provides the necessary information for users to fetch the particular tweets directly from Twitter. LTF files for all other genres are stored in ./data/monolingual_text/zipped/.

3.2 Parallel Text

  Type     Genre  #Docs   #Words
  ---
  FromEng  EL         2   14,927
  FromEng  NW       198   71,494
  ---
  ToEng    NW     1,781  444,123
  ToEng    SN     2,032   24,855
  ToEng    WL       478  435,801
  ---

Again, because LDC cannot distribute original Twitter data, we present only the English translations for 2,032 Tweets: SN_TWT_HAU_*.ltf.xml files exist under "./data/translation/from_hau/eng/" only.

Note that the SN file inventory was originally organized in groups, such that each group was assigned a distinct 6-digit index number for the 4th field of the file name, and held up to 30 Tweets.
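As an illustration of the file-ID layout from section 2.2, the following Python sketch splits a file name into its five fields. It is not part of the released tools. The names "FILE_ID", "FileId" and "parse_file_id" are hypothetical, and the pattern assumes that source labels consist of upper-case letters or digits. It also accepts the optional 2-digit Tweet suffix used for SN translation files in section 3.2.

```python
import re
from typing import NamedTuple, Optional

# Genre codes as listed in section 2.2 of this README.
GENRES = {"NW": "news", "RF": "reference", "SN": "social network", "WL": "web-log"}

# Assumption: source labels are three upper-case letters or digits (e.g. "VOA").
FILE_ID = re.compile(
    r"^(?P<genre>[A-Z]{2})_(?P<source>[A-Z0-9]{3})_(?P<lang>[A-Z]{3})"
    r"_(?P<index>\d{6})_(?P<date>\d{8})(?:-(?P<tweet>\d{2}))?"
)


class FileId(NamedTuple):
    genre: str
    source: str
    language: str
    index: str            # 6-digit document (or Tweet group) index
    date: str             # YYYYMMDD
    tweet: Optional[str]  # 2-digit suffix on SN translation files, else None


def parse_file_id(name: str) -> FileId:
    """Split the leading file-ID of a file name into its fields."""
    m = FILE_ID.match(name)
    if m is None:
        raise ValueError(f"not a LoReHLT-style file name: {name!r}")
    return FileId(m["genre"], m["source"], m["lang"],
                  m["index"], m["date"], m["tweet"])


fid = parse_file_id("SN_TWT_HAU_007297_20141120-00.eng.ltf.xml")
print(fid.genre, GENRES[fid.genre], fid.index, fid.date, fid.tweet)
```

Anything after the file-ID (such as ".eng.ltf.xml") is ignored, so the same function can be applied to ltf, psm and laf file names alike.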
In order to present each translated Tweet as a separate data file, we have appended an additional 2-digit index number at the end of each file name -- e.g.:

  SN_TWT_HAU_007297_20141120-00.eng.ltf.xml
  SN_TWT_HAU_007297_20141120-01.eng.ltf.xml
  ...
  SN_TWT_HAU_007297_20141120-29.eng.ltf.xml
  SN_TWT_HAU_007298_20141120-00.eng.ltf.xml
  SN_TWT_HAU_007298_20141120-01.eng.ltf.xml
  ...
  SN_TWT_HAU_015681_20150413-15.eng.ltf.xml

Each full tweet is presented as the sole element in one ltf.xml file. There are no paragraph or sentence boundaries in twitter text, so there are no SN_TWT_*.psm.xml files.

3.3 Annotation

  AnnotationType  Genre  #Docs  #Words
  ---
  EntityFull      NW        63  13,797
  ---
  EntitySimp      NW       210  48,791
  EntitySimp      SN     1,234  15,541
  EntitySimp      WL        62  32,654
  ---
  NPChunking      NW        27   4,888
  NPChunking      SN        48     653
  NPChunking      WL         5   1,932
  ---
  SimpleSemantic  NW        32   7,088
  SimpleSemantic  SN        59     844
  SimpleSemantic  WL         5   1,715

3.4 Audio

The goal was to identify amateur recordings related to disaster events that also had coverage in the text data. Annotators searched for suitable data online, selected individual recordings for inclusion in the corpus based on criteria like topic and date range, and did topic annotation on the recordings. No additional annotation was done on this data as it was intended as a supplement to the primary language pack data.

4.0 Data Collection and Parallel Text Creation

Both monolingual text collection and parallel text creation involve a combination of manual and automatic methods. These methods are described in the sections below.

4.1 Monolingual Text Collection

Data is identified for collection by native speaker "data scouts," who search the web for suitable sources, designating individual documents that are in the target language and discuss the topics of interest to the LORELEI program (humanitarian aid and disaster relief). Each document selected for inclusion in the corpus is then harvested, along with the entire website when suitable.
Thus the monolingual text collection contains some documents which have been manually selected and/or reviewed and many others which have been automatically harvested and were not subject to manual review.

4.2 Parallel Text Creation

Parallel text for LORELEI was created using three different methods, and each LORELEI language may have parallel text from one or all of these methods. In addition to translation from each of the LORELEI languages to English, each language pack contains a "core" set of English documents that were translated into each of the LORELEI Representative Languages. These documents consist of news documents, a phrasebook of conversational sentences, and an elicitation corpus of sentences designed to elicit a variety of grammatical structures.

All translations are aligned at the sentence level. For professional and crowdsourced translation, the segments align one-to-one between the source and target language (i.e. segment 1 in the English aligns with segment 1 in the source language). For found parallel text, automatic alignment is performed and a separate alignment file provides information about how the segments in the source and translation are aligned.

Professionally translated data has one translation for each source document, while crowdsourced translations have up to four translations for each source document, designated by A, B, C, or D appended to the file name on the multiple translation versions.

5.0 Annotation

Five types of annotation are present in this corpus:

- Simple Named Entity tags names of persons, organizations, geopolitical entities, and locations (including facilities).
- Full Entity also tags nominal and pronominal mentions of entities.
- Noun Phrase Chunking identifies the positions and extents of noun phrases.
- Simple Semantic Annotation provides light semantic role labeling, capturing acts and states along with their arguments.
- Morphological Segmentation provides a list of word forms segmented into morphemes (found in data/lexicon/tur_wordform_morph_analysis.tab).

Results of the first four annotation types are stored in LAF XML format (see section 7.3 below), with annotations for one document in each XML file. The morphological segmentation is presented as a table with three tab-delimited columns: the word token, its part-of-speech label, and the set of morphological segments separated by spaces.

Details about each of these annotation tasks can be found in docs/annotation_guidelines/.

SPECIAL NOTE ABOUT ANNOTATIONS ON TWITTER DATA:

The LDC cannot redistribute text data from Twitter, and this includes files containing annotation. Where LAF XML and annotation table files have strings of text from other sources, annotations of Twitter data instead have strings with underscores ("_") replacing all non-white-space characters.

Software is included in this release that enables users to download a given list of Tweets (assuming the Tweets are still available online), and apply the same conditioning and reformatting that was done by LDC prior to annotation -- see section 8.2 below for more details on the software.

In order to confirm that your own download and conditioning yields results that match those of the LDC, we provide a set of LTF XML files (one for each annotated Tweet), in which the text content has been modified by replacing each non-white-space character with an underscore ("_"), so that character offsets are preserved for word tokens and spans of annotations. These "placeholder" LTF XML files are in data/annotation/twitter_tokenization/.

6.0 Data Processing and Character Normalization for LORELEI

Most of the content has been harvested from various web sources using an automated system that is driven by manual scouting for relevant material. Some content may have been harvested manually, or by means of ad-hoc scripted methods for sources with unusual attributes.
All harvested content was initially converted from its original HTML form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure.

The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language.

7.0 Overview of XML Data Structures

7.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way, and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element.

For example, in a discussion-forum or web-log page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post. After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.), and the character offsets of the text span comprising its content in the corresponding rsd.txt file.
7.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and the tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are only unique within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data. To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and (we hope) less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags). To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning).

Software is included to convert ltf.xml files to "raw source data" plain text files ("rsd.txt") -- see section 8.1 below. The character offsets used in LTF and LAF xml, and in other types of annotation data, are based on the "rsd.txt" files, which contain just the text that is visible to a person reading the original source, with normalized white-space characters (including line breaks), but without markup of any kind.
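The relationship between TOKEN offsets and the rsd.txt text can be illustrated with a short Python sketch. This is not part of the released tools, and it rests on assumptions that should be checked against dtds/ltf.v1.5.dtd and the actual data files: that the offset attributes are named "start_char" and "end_char", that "end_char" is inclusive, and that the document is wrapped in LCTL_TEXT, DOC and TEXT elements. The sample document and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample; element and attribute names other than SEG, TOKEN and "id"
# are assumptions, not taken from this README.
SAMPLE_LTF = """<LCTL_TEXT>
  <DOC id="EXAMPLE_DOC">
    <TEXT>
      <SEG id="segment-0" start_char="0" end_char="12">
        <TOKEN id="token-0-0" start_char="0" end_char="4">Sannu</TOKEN>
        <TOKEN id="token-0-1" start_char="6" end_char="11">duniya</TOKEN>
        <TOKEN id="token-0-2" start_char="12" end_char="12">!</TOKEN>
      </SEG>
    </TEXT>
  </DOC>
</LCTL_TEXT>"""


def read_tokens(ltf_xml):
    """Return (segment id, token id, start, end, text) for every TOKEN."""
    root = ET.fromstring(ltf_xml)
    rows = []
    for seg in root.iter("SEG"):
        for tok in seg.iter("TOKEN"):
            rows.append((seg.get("id"), tok.get("id"),
                         int(tok.get("start_char")), int(tok.get("end_char")),
                         tok.text or ""))
    return rows


def offsets_consistent(rows, rsd_text):
    """Check that each token's text equals the rsd span it points to
    (assuming inclusive end offsets)."""
    return all(rsd_text[start:end + 1] == text
               for _, _, start, end, text in rows)


rows = read_tokens(SAMPLE_LTF)
print([text for *_, text in rows])
print(offsets_consistent(rows, "Sannu duniya!"))
```

A check of this kind is only meaningful against the exact rsd.txt text, since any change in white-space normalization shifts every later offset.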
7.3 LAF.xml -- Logical Annotation Format Data

The "laf.xml" data format provides a generic structure for presenting annotations on the text content of a given ltf.xml file; see the associated DTD file in the "dtds" directory. Note that each type of annotation (simple named entity, full entity, simple semantic annotation) uses the basic XML elements of LAF in different ways.

7.4 LLF.xml -- LORELEI Lexicon Format Data

The "llf.xml" data format is a simple structure for presenting citation-form words (headwords or lemmas) in Hausa, together with Part-Of-Speech (POS) labels and English glosses. Each ENTRY element contains a unique combination of LEMMA value (citation form in native orthography) and POS value, together with one or more GLOSS elements. Each ENTRY has a unique ID, which is included as part of the unique ID assigned to each GLOSS.

8.0 Software tools included in this release

Each of the software components summarized below contains its own README file or other documentation, which should be consulted for more detailed usage information.

Note that the versions of software provided here are consistent with the original package release to LORELEI project participants in 2015; in later LORELEI releases, software was updated and reorganized to handle various changes in corpus handling and design (e.g. to use a different file name format). This software is being provided in hopes that it will be informative, but with no guarantee as to its usability.

8.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data.
In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or simply open the script in any text viewer to read it. Also, running either script without any command-line arguments will cause it to display a one-line synopsis of its usage, and then exit.

  ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

Special note about Twitter data: as explained in section 5 above, this corpus includes "scrubbed" versions of LTF XML files for individual Tweets, where the original text characters (except for spaces) are replaced by underscores (in data/annotation/twitter_tokenization/), in order to comply with Twitter Terms of Use. Running "ltf2rsd.perl" directly on these "scrubbed" files will yield warnings about MD5 mismatches, which is to be expected, because the MD5 value stored in each Twitter LTF XML file is based on the original text. After using the "ldclib" software (described in the next section) to download and condition Twitter data, the resulting LTF XML files should have both the original text and the matching MD5 values; that process also creates the corresponding rsd.txt files.

8.2 twitter_processing

This directory contains a README file, an executable script written in Ruby, and supporting files (Gemfile and a lib/ directory). Refer to the README file for details on using these scripts.
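To make the MD5 mismatch on "scrubbed" Tweet files (noted in section 8.1) concrete, here is a minimal Python sketch of the underscore replacement. It is not the LDC implementation: the function names are hypothetical, the sample string is made up, Python's notion of white-space (the "\S" pattern) is assumed to match the one used by LDC, and the checksum is assumed to be computed over UTF-8 bytes.

```python
import hashlib
import re


def scrub(text):
    """Replace every non-white-space character with an underscore."""
    return re.sub(r"\S", "_", text)


def md5_hex(text):
    """MD5 of the text, assuming UTF-8 encoding (an assumption of this sketch)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


original = "Sannu duniya! #labarai"   # made-up sample, not corpus data
scrubbed = scrub(original)

print(scrubbed)
# Length and white-space positions are unchanged, so character offsets
# for tokens and annotation spans remain valid.
print(len(scrubbed) == len(original))
# The checksum changes, which is why the stored MD5 (based on the
# original text) cannot match a scrubbed file.
print(md5_hex(scrubbed) == md5_hex(original))
```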
Due to the Twitter Terms of Use, the text content of individual tweets cannot be redistributed by the LDC. As a result, users must download the tweet contents directly from Twitter. The twitter-processing software provided in the tools/ directory enables users to perform the same normalization applied by LDC and ensure that the user's version of the tweet matches the version used by LDC, by verifying that the md5sum of the user-downloaded and processed tweet matches the md5sum provided in the twitter_info.tab file. Users must have a developer account with Twitter in order to download tweets, and the tool does not replace or circumvent the Twitter API for downloading tweets.

The ./docs/twitter_info.tab file provides the twitter download id for each tweet, along with the LORELEI file name assigned to that tweet, the numeric ID of the tweet author, and the md5sum of the processed text from the tweet.

8.3 sentence_segmenter -- apply sentence segmentation to raw text

The Python and Ruby scripts in this directory are used to apply sentence boundary detection to text. Please refer to the README.txt file included with the package.

8.4 ne_tagger -- Named-Entity tagger for Hausa

Please refer to the tools/ne_tagger/README.txt file for information about usage and performance.

8.5 tokenizer_analyzer -- for creating LTF.XML format

There are two README files in this directory to explain the installation and usage of the software:

  tools/tokenizer_analyzer/README.txt
  tools/tokenizer_analyzer/ldclib/README.md

9.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release) contains the following:

audio_info.tab - lists the 20 *.mp4 files in ./data/audio/, showing their channel count, sample rate, duration, and topic(s).

elicitation_template.txt - lists the 2600 English phrases used to create the elicitation portion of the English-to-Hausa translation data.
The file is organized as a sequence of blank-line-separated "paragraphs", with each paragraph containing the segment-ID (i.e. "segment-0" .. "segment-2599"), the English sentence to be translated into the target language, and supplemental context information (if any) to guide the translation.

hau_morph_analysis_files.txt - lists the names of the 40 files that underwent manual morphological analysis and part-of-speech tagging, with humans correcting automatic analysis. The morphological and POS annotations are included as attributes in the LTF data format as described above. Full details on the tagset used are found in the annotation_guidelines directory. Note that there may be discrepancies between this tagset and the categories described in the grammatical sketch.

source_codes.txt - a four-column table listing the distinct 3-letter codes that identify data sources: values from the 2nd field of data file names (e.g. "VOA") appear in the 2nd column of this table. For each source, the first column shows the genre; the 3rd and 4th columns contain the full name and base URL of the source, if available (otherwise "n/a"). Because data collection for this package was done as a pilot project for LORELEI, base URLs were retained only for NW sources (not for DF or WL), and full URLs for individual document files were not recorded (so unlike the later LORELEI language packs, there is no "urls.tab" file).

twitter_info.tab - contains tab-separated columns: doc uid, tweet id, normalized md5 of the tweet text, and tweet author id for all tweets designated for use in this language pack.

In addition, the grammatical sketch and annotation guidelines mentioned in earlier sections of this README are found in this directory.
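For users who want to check downloaded Tweets outside the provided Ruby tooling, the following Python sketch shows one way the twitter_info.tab table might be read and used. It is an illustration only: the function names are hypothetical, the column order is taken from the description above (doc uid, tweet id, md5, author id), rows whose third field is not a 32-character string are skipped on the assumption that they are header lines, and the MD5 is assumed to be computed over the UTF-8 bytes of the processed text. A matching checksum can only be expected after the same normalization that the twitter_processing software applies.

```python
import csv
import hashlib


def load_twitter_info(handle):
    """Read tab-separated rows into a dict keyed by doc uid."""
    info = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 4:
            continue                      # blank or malformed line
        doc_uid, tweet_id, md5, author_id = row[:4]
        if len(md5) != 32:
            continue                      # assumed header line
        info[doc_uid] = {"tweet_id": tweet_id,
                         "md5": md5.lower(),
                         "author_id": author_id}
    return info


def text_matches(info, doc_uid, processed_text):
    """Compare the MD5 of already-normalized Tweet text with the table."""
    digest = hashlib.md5(processed_text.encode("utf-8")).hexdigest()
    return info[doc_uid]["md5"] == digest
```

Typical use would be to open the table with open("docs/twitter_info.tab", encoding="utf-8", newline="") and pass the file handle to load_twitter_info.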
10.0 Acknowledgements

The authors would like to acknowledge the following contributors to this corpus: Brian Gainor, Ann Bies, Justin Mott, Neil Kuster, University of Maryland Applied Research Laboratory for Intelligence and Security (ARLIS), formerly UMD Center for Advanced Study of Language (CASL), and our team of Hausa annotators.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

12.0 Copyright

13.0 CONTACTS

  Jennifer Tracey - LORELEI Project Manager
  Stephanie Strassel - LORELEI PI