README FILE FOR LDC CATALOG ID: LDC2024T01

TITLE: LORELEI Farsi Representative Language Pack

AUTHORS: Jennifer Tracey, Stephanie Strassel, Dave Graff, Jonathan Wright,
Song Chen, Neville Ryant, Seth Kulick, Kira Griffitt, Dana Delgado,
Michael Arrigo

1.0 Introduction

This corpus was developed by the Linguistic Data Consortium for the DARPA
LORELEI Program and consists of about 250 million words of monolingual text
in Farsi, over 391,000 words of which have been translated into English,
with another 751,000 words for which existing parallel English text was
found. It also includes about 120,000 Farsi words translated from English
text. About 75,000 words are annotated for simple named entities; over
22,000 words are annotated for full entity (including nominals and
pronouns), entity linking, simple semantic annotation, and situation
frames. Details about the volume of data for each annotation type are
listed in section 3.3 below.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource
languages in the context of emergent situations like natural disasters or
disease outbreaks. Linguistic resources for LORELEI include Representative
Language Packs for over two dozen low resource languages, comprising data,
annotations, basic natural language processing tools, lexicons and
grammatical resources. Representative languages are selected to provide
broad typological coverage, while Incident Languages are selected to
evaluate system performance on a language whose identity is disclosed at
the start of the evaluation, and for which no training data has been
provided. This corpus provides the complete set of monolingual and parallel
text, lexicon, annotations, and tools comprising the LORELEI Farsi
Representative Language Pack.
For more information about LORELEI language resources, see:
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2020-lorelei-language-packs.pdf

2.0 Corpus Organization

2.1 Directory Structure

The directory structure and contents of the package are summarized below;
paths shown are relative to the base (root) directory of the package:

  ./dtds/
  ./dtds/laf.v1.2.dtd
  ./dtds/llf.v1.6.dtd
  ./dtds/ltf.v1.5.dtd
  ./dtds/psm.v1.0.dtd
  ./dtds/sentence_alignment.v1.0.dtd
  ./dtds/cstrans_tab.v1.0.dtd

  ./docs/ -- various tables and listings (see section 9 below)
  ./docs/README.txt -- this file
  ./docs/annotation_guidelines/ -- guidelines for all annotation tasks
      included in this corpus
  ./docs/grammatical_sketch/ -- grammatical sketch of Farsi
  ./docs/cstrans_tab -- supplemental data regarding crowd-sourced
      translations

  ./tools/ -- see section 8 below for details about tools provided
  ./tools/ldclib/
  ./tools/ltf2txt/
  ./tools/sent_seg/
  ./tools/fas/ne_tagger/
  ./tools/fas/transliterator/
  ./tools/tokenization_parameters.v5.0.yaml

  ./data/monolingual_text/zipped/ -- zip-archive files containing
      monolingual "ltf" and "psm" data

  ./data/translation/
  ./data/translation/found/{fas,eng,sentence_alignment}/ -- found parallel
      text, with sentence alignments between the Farsi and English
      documents
  ./data/translation/from_fas/{fas,eng}/ -- translations from Farsi to
      English
  ./data/translation/from_eng/{elicitation,news,phrasebook}/{fas,eng}/ --
      translations from English to Farsi, for each of three types of
      English data
  (For each language in each translation directory, "ltf" and "psm"
  subdirectories contain the corresponding data files.)

  ./data/annotation/ -- see section 5 below for details about annotation
  ./data/annotation/entity/{simple,full}/
  ./data/annotation/np_chunking/
  ./data/annotation/sem_annotation/
  ./data/annotation/situation_frame/{issues,mentions,needs}/
  ./data/annotation/twitter_tokenization/

  ./data/lexicon/ -- lexicon (llf.xml) and morphological analysis (.tab)
      files; see section 7.7 for details about the morphological analysis
      provided by the Johns Hopkins University Unimorph project

2.2 File Name Conventions

The file names assigned to individual documents in this corpus provide the
following information about the document:

  Language   3-letter abbreviation
  Genre      2-letter abbreviation
  Source     6-digit numeric ID assigned to the data provider
  Date       8-digit numeric (YYYYMMDD: year, month, day)
  Global-ID  9-character alphanumeric ID assigned to this document

Those five fields are joined by underscore characters, yielding a
32-character file-ID. Three portions of the document file-ID are used to
set the name of the zip file that holds the document: the Language and
Genre fields, and the first 6 characters of the Global-ID.

The 2-letter codes used for genre are as follows:

  DF -- discussion forum
  NW -- news
  RF -- reference (e.g. Wikipedia)
  SN -- social network (Twitter)
  WL -- web-log

3.0 Content Summary

3.1 Monolingual Text

  Genre    #Docs       #Words
  DF     298,765  219,193,286
  NW      54,156   18,622,450
  RF           9        5,928
  WL      31,141   11,426,571
  SN       5,584       96,412

Note that the SN (Twitter) data cannot be distributed directly by LDC, due
to the Twitter Terms of Use. The file "docs/twitter_info.tab" (described in
section 8.2 below) provides the information needed for users to fetch the
particular tweets directly from Twitter. LTF files for all other genres are
stored in ./data/monolingual_text/zipped/.
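To make the file naming and zip packaging described in sections 2.1 and 2.2
concrete, here is a minimal Python sketch (not one of the LDC-provided
tools). The zip-archive name below actually appears in this corpus, but the
member paths inside the archive are an assumption; adjust as needed:

  import zipfile

  def parse_file_id(file_id):
      # Split a 32-character LORELEI file-ID, e.g.
      # "FAS_DF_002090_20140609_G00227T8E", into its five fields.
      lang, genre, source, date, global_id = file_id.split("_")
      return {"language": lang,        # 3-letter language code (FAS)
              "genre": genre,          # 2-letter genre code (DF, NW, ...)
              "source": source,        # 6-digit data-provider ID
              "date": date,            # YYYYMMDD
              "global_id": global_id}  # 9-char alphanumeric document ID

  # List the documents stored in one monolingual zip archive.
  zip_path = "data/monolingual_text/zipped/FAS_DF_G00227.ltf.zip"
  with zipfile.ZipFile(zip_path) as zf:
      for member in zf.namelist():
          if member.endswith(".ltf.xml"):
              name = member.rsplit("/", 1)[-1]
              print(parse_file_id(name[:-len(".ltf.xml")]))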
3.2 Parallel Text

  Type     Genre  #Docs   #Words
  Found    NW     1,992  751,048
  FromEng  EL         2   25,502
  FromEng  NW       190   90,383
  ToEng    DF       415  299,965
  ToEng    NW       114   43,994
  ToEng    WL        59   45,370

3.3 Annotation

  AnnotType       Genre  #Docs  #Words
  SimpleSemantic  DF        12   4,235
  SimpleSemantic  NW        33  11,747
  SimpleSemantic  SN       173   3,036
  SimpleSemantic  WL         8   3,028
  SituationFrame  DF         6   2,527
  SituationFrame  NW        32  11,496
  SituationFrame  SN       134   2,375
  SituationFrame  WL        14   5,750
  EntityFull      DF        15   6,803
  EntityFull      NW        30  10,507
  EntityFull      SN       175   3,083
  EntityFull      WL         8   3,299
  EntitySimp      DF        51  17,096
  EntitySimp      NW       103  37,298
  EntitySimp      SN       479   8,434
  EntitySimp      WL        32  12,230
  NPChunking      DF         5   1,544
  NPChunking      NW        14   5,093
  NPChunking      SN        59     949
  NPChunking      WL         4   1,549

4.0 Data Collection and Parallel Text Creation

Both monolingual text collection and parallel text creation involve a
combination of manual and automatic methods. These methods are described in
the sections below.

4.1 Monolingual Text Collection

Data is identified for collection by native speaker "data scouts," who
search the web for suitable sources, designating individual documents that
are in the target language and discuss topics of interest to the LORELEI
program (humanitarian aid and disaster relief). Each document selected for
inclusion in the corpus is then harvested, along with the entire website
when suitable. Thus the monolingual text collection contains some documents
which have been manually selected and/or reviewed, and many others which
have been automatically harvested and were not subject to manual review.

4.2 Parallel Text Creation

Parallel text for LORELEI was created using three different methods, and
each LORELEI language may have parallel text from one or all of these
methods. In addition to translation from each of the LORELEI languages to
English, each language pack contains a "core" set of English documents that
were translated into each of the LORELEI Representative Languages. These
documents consist of news documents, a phrasebook of conversational
sentences, and an elicitation corpus of sentences designed to elicit a
variety of grammatical structures.

All translations are aligned at the sentence level. For professional and
crowdsourced translation, the segments align one-to-one between the source
and target language (i.e. segment 1 in the English aligns with segment 1 in
the source language). For found parallel text, automatic alignment is
performed, and a separate alignment file provides information about how the
segments in the source and translation are aligned (see the sketch at the
end of this section). Professionally translated data has one translation
for each source document, while crowdsourced translations have up to four
translations for each source document, designated by A, B, C, or D appended
to the file names of the multiple translation versions.
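Because the found-text alignments live in separate files, downstream use
means joining three files: the Farsi ltf, the English ltf, and the
alignment file. The sketch below shows one plausible way to read an
alignment file; the element and attribute names used here ("alignment",
"source", "translation", "segments") are illustrative assumptions, not a
documented schema -- dtds/sentence_alignment.v1.0.dtd is authoritative:

  import xml.etree.ElementTree as ET

  def read_alignments(path):
      # Yield (source_segment_ids, translation_segment_ids) pairs.
      # Element/attribute names are assumptions; check
      # dtds/sentence_alignment.v1.0.dtd for the real schema. Found-text
      # alignments need not be one-to-one, so each side is a list.
      root = ET.parse(path).getroot()
      for alignment in root.iter("alignment"):
          source = alignment.find("source")
          translation = alignment.find("translation")
          yield (source.get("segments", "").split(),
                 translation.get("segments", "").split())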
5.0 Annotation

Six types of annotation are present in this corpus:

- Simple Named Entity tags names of persons, organizations, geopolitical
  entities, and locations (including facilities).
- Full Entity also tags nominal and pronominal mentions of entities.
- Entity Discovery and Linking provides cross-document coreference of named
  entities via linking to an external knowledge base (the knowledge base
  used for LORELEI is released separately as LDC2020T10).
- Noun Phrase Chunking identifies the positions and extents of noun
  phrases.
- Simple Semantic Annotation provides light semantic role labeling,
  capturing acts and states along with their arguments.
- Situation Frame annotation labels the presence of needs and issues
  related to emergent incidents such as natural disasters (e.g. food need,
  civil unrest), along with information such as location, urgency, and
  entities involved in resolving the needs.

Details about each of these annotation tasks can be found in
docs/annotation_guidelines/.

SPECIAL NOTE ABOUT ANNOTATIONS ON TWITTER DATA:

The LDC cannot redistribute text data from Twitter, and this includes files
containing annotation. Where LAF XML and annotation table files for other
sources contain strings of text, the files for Twitter data instead have
strings in which every non-white-space character is replaced by an
underscore ("_"). Software is included in this release that enables users
to download a given list of tweets (assuming the tweets are still available
online) and apply the same conditioning and reformatting that was done by
LDC prior to annotation; see section 8.2 below (ldclib) for more details on
the software. In order to confirm that your own download and conditioning
yields results that match those of the LDC, we provide a set of LTF XML
files (one for each annotated tweet) in which the text content has been
modified by replacing each non-white-space character with an underscore
("_"), so that character offsets are preserved for word tokens and spans of
annotations. These "placeholder" LTF XML files are in
data/annotation/twitter_tokenization/.

6.0 Data Processing and Character Normalization for LORELEI

Most of the content has been harvested from various web sources using an
automated system that is driven by manual scouting for relevant material.
Some content may have been harvested manually, or by means of ad-hoc
scripted methods for sources with unusual attributes. All harvested content
was initially converted from its original HTML form into a relatively
uniform XML format; this stage of conversion eliminated irrelevant content
(menus, ads, headers, footers, etc.) and placed the content of interest
into a simplified, consistent markup structure. The "homogenized" XML
format then served as input for the creation of a reference "raw source
data" (rsd) plain text form of the web page content; at this stage, the
text was also conditioned to normalize white-space characters, and to apply
transliteration and/or other character normalization, as appropriate to the
given language.

7.0 Overview of XML Data Structures

7.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of
tags needed to represent the structure of the relevant text as seen by the
human web-page reader. When the text content of the XML file is extracted
to create the "rsd" format (which contains no markup at all), the markup
structure is preserved in a separate "primary source markup" (psm.xml)
file, which enumerates the structural tags in a uniform way and indicates,
by means of character offsets into the rsd.txt file, the spans of text
contained within each structural markup element. For example, in a
discussion-forum or web-log page, there would be a division of content into
the discrete "posts" that make up the given thread, along with "quote"
regions and paragraph breaks within each post.

After the HTML has been reduced to uniform XML, and the tags and text of
the latter format have been separated, information about each structural
tag is kept in a psm.xml file, preserving the type of each relevant
structural element, along with its essential attributes ("post_author",
"date_time", etc.) and the character offsets of the text span comprising
its content in the corresponding rsd.txt file.
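As a minimal illustration of how psm.xml offsets map back onto the rsd.txt
text, the Python sketch below pairs the two files and recovers the text
span of each structural element. It assumes each structural tag is recorded
as a "string" element carrying "type", "begin_offset", and "char_length"
attributes; check dtds/psm.v1.0.dtd for the authoritative names:

  import xml.etree.ElementTree as ET

  def psm_spans(psm_path, rsd_path):
      # Yield (element_type, text_span) pairs; offsets index into the
      # rsd.txt character stream. Element/attribute names ("string",
      # "type", "begin_offset", "char_length") are assumptions here;
      # see dtds/psm.v1.0.dtd for the real schema.
      with open(rsd_path, encoding="utf-8") as f:
          rsd = f.read()
      for elem in ET.parse(psm_path).getroot().iter("string"):
          begin = int(elem.get("begin_offset"))
          length = int(elem.get("char_length"))
          yield elem.get("type"), rsd[begin:begin + length]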
7.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully
segmented and tokenized version of the text content for a given web page.
Segments (sentences) and tokens (words) are marked off by XML tags (SEG and
TOKEN), with "id" attributes (which are unique only within a given XML
file) and character offset attributes relative to the corresponding rsd.txt
file; TOKEN tags have additional attributes describing the nature of the
given word token.

The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly by
suitable punctuation in the original source data. To the extent that
sentence boundaries cannot be accurately detected (due to variability or
ambiguity in the source data), the segmentation process will tend to err
more often on the side of missing actual sentence boundaries, and (we hope)
less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play
particular roles in web-based text (e.g. URLs, email addresses and
hashtags). To the extent that word boundaries are not explicitly marked in
the source text, the LTF tokenization is intended to divide the raw-text
character stream into units that correspond to "words" in the linguistic
sense (i.e. basic units of lexical meaning).

Software is included to convert ltf.xml files to "raw source data" plain
text files ("rsd.txt"); see section 8.1 below. The character offsets used
in LTF and LAF xml, and in other types of annotation data, are based on the
"rsd.txt" files, which contain just the text that is visible to a person
reading the original source, with normalized white-space characters
(including line breaks), but without markup of any kind.

7.3 LAF.xml -- Logical Annotation Format Data

The "laf.xml" data format provides a generic structure for presenting
annotations on the text content of a given ltf.xml file; see the associated
DTD file in the "dtds" directory. Note that each type of annotation (simple
named entity, full entity, NP-chunking, simple semantic annotation) uses
the basic XML elements of LAF in different ways.

7.4 LLF.xml -- LORELEI Lexicon Format Data

The "llf.xml" data format is a simple structure for presenting
citation-form words (headwords or lemmas) in Farsi, together with
part-of-speech (POS) labels and English glosses. Each ENTRY element
contains a unique combination of LEMMA value (citation form in native
orthography) and POS value, together with one or more GLOSS elements. Each
ENTRY has a unique ID, which is included as part of the unique ID assigned
to each GLOSS.
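As an illustration of the LLF structure just described, here is a small
Python sketch that loads llf.xml entries into plain dictionaries. The
ENTRY, LEMMA, POS, and GLOSS names follow section 7.4; whether LEMMA and
POS appear as child elements (as assumed here) or as attributes should be
verified against dtds/llf.v1.6.dtd:

  import xml.etree.ElementTree as ET

  def load_lexicon(llf_path):
      # Build one record per ENTRY; element names follow section 7.4,
      # but verify the exact nesting against dtds/llf.v1.6.dtd.
      entries = []
      for entry in ET.parse(llf_path).getroot().iter("ENTRY"):
          entries.append({
              "id": entry.get("id"),
              "lemma": entry.findtext("LEMMA"),
              "pos": entry.findtext("POS"),
              "glosses": [g.text for g in entry.iter("GLOSS")],
          })
      return entries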
7.5 Situation Frame Annotation Tables

Situation frame annotation consists of three parts, each presented as a
separate tab-delimited file: entities, needs, and issues. The details of
each table are described below.

Entities, mentions, need frames, and issue frames all have IDs that follow
a standard schema consisting of a prefix designating the type of ID ('Ent'
for entities, 'Men' for mentions, and 'Frame' for both need and issue
frames), an alphanumeric string identifying the annotation "kit", and a
numeric string uniquely identifying the specific entity, mention, or frame
within the document.

7.5.1 Mentions

The grouping of entity mentions into "selectable entities" for situation
frame annotation is provided in the mentions/ subdirectory. The table has 8
columns with the following headers and descriptions:

  column 1: doc_id -- doc ID of the source file for the annotation
  column 2: entity_id -- unique identifier for each grouped entity
  column 3: mention_id -- unique identifier for each entity mention
  column 4: entity_type -- one of PER, ORG, GPE, LOC
  column 5: mention_status -- 'representative' or 'extra'; representative
            mentions are the ones which have been chosen by the annotator
            as the representative name for that entity. Each entity has
            exactly one representative mention.
  column 6: start_char -- character offset for the start of the mention
  column 7: end_char -- character offset for the end of the mention
  column 8: mention_text -- mention string

7.5.2 Needs

Annotation of need frames is provided in the needs/ subdirectory. Each row
in the table represents a need frame in the annotated document. The table
has 13 columns with the following headers and descriptions (a reader sketch
follows this list):

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of the source file for the annotation
  column 3: frame_id -- unique identifier for each frame
  column 4: frame_type -- 'need'
  column 5: need_type -- exactly one of 'evac' (evacuation), 'food' (food
            supply), 'search' (search/rescue), 'utils' (utilities, energy,
            or sanitation), 'infra' (infrastructure), 'med' (medical
            assistance), 'shelter' (shelter), or 'water' (water supply)
  column 6: place_id -- entity ID of the LOC or GPE entity identified as
            the place associated with the need frame; only one place value
            per need frame, must match one of the entity IDs in the
            corresponding ent_output.tsv or be 'none' (indicating no place
            was named)
  column 7: proxy_status -- 'True' or 'False'
  column 8: need_status -- 'current', 'future' (future only), or 'past'
            (past only)
  column 9: urgency_status -- 'True' (urgent) or 'False' (not urgent)
  column 10: resolution_status -- 'sufficient' or 'insufficient'
             (insufficient / unknown sufficiency)
  column 11: reported_by -- entity ID of one or more entities reporting
             the need; multiple values are comma-separated, must match
             entity IDs in the corresponding ent_output.tsv or be 'none'
  column 12: resolved_by -- entity ID of one or more entities resolving
             the need; multiple values are comma-separated, must match
             entity IDs in the corresponding ent_output.tsv or be 'none'
  column 13: description -- string of text entered by the annotator as a
             memory aid during annotation; no requirements for content or
             language; may be 'none'
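Since the needs table is tab-delimited with an initial header line, it can
be read with a standard TSV reader. The following Python sketch (the file
path is illustrative, not an actual file name) loads need frames and splits
the multi-valued reported_by and resolved_by fields:

  import csv
  from collections import Counter

  def read_needs(path):
      # Read a needs table into one dict per need frame, keyed by the
      # column headers from section 7.5.2. Multi-valued fields are
      # comma-separated per the descriptions above.
      with open(path, encoding="utf-8", newline="") as f:
          for row in csv.DictReader(f, delimiter="\t"):
              row["reported_by"] = row["reported_by"].split(",")
              row["resolved_by"] = row["resolved_by"].split(",")
              yield row

  # Example: tally need types in one file (path is illustrative).
  counts = Counter(
      row["need_type"]
      for row in read_needs("data/annotation/situation_frame/needs/example.tab"))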
7.5.3 Issues

Annotation of issue frames is provided in the issues/ subdirectory. Each
row in the table represents an issue frame in the annotated document. The
table has 9 columns with the following headers and descriptions:

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of the source file for the annotation
  column 3: frame_id -- unique identifier for each frame
  column 4: frame_type -- 'issue'
  column 5: issue_type -- exactly one of 'regimechange' (regime change),
            'crimeviolence' (civil unrest or widespread crime), or
            'terrorism' (terrorism or other extreme violence)
  column 6: place_id -- entity ID of the LOC or GPE entity identified as
            the place associated with the issue frame; only one place value
            per issue frame, must match one of the entity IDs in the
            corresponding ent_output.tsv or be 'none'
  column 7: proxy_status -- 'True' or 'False'
  column 8: issue_status -- 'current' or 'not_current'
  column 9: description -- string of text entered by the annotator as a
            memory aid during annotation; no requirements for content or
            language; may be 'none'

7.6 EDL Table

The "data/annotation/entity/" directory contains the file "fas_edl.tab",
which has an initial "header" line of column names followed by data rows
with 8 columns per row. The following shows the column headings and a
sample value for each column:

  column 1: system_run_id  LDC
  column 2: mention_id     Men-NW_AFP_ENG_0012_20030419.fas-43
  column 3: mention_text   ایالت راجستان
  column 4: extents        NW_AFP_ENG_0012_20030419.fas:598-610
  column 5: kb_id          1258899
  column 6: entity_type    GPE
  column 7: mention_type   NAM
  column 8: confidence     1.0

When column 5 is fully numeric, it refers to a numbered entity in the
Reference Knowledge Base (distributed separately as LDC2020T10). Note that
a given mention may be ambiguous as to the particular KB element it
represents; in this case, two or more numeric KB_ID values will appear in
column 5, separated by the vertical-bar character (|). When column 5
consists of "NIL" plus digits, it refers to an entity that is not present
in the Knowledge Base; such a NIL label is used consistently for all
mentions of the particular entity.

7.7 Morphological Analysis Table

The file data/lexicon/fas_morph_analysis.v1.tab contains 12 columns, as
follows:

  column 1: lemid -- numeric lemma identifier
  column 2: wrdid -- numeric word-form identifier
  column 3: jhuid -- numeric analysis identifier (unique to each row)
  column 4: pos -- "macro" part-of-speech label (e.g. "VERB")
  column 5: cit -- citation form of the lemma
  column 6: orth -- orthography of the word-form
  column 7: morph -- detailed POS labeling with segmentation
  column 8: segs -- segmented orthography of the word-form
  column 9: tier -- "tier#" classifier
  column 10: seq -- "ranking" for this analysis
  column 11: hgloss -- "human readable" gloss
  column 12: mgloss -- "machine readable" gloss

The "lemid" value can be used to look up the given lemma in the "llf.xml"
file, by matching the numeric part of the 'id="..."' value of the "ENTRY"
elements.
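That lookup can be sketched in Python as follows: build an index from the
numeric part of each ENTRY id in llf.xml, then resolve column 1 (lemid) of
each .tab row against it. The assumptions that the numeric part is the
trailing digit run of the id string, and that the .tab file has no header
line, are illustrative guesses to verify against the data:

  import re
  import xml.etree.ElementTree as ET

  def entries_by_numeric_id(llf_path):
      # Map the numeric part of each ENTRY id to its XML element.
      # Assumes the numeric part is the trailing digit run of the id.
      index = {}
      for entry in ET.parse(llf_path).getroot().iter("ENTRY"):
          match = re.search(r"(\d+)$", entry.get("id", ""))
          if match:
              index[match.group(1)] = entry
      return index

  def analyses_with_lemmas(tab_path, llf_path):
      # Pair each morphological analysis row with its llf.xml entry
      # via column 1 (lemid). Skip a header line here if one is present.
      index = entries_by_numeric_id(llf_path)
      with open(tab_path, encoding="utf-8") as f:
          for line in f:
              cols = line.rstrip("\n").split("\t")
              yield cols, index.get(cols[0])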
10.0 Known Issues

One ltf.xml file in the monolingual text collection contains the Unicode
"non-character" code point U+FDEF (21 occurrences in "SEG" elements, and 21
occurrences in corresponding "TOKEN" elements):

  zip file:  FAS_DF_G00227.ltf.zip
  data file: FAS_DF_002090_20140609_G00227T8E.ltf.xml

In some processing environments, this ltf.xml file may trigger error
messages like the following:

  Unicode non-character U+FDEF is illegal for open interchange ...

The character in question appears at the beginning of 21 segments in the
affected file, suggesting that it was intended as a bullet-point or similar
symbol. Apart from (possibly) causing error messages like the one shown
above, the presence of this character does not seem to cause any problems
for using the data; the "ltf2rsd" extraction process (using one of the
"ltf2txt" scripts provided in the "tools" directory) produces the intended
character stream, matching the expected size and MD5 checksum of the
original raw data.

11.0 Acknowledgements

The authors would like to acknowledge the following contributors to this
corpus: Brian Gainor, Ann Bies, Justin Mott, Neil Kuster, the University of
Maryland Applied Research Laboratory for Intelligence and Security (ARLIS),
formerly the UMD Center for Advanced Study of Language (CASL), and our team
of Farsi annotators.

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions,
findings and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of DARPA.

12.0 Copyright

Portions © 2014 1394 Milestone Middle East Company (Limited), © 2016 Aftab
News, © 2002-2007, 2009-2010 Agence France Presse, © 2016 Alalam.ir,
© 2013 Al Arabiya, © 2000 American Broadcasting Company, © 2012 Areas
Database, hawzah.net, © 2012, 2014-2016 asriran.com, © 2016 Basij News,
© 2011-2016 BBC, © 2016 Bohran News Agency, © 2000 Cable News Network LP,
LLLP, © 2008 Central News Agency (Taiwan), © 1989 Dow Jones & Company,
Inc., © 2016 Drop, © 2016 Entekhab News Agency, © 2014, 2016 Ettelaat
Network, © 2013-2016 euronews, © 2016 Farda News, © 2016 Fars News Agency,
© 2012-2013, 2016 Geological Survey of Iran, © 2012-2016 ghasemjafari.ir,
© 2007-2016 Global Voices, © 2015 Gulf Magazine, © 2012-2015 Iran's
Islamic Republic News Agency, © 2014, 2016 Islamic Republic of Iran
Broadcasting, © 2016 isna.ir, © 2015 Joya, © 2016 Khaama Press (KP)/
Afghan News Agency, © 2015-2016 Khabaronline News Agency, © 2005 Los
Angeles Times - Washington Post News Service, Inc., © 2014 magiran.com,
© 2016 Mehr News Agency (www.mehrnews.com), © 2000 National Broadcasting
Company, Inc., © 2016 National ID 10103955180 Sun Network, © 2016 Negahe
Hasti News, © 1999, 2005, 2006, 2010 New York Times, © 2016 News Agency
Islamic Azad University - Anna, © 2016 News - Analytical Sfyraflak, © 2016
News of the Institute for Citizen's Rights, © 2016 News path, © 2013
Oatmeal online, © 2016 Pars Media News, © 2013 parsinews.ir, © 2016 Parto
Tech System, © 2015 Persian News Network, © 2015 Peste, © 2016 Pezeshk.us,
© 2011-2013, 2015-2016 PressTV, © 2000 Public Radio International, © 2015
Radiant With You, © 2016 Radio Liberty, © 2010, 2016 Radio Time,
© 2014-2015 Salamaneh, © 2016 Shafaqna.com, © 2016 SID, © 2016 Sinapress
Science and Culture, © 2012, 2015 Sputnik, © 2015-2016 tabnak.ir, © 2016
Tasnim News Agency, © 2003, 2005-2008, 2010 The Associated Press, © 2016
The World Economy, © 2016 TKG: A public media project of DHSA, © 2003,
2005-2008 Xinhua News Agency, © 2016, 2021 Trustees of the University of
Pennsylvania

13.0 Contacts

Stephanie Strassel - LORELEI PI