README FILE FOR LDC CATALOG ID: LDC2025T16

TITLE: LORELEI Ilocano Incident Language Pack

AUTHORS: Jennifer Tracey, Dave Graff, Stephanie Strassel, Michael Arrigo, Jonathan Wright, Ann Bies

1.0 Introduction

This corpus contains all the text data, annotations and supplemental resources for the Ilocano language that were used in the DARPA LORELEI / LoReHLT 2019 Evaluation, which was conducted by NIST in August of that year. Detailed information about the corpus content is provided in section 3 for each of the partitions ("sets") in the corpus. Combining all sets, the corpus contains over 8.9 million words of monolingual text in Ilocano and 3.3 million words of monolingual text in English, including 3.2 million words of parallel Ilocano-English text, and a total of 3 million words (1.7m English, 1.3m Ilocano) annotated for Entity Discovery and Linking and Situation Frames.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building Human Language Technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages are selected to provide broad typological coverage, while Incident Languages are selected to evaluate system performance on a language whose identity is disclosed at the start of the evaluation, and for which no training data has been provided. This package comprises all of the resources and test set references for Ilocano, which was one of the Program's Incident Languages.
The evaluation protocol is based on a scenario in which some unforeseen event (the "incident") triggers a need for humanitarian and logistical support in a region where the predominant language (the "incident language") is one that has received little or no attention as yet in NLP research. The objective for evaluation participants is to provide NLP solutions, including information extraction and machine translation, based only on limited resources and with very little time for development. For more information about LORELEI language resources, see https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2016-lorelei-language-packs.pdf. Each incident language pack has one or more focal incidents (a natural disaster or other event which might trigger humanitarian needs). To support the evaluation scenario, the evaluation package contents are divided into the following subsets: set0 : "pre-incident" text data and reference resources for the language, including monolingual text, dictionaries, grammars, and parallel or comparable text (in English and the incident language); monolingual and parallel data in this set includes documents published prior to the beginning of the earliest focal incident and/or reference materials for which publication date is not relevant, such as religious materials setE : "post-incident" text data that forms the basis for scoring NLP system performance (using the scoring protocol and software developed by NIST); set E consists of monolingual text, along with reference translations and annotations setS : "post-incident" text data in English, including information that pertains to the incident itself; this was made available to systems after the initial set of scorable outputs had been submitted set1 : supplemental "post-incident" text data, made available after the initial set of scorable outputs had been submitted Each subset is presented as a directory within the data folder at the top-level of the release package. 
Tools for data processing are provided as part of set0 only, but are applicable to all sets.

2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  ./docs/README.txt -- this file
  ./data/set0/
  ./data/set0/tools/ -- software for data file format conversion
  ./data/set0/data/ -- monolingual and parallel text directories
  ./data/set0/dtds/ -- DTDs for all .xml data formats
  ./data/set0/docs/ -- lexical and grammatical resources, information about various set0 components and properties
  ./data/set1/data/ -- monolingual text
  ./data/set1/docs/ -- information about various set1 components and properties
  ./data/setS/data/ -- monolingual text
  ./data/setS/docs/ -- information about various setS components and properties
  ./data/setE/data/monolingual_text/ -- monolingual text directory
  ./data/setE/data/annotation/
      il12_edl.tab -- table of Entity-Detection-Linking annotations
      situation_frame/ -- subdirectories for entity mentions, needs, issues and sentiments tables
  ./data/setE/data/translation/
      eng/
          ltf/ -- ltf.xml files
          psm/ -- psm.xml files
      il12/
          ltf/ -- ltf.xml files
          psm/ -- psm.xml files
  ./data/setE/docs/ -- information about various setE components and properties

2.2 File Name Conventions

All monolingual text documents are presented as distinct files with unique file names. For convenience, each file name provides a consistent set of information about the content of the file via a set of fixed-width fields, as follows:

  - Language (3 letters)
  - Genre (2 letters)
  - Source (6-digit numeric)
  - Date (8-digit numeric)
  - Unique Index Number (9 alpha-numeric characters)

The language field for all Ilocano documents uses "IL12" instead of the ISO code for the language, as the practice in LORELEI was to refer to incident languages by numeric identifiers to preserve the secrecy of the language name until the start of the evaluation.
The date field for news reports represents the date of original publication for the report. Where possible, discussion forum material uses the date when a given discussion thread was initiated. When date information is not available or meaningful for a given document, the date field will reflect (roughly) the time at which the content was initially collected by the LDC, and may be left "incomplete" by setting the "day" field (last two digits) to zero (e.g. "20140900"). Files containing translations from a source language have the source language identified in the "Language Code" field of the file name, and the translation language as a 3-letter extension that immediately follows the main part of the file name. Pairs of corresponding files in "found" translation may have distinct identifier strings (one with IL12 in the initial file name field, and one with ENG in that field), if they were harvested independently of each other and were later found to contain parallel content. Alternately, some sources of found translation data present their own source and translated text as a single unit, in which case the corresponding pair of files will have a single identifier string, and the English member of the pair will have ".eng" appended. In the former case, the alignment data specifies how the IL12 and ENG files are paired. 2.3 Genres Five genres are represented in this data set, as follows: NW - news and general text harvested from news sites SN - "social network" data (i.e. Twitter) WL - weblog and newsgroup data DF - discussion forum data RF - data from "reference" materials, including religious text, government/NGO information sites, etc. Note that the SN (Twitter) data cannot be distributed directly by LDC, due to the Twitter Terms of Use. Files named "twitter_info.tab" (described in Section 6.0 below, and found in the "docs" directory of sets 0 and E) provide the necessary information for users to fetch the particular tweets directly from Twitter. 
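
The file name fields described in section 2.2 can be pulled apart mechanically. The Python sketch below is an illustration only and is not part of the release. It assumes that the fields are joined by underscores (as in the names listed in section 4.5, e.g. ENG_NW_021064_20180924_J0040VH5S), that the language field is either a 3-letter code or "IL12", and that any format suffix such as ".ltf.xml" has already been removed. The names DocName, parse_doc_name and day_is_known are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DocName(NamedTuple):
    language: str               # e.g. "ENG" or "IL12"
    genre: str                  # one of NW, SN, WL, DF, RF (section 2.3)
    source: str                 # 6-digit source code
    date: str                   # 8-digit date, YYYYMMDD (day may be "00")
    index: str                  # 9-character unique index
    translation: Optional[str]  # 3-letter extension such as "eng", if present


# Assumption: fields are separated by underscores and the optional
# translation language follows the main part of the name after a dot.
NAME_RE = re.compile(
    r"^(?P<language>[A-Z0-9]{3,4})_"
    r"(?P<genre>NW|SN|WL|DF|RF)_"
    r"(?P<source>\d{6})_"
    r"(?P<date>\d{8})_"
    r"(?P<index>[A-Z0-9]{9})"
    r"(?:\.(?P<translation>[a-z]{3}))?$"
)


def parse_doc_name(name: str) -> DocName:
    """Split a document name into its fixed-width fields."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized document name: {name!r}")
    return DocName(**match.groupdict())


def day_is_known(doc: DocName) -> bool:
    """False when the day field was left "incomplete" (set to 00)."""
    return doc.date[6:] != "00"
```

For example, parse_doc_name("ENG_NW_021064_20180924_J0040VH5S") yields the language "ENG", genre "NW", source "021064" and date "20180924". Names that do not fit the assumed pattern raise ValueError rather than being guessed at.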
3.0 Content Summary

3.1 Set 0

3.1.1 Monolingual text

Document and token counts of monolingual text by genre:

  Genre   N_Docs   N_Tokens
  NW         972     464840
  RF        6120    4604109
  WL        1986     493178

3.1.2 Parallel and comparable text

Parallel text document and token count by genre (counts based on Ilocano documents):

  Genre   N_Docs   N_Tokens
  RF        3947    3247144

All parallel text is aligned at the sentence level.

Parallel text for Ilocano and English can be found in set0/data/translation/, which contains the following structure of subdirectories:

  found/
      sentence_alignment/
      eng/{ltf,psm}/
      il12/{ltf,psm}/

The "found" data set consists of files from web data sources that had parallel text content in Ilocano and English. Each "leaf" directory in the tree (*/ltf, */psm, sentence_alignment) contains a matched set of data files.

Parallel file pairs were identified and harvested automatically, processed into LTF.xml format, and then aligned at the level of "segments" (putative sentences). The alignment files (*.align.xml) contain one or more "alignment" elements, in which one or more "source" (English) segments are associated with one or more "translation" (Ilocano) segments. It's not assured that all segments in a given (Ilocano or English) data file are accounted for in a given set of alignments.

The sentence alignment files contain references to the source document and the translation document (both files can be found in their respective directories), and multiple "alignment" elements, each of which contains one source element and one translation element. The "segments" attribute of the source and translation element contains space-delimited segment ids referring to SEG IDs in the corresponding ltf files.

NB: We refer to English as the "source" purely as a matter of convenience and consistency across language packs; we do not have confirmable evidence as to the true original language of a given data file.
In fact, for some web data sources, it may be the case that documents were translated from some third language into both English and Ilocano.

3.1.3 Lexical and grammatical resources

The set0/docs/ directory contains two subdirectories:

categoryI_dictionary/

  This directory contains:

  -- CategoryI_dictionaryinfo_IL12.pdf: a single-page listing URLs of recommended resources for Ilocano dictionary data
  -- IL12_dictionary.txt: an Ilocano-English translation lexicon with roughly 12,000 entries

categoryII/

  LORELEI Incident Language packs were required to contain (pointers to) at least 5 of the following 8 "category II" resources:

  -- bilingual IL-non-English dictionary
  -- monolingual IL dictionary
  -- bilingual grammar (reference grammar of the IL in English)
  -- monolingual grammar in the IL
  -- monolingual primer (grammar in the IL of the type used by school children)
  -- bilingual gazetteer
  -- monolingual gazetteer in the IL
  -- monolingual gazetteer in English covering the incident region

  The categoryII directory contains:

  -- CategoryII_resources_IL12.pdf: information and URLs for available resources
  -- english_gazetteer.txt: entries drawn from Geonames (www.geonames.org) for the Philippines.

Other materials contained in set0/docs:

  IL12_incident_description.pdf
  SimpleNamedEntity_English_Guidelines_V1.0.pdf
  SimpleNamedEntityGuidelines_IL12_V1.0.pdf
  SituationFrameGuidelines_V5.1.pdf
  source_codes.tab
  twitter_info.tab
  urls.tab

3.2 Set 1

All data in this set is monolingual text in Ilocano from the date of the incident that serves as the focus of the evaluation and later. It may contain some information about the incident, but also contains documents whose content is not relevant to the incident in any way.

  Genre   N_Docs   N_Tokens
  NW         138      25850
  WL         131      40912

3.3 Set S

All data in this set is monolingual text in English from the date of the incident that serves as the focus of the evaluation and later.
It may contain some information about the incident, but also contains documents whose content is not relevant to the incident in any way.

  Genre   N_Docs   N_Tokens
  NW          12      11832
  RF           1       4660
  WL           4       8391

3.4 Set E

3.4.1 Monolingual Text

This data set provides monolingual source data for the LORELEI 2019 Evaluation Test Set in Ilocano and English. All data in this set is from the date of the incident that serves as the focus of the evaluation and later.

  Ilocano
  Genre   N_Docs   N_Tokens
  NW         185      57359
  SN         353       5325
  WL         127      34536
  total      665      97220

  English
  Genre   N_Docs   N_Tokens
  NW          47      25043
  RF           7       3762
  SN         598      16612
  WL          18       9016
  total      670      54433

Because annotations obey the "full-token rule", meaning that all reference annotation extents coincide with token boundaries as provided by the automatic tokenization process, it was deemed to be important for participants in the evaluation to be able to match the LDC's tokenization for Twitter documents that they retrieved directly from the Twitter API. For this reason, in set E only, the monolingual_text directory contains "scrubbed" ltf for Twitter documents. These ltf documents contain none of the actual tweet content, but instead contain a series of underscores and whitespace which allow users to match the tokenization of the tweet via the character offsets provided in the ltf file.

3.4.2 Translation

Human reference translations were provided for a subset of the data in the test set.
  Genre   N_Docs   N_Tokens
  NW         266      89842
  SN         706      10650
  WL         172      44372
  total     1144     144864

The translation/ directory under setE/data/ contains source and reference translation files, as follows:

  il12/{ltf,psm}/ -- contain 572 ltf/psm pairs
  eng/{ltf,psm}/ -- contain 572 ltf/psm pairs

3.4.3 Annotation

Entity Detection and Linking and Situation Frame annotations were applied to a subset of the data, in order to identify "entities", "needs", "issues" and "sentiments" to be detected by systems for scoring purposes:

  English:
  Genre   N_Docs   N_Tokens
  ------------------------
  NW        1753    1140592
  RF         399     300981
  SN        1472      48680
  WL         413     298719
  total     4037    1788972

  Ilocano:
  Genre   N_Docs   N_Tokens
  ------------------------
  NW        2486    1056972
  SN         232       4417
  WL         745     244736
  total     3463    1306125

Some of the files that received annotation did not yield annotatable content for one or more annotation types. The next table shows the number of files containing reference annotations of each type for each genre:

              Number of Files containing:
  Lng_Genre   Ents   Needs   Issues   Sentiments
  ----------------------------------------------
  ENG_NW        47      33       20           28
  ENG_RF         1       7        1            1
  ENG_SN       515     101       88            4
  ENG_WL        34       7        6            9
  IL12_NW      111      80       31           63
  IL12_SN      157       5        0            4
  IL12_WL       34      24        1           17

The setE/data/annotation/ directory contains subdirectories for "eng" and "il12"; each of these contains a tab-delimited file ("eng_edl.tab", "il12_edl.tab") containing the entity linking annotation, along with a set of directories containing situation frame annotation as follows:

  situation_frame/ -- contains subdirectories for each type:
      issues/
      mentions/
      needs/
      sentiments/

Situation Frame annotation is designed to extract basic information about where needs (such as a need for food) and relevant issues (such as civil unrest) exist; the information is designed to be of the type that would be useful for planning a disaster response effort. For more detailed information about situation frame annotation, see https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/smerp2017.pdf.
Guidelines for both the EDL and Situation Frame tasks are included in the docs/ directory of set0. 4.0 Data Formats The data formats described below are common across all sets. 4.1 PSM - Primary Structural Markup When original data has structural markup interleaved with the language content, we apply a filtering process that, in effect, separates the markup and language content into distinct files. The language content (with white-space normalization) goes into an RSD file (see below), and the relevant markup content goes into a corresponding PSM file, which is a simple XML stream comprising tags with attributes, and no other text content of its own. (Configuring the filter for a given data source involves determining which content and markup are "relevant"; the filter eliminates other content and markup as irrelevant, such as ads, navigation menus, etc.) Each PSM file has a single "psm" tag as its root element, and contains one or more "string" tags. Each "string" refers to some span of text in the corresponding RSD file, using "begin_offset" and "char_length" attributes, and assigns a label to it, using a "type" attribute. (Note that offsets and lengths are expressed as Unicode CHARACTER counts, not byte counts.) The "type" attribute tells what sort of markup tag was used in the original data to contain the given string (e.g. "p", "quote", etc.); when sentence segmentation can be done as part of the filtering step, a "string" tag with type="seg" is used to label the span of each detected sentence. Some structural tags in original data contain attributes that may be relevant to language research; for example, in a file that contains a thread from a discussion forum, it's useful to keep track of the dates and authors of posts within the thread. 
For these cases, the "string" element can contain one or more "attribute" elements, to preserve the name and value of the given attribute - e.g.:

As shown in this example, the "attribute" tag is also used, where appropriate, to assign an ID value (unique within the file) to each string of a given type; this is also used with the "seg"-type strings to assign IDs to detected sentences.

PSM files appear in the data/monolingual_text/ and data/translation/ directories of each set.

4.2 LTF - LORELEI Text Format

LTF was originally developed for language packs produced in the REFLEX Program ("LCTL Text Format"). This XML format uses structural tags "SEG" and "TOKEN" to mark sentence segmentation and word tokenization of the source data. The full original text of each sentence (SEG) is contained in an "ORIGINAL_TEXT" tag, and each individual word and punctuation string is contained, in order of occurrence, in a sequence of "TOKEN" elements, along with various attributes for each token. Both SEG and TOKEN attributes include character offsets relative to the beginning of the raw source data ("RSD" file format, described below), with the offset of the first character being 0.

LTF files appear in the data/monolingual_text/ and data/translation/ directories of each set.

4.3 EDL (Entity Detection and Linking)

The file "il12_edl.tab" contains all EDL annotations for the IL12 EDL subset.
The table contains eight columns, as follows:

  column 1: system_run_id -- "LDC"
  column 2: mention_id
  column 3: mention_text
  column 4: extents
  column 5: kb_id -- numeric-ID or "NIL"+numeric, may contain multiple KB links separated by | ("pipe" symbol)
  column 6: entity_type
  column 7: mention_type
  column 8: confidence

When column 5 is fully numeric, it is a citation to a numbered entity in the LORELEI Entity Detection and Linking Knowledge Base (distributed separately as LDC2020T10); when it consists of "NIL" plus digits, it refers to an entity that is not present in the Knowledge Base, but this label is used consistently for all mentions of the particular entity.

Note that for any annotated Twitter documents, text extents have been replaced by underscore ("_") characters to comply with the prohibition against distributing the text of tweets directly. Character offsets can be used to align the annotations with the tweets once the user has downloaded them using Twitter's API.

4.4 Situation Frame

Situation frame annotation consists of four parts, each presented as a separate tab-delimited file: entities, needs, issues, and sentiments. The details of each table are described below.

Entities, mentions, need frames, and issue frames all have IDs that follow a standard schema consisting of a prefix designating the type of ID ('Ent' for entities, 'Men' for mentions, and 'Frame' for both need and issue frames), an alphanumeric string identifying the annotation "kit", and a numeric string uniquely identifying the specific entity, mention, or frame within the document.

4.4.1 Entities

The grouping of entity mentions into "selectable entities" for situation frame annotation is provided in the mentions/ subdirectory.
The table has 8 columns with the following headers and descriptions:

  column 1: doc_id -- doc ID of source file for the annotation
  column 2: entity_id -- unique identifier for each grouped entity
  column 3: mention_id -- unique identifier for each entity mention
  column 4: entity_type -- one of PER, ORG, GPE, LOC
  column 5: mention_status -- 'representative' or 'extra'; representative mentions are the ones which have been chosen by the annotator as the representative name for that entity. Each entity has exactly one representative mention.
  column 6: start_char -- character offset for the start of the mention
  column 7: end_char -- character offset for the end of the mention
  column 8: mention_text -- mention string

Again, note that for any annotated Twitter documents, text extents have been replaced by underscore ("_") characters to comply with the prohibition against distributing the text of tweets directly.

4.4.2 Needs

Annotation of need frames is provided in the needs/ subdirectory. Each row in the table represents a need frame in the annotated document.
The table has columns numbered 1 through 15, with the following headers and descriptions (no description is given for column 10):

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of source file for the annotation
  column 3: frame_id -- unique identifier for each frame
  column 4: frame_type -- 'need'
  column 5: need_type -- exactly one of 'evac' (evacuation), 'food' (food supply), 'search' (search/rescue), 'utils' (utilities, energy, or sanitation), 'infra' (infrastructure), 'med' (medical assistance), 'shelter' (shelter), or 'water' (water supply)
  column 6: place_id -- entity ID of the LOC or GPE entity identified as the place associated with the need frame; only one place value per need frame, must match one of the entity IDs in the corresponding ent_output.tsv or be 'none' (indicating no place was named)
  column 7: proxy_status -- 'True' or 'False'
  column 8: need_status -- 'current', 'future' (future only), or 'past' (past only)
  column 9: scope -- one of: none, 1_smallgroup, 2_largegroup, 3_municipality, 4_region
  column 11: resolution_status -- 'sufficient' or 'insufficient' (insufficient / unknown sufficiency)
  column 12: reported_by -- entity ID of one or more entities reporting the need; multiple values are comma-separated, must match entity IDs in the corresponding ent_output.tsv or be 'none'
  column 13: resolved_by -- entity ID of one or more entities resolving the need; multiple values are comma-separated, must match entity IDs in the corresponding ent_output.tsv or be 'none'
  column 14: description -- string of text entered by the annotator as a memory aid during annotation, no requirements for content or language, may be 'none'
  column 15: kb_id

4.4.3 Issues

Annotation of issue frames is provided in the issues/ subdirectory. Each row in the table represents an issue frame in the annotated document.
The table has 15 columns with the following headers and descriptions:

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of source file for the annotation
  column 3: frame_id -- unique identifier for each frame
  column 4: frame_type -- 'issue'
  column 5: issue_type -- exactly one of 'regimechange' (regime change), 'crimeviolence' (civil unrest or widespread crime), or 'terrorism' (terrorism or other extreme violence)
  column 6: place_id -- entity ID of the LOC or GPE entity identified as the place associated with the issue frame; only one place value per issue frame, must match one of the entity IDs in the corresponding ent_output.tsv or be 'none'
  column 7: proxy_status -- 'True' or 'False'
  column 8: issue_status -- 'current' or 'not_current'
  column 9: scope -- '1_smallgroup', '2_largegroup', '3_municipality', '4_region' or 'none'
  column 10: severity -- '1_discomfort', '2_injury', '3_possibledeath', '4_certaindeath' or 'none'
  column 11: resolution_status -- 'sufficient' or 'insufficient' (insufficient / unknown sufficiency)
  column 12: reported_by -- entity ID of one or more entities reporting the issue; multiple values are comma-separated, must match entity IDs in the corresponding ent_output.tsv or be 'none'
  column 13: resolved_by -- entity ID of one or more entities resolving the issue; multiple values are comma-separated, must match entity IDs in the corresponding ent_output.tsv or be 'none'
  column 14: description -- string of text entered by the annotator as a memory aid during annotation, no requirements for content or language, may be 'none'
  column 15: kb_id

4.4.4 Sentiments

Annotation of sentiment frames is provided in the sentiments/ subdirectory. Each row in the table represents a sentiment frame in the annotated document.
The table has 8 columns with the following headers and descriptions:

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of source file for the annotation
  column 3: sentiment_value -- numeric between 3 and -3 in increments of 0.5
  column 4: polarity -- "positive" or "negative" (correlates with col.3)
  column 5: emotion_value -- one or more of "fear, anger, joyhappiness, none"
  column 6: source -- sentiment holder ("author", "other" or entity-ID)
  column 7: target -- frame-ID value assigned to an issue or need
  column 8: kb_id -- one of: author, other, none, NIL# or numeric ID from the LORELEI Knowledge Base (LDC2020T10)

4.5 Known gaps in Situation Frame tables

As of the current release, each of the following files is known to have one or two empty cells, as shown:

  eng/situation_frame/needs/
    ENG_NW_021064_20180924_J0040VH5S.needs.tab (2 rows lack "description")

  eng/situation_frame/sentiments/
    ENG_NW_020577_20180914_J0040VH2B.sentiments.tab (1 row lacks "source")
    ENG_SN_000370_20181002_J0T00DFQA.sentiments.tab (1 row lacks "source")
    ENG_SN_000370_20190507_J0T00J2MO.sentiments.tab (1 row lacks "source")
    ENG_SN_000370_20190507_J0T00J3HT.sentiments.tab (1 row lacks "source")
    ENG_SN_000370_20190507_J0T00J6XY.sentiments.tab (1 row lacks "source")
    ENG_SN_000370_20190507_J0T00J7RG.sentiments.tab (1 row lacks "source")
    ENG_SN_000370_20190507_J0T00J8K2.sentiments.tab (1 row lacks "source")
    ENG_SN_000370_20190507_J0T00J9ID.sentiments.tab (1 row lacks "source")

  il12/situation_frame/sentiments/
    IL12_NW_020891_20181004_J0040X9GL.sentiments.tab (1 row lacks "source")
    IL12_NW_021121_20181009_J0040W2IE.sentiments.tab (1 row lacks "source")
    IL12_NW_021121_20190223_J0040X9FA.sentiments.tab (1 row lacks "sentiment_value" and "source")

5.0 Software tools included in this release

All software tools are provided in the tools/ directory of Set 0.
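
Users who load the Situation Frame tables may want to locate the empty cells listed in section 4.5 before processing. The Python sketch below is not one of the released tools; it is a minimal illustration that assumes each table is tab-delimited with a header row as its first line. The function name find_empty_cells is hypothetical.

```python
import csv
from typing import Iterable, List, Tuple


def find_empty_cells(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (row_number, column_header) for every empty cell.

    Row numbers count data rows from 1; the header row is not counted.
    Assumption: the first line of the table holds the column headers.
    """
    # QUOTE_NONE keeps quotation marks in annotator text as plain characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return []
    gaps = []
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip completely blank lines
        # Pad short rows so that missing trailing cells are reported too.
        padded = row + [""] * (len(header) - len(row))
        for name, cell in zip(header, padded):
            if cell.strip() == "":
                gaps.append((row_number, name))
    return gaps
```

A typical call would pass an open file handle, for example one opened with open(path, encoding="utf-8", newline=""), where the UTF-8 encoding is likewise an assumption. Reporting the gaps instead of filling them leaves the choice of how to treat a missing "source" or "description" value to the user.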
5.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data.

In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or you can simply view the script content directly by whatever means to read the documentation. Also, running either script without any command-line arguments will cause it to display a one-line synopsis of its usage, and then exit.

  ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

5.2 "twitter-processing" (source code written in Ruby)

Due to the Twitter Terms of Use, the text content of individual tweets cannot be redistributed by the LDC. As a result, users must download the tweet contents directly from Twitter and condition/normalize the text in a manner equivalent to what was done by the LDC, in order to reproduce the Ilocano raw text that was used by LDC for annotation. The twitter-processing software provided in the tools/ directory enables users to perform this normalization and ensure that the user's version of the tweet matches the version used by LDC, by verifying that the md5sum of the user-downloaded and processed tweet matches the md5sum provided in the twitter_info.tab file.
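
The final comparison step can be illustrated independently of the released Ruby code. The Python sketch below does not perform the LDC normalization and does not replace the twitter-processing tool; it only shows how an already processed tweet text might be checked against its reference checksum. It assumes the column order given for twitter_info.tab in Section 6.0 (doc uid, tweet id, md5, author id) and assumes the checksum was computed over the UTF-8 encoding of the processed text. The function names are hypothetical.

```python
import csv
import hashlib
import re
from typing import Dict, Iterable

# A reference checksum is expected to be 32 hexadecimal digits.
MD5_RE = re.compile(r"^[0-9a-fA-F]{32}$")


def load_reference_md5(lines: Iterable[str]) -> Dict[str, str]:
    """Map doc uid -> reference md5 from twitter_info.tab-style rows.

    Assumed column order: doc uid, tweet id, md5, author id.
    Rows whose third column is not an md5 (such as a header row,
    if the file has one) are skipped.
    """
    table = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 3 and MD5_RE.match(row[2]):
            table[row[0]] = row[2].lower()
    return table


def matches_reference(processed_text: str, reference_md5: str) -> bool:
    """Compare the md5 of the processed text with the reference value.

    Assumption: the reference was computed over UTF-8 encoded text.
    """
    digest = hashlib.md5(processed_text.encode("utf-8")).hexdigest()
    return digest == reference_md5.lower()
```

If a tweet fails this check, the most likely cause is a difference in normalization, so the text should be reprocessed with the released tool, not edited by hand.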
Users must have a developer account with Twitter in order to download tweets, and the tool does not replace or circumvent the Twitter API for downloading tweets. The twitter_info.tab file provides the twitter download id for each tweet, along with the LORELEI file name assigned to that tweet and the md5sum of the processed text from the tweet. The file "README.md" in the tools/twitter-processing/ directory provides details on how to install and use the source code in this directory in order to condition text data that the user downloads directly from Twitter and produce both the normalized raw text and the segmented, tokenized LTF.xml output. 5.3 Encoding The common framework for text processing in LORELEI includes a “normalization” step, which allows for rectifying variations in orthography and/or punctuation that may occur with some frequency in this or that particular language. For overall simplicity and consistency in processing across all languages, this normalization step is always invoked; in languages such as Ilocano that require no special normalization, this step leaves the data unchanged. 6.0 Documentation included in this release Each set has its own docs directory, but some file types are consistent across the sets, as described below. IL12_incident_description.pdf: provides a description and additional links and information about the incidents that were the focus of the evaluation data set. Found in set0/docs/ only. EntityLinkingGuidelines_V1.2.1.pdf, SimpleNamedEntityGuidelines_IL12_V1.0.pdf, SimpleNamedEntity_English_Guidelines_V1.0.pdf, SituationFrameGuidelines_V5.1.pdf: guidelines for entity annotation, entity linking, and situation frame annotation. Found in set0/docs/ only. twitter_info.tab: contains tab-separated columns: doc uid, tweet id, normalized md5 of the tweet text, and tweet author id for all tweets in the release. Found in set0 and setE. 
source_codes.tab: contains tab-separated columns: genre, source code, source name, and base url for each source in the release. Found in all sets.

urls.tab: contains tab-separated columns: doc uid and url. Note that the url column is empty for documents from older releases for which the url is not available; they are included here so that the uids column can serve as a document list for the package. Found in all sets.

annotated_filelist_EDL_SF.txt, annotated_filelist_MT.txt: list of all files annotated for the Entity Detection & Linking and Situation Frame tasks, and all files with human reference translations. Found in setE only.

domain_filelist.tab: lists all documents for which human reference translations were produced and provides a domain judgement: eval_incident (document contains information about the incidents that were the focus of the evaluation), indomain (document is relevant to the overall LORELEI domain of humanitarian assistance and disaster relief and related situations, but not specifically the incident of focus), or nondomain (document is of unspecified topic, not related to the LORELEI domain or incidents). Found in setE/docs/ only.

filelist.txt: lists the doc id for all documents in set E. Found in setE/docs/ only.

7.0 Acknowledgements

The authors would like to acknowledge the following contributors to this corpus: Song Chen, Dana Delgado, Neville Ryant, Brian Gainor, Neil Kuster, University of Maryland Applied Research Laboratory for Intelligence and Security (ARLIS), formerly UMD Center for Advanced Study of Language (CASL), and our team of Ilocano annotators.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
8.0 Copyright Portions © 2019 agenparl.eu, © 2018 American Radio Relay League, © 2016-2019 Amianan Balita Ngayon, © 2019 BBC, © 2018-2019 Bombo Radyo Philippines, © 2018-2019 Cable News Network. Turner Broadcasting System, Inc., © 2018 CBC/Radio-Canada, © 2019 Condé Nast, © 2018 Cordillera Peoples Alliance, © 2018 Crisis Group, © 2019 Dow Jones & Company, Inc., © 2018 Edipresse Media Asia Limited, ©2018 Express Newspapers, © 2018 Galadari Printing and Publishing LLC, © 2018 GardaWorld, © 2019 Got Questions Ministries, © 2018 Guardian News and Media Limited or its affiliated companies, © 2017-2019 Ilocos Sentinel - The Forerunner in Weekly News, © 2019 ISA, International Sociological Association, © 2018 Jpost Inc., © 2018 Los Angeles Times, © 2018 mb.com.ph, © 2018 Microsoft, © 2018 Mindanews, © 2019 National Geographic Society|National Geographic Partners, LLC, © 2018 News Pty Limited, © 2005-2019 Northern Dispatch, © 2018 npr, © 2018 Pacific Media Centre, © 2019 POLITICO LLC, © 2018 primer.com.ph, © 2018-2019 Remate News Central, © 2017-2019 RMN Networks, © 2018 Rogers Media, © 2018 South China Morning Post Publishers Ltd., © 2018 Special Broadcasting Service Corporation, © 2018-2019 SunStar Publishing Inc., © 2016-2019 Tawid News Magazine, © 2019 Telegraph Media Group Limited, © 2018 The Irish Times, © 2019 The New York Times Company, © 2018 The Social Justice Foundation, © 2018 The Times of Israel, © 2018 Toronto Star Newspapers Ltd., © 2019 United Methodist Communications, © 2018 United Nations Office for the Coordination of Humanitarian Affairs, © 2018 United Press International, Inc., © 2019 Vox Media, Inc., © 2018 Wells Media Group, Inc., © 2018 Winslow Record, © 2019 Trustees of the University of Pennsylvania 9.0 Contacts If you have questions about this data release, please contact the following personnel at LDC. Jonathan Wright - LORELEI Technical Lead