README FILE FOR LDC CATALOG ID: LDC2020Tnn

TITLE: LORELEI Bengali Representative Language Pack

AUTHORS: Jennifer Tracey, Stephanie Strassel, Dave Graff, Jonathan Wright, Song Chen, Neville Ryant, Seth Kulick, Kira Griffitt, Dana Delgado, Michael Arrigo

1.0 Introduction

This corpus was developed by the Linguistic Data Consortium for the DARPA LORELEI Program and consists of over 144 million words of monolingual Bengali text, approximately 358,000 words of which are translated into English. Another 96,000 Bengali words are also translated from English data, and 2 million words of found parallel text are included. Approximately 86,000 words are annotated for named entities, and up to 25,000 words with several additional types of annotation (full entity including nominals and pronouns, simple semantic annotation, situation frame annotation, entity linking, and noun phrase chunking). Details of data volumes for each type of annotation are provided in section 3 of this README.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building Human Language Technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs for over 2 dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages are selected to provide broad typological coverage, while Incident Languages are selected to evaluate system performance on a language whose identity is disclosed at the start of the evaluation, and for which no training data has been provided.

This corpus comprises the complete set of monolingual and parallel text, lexicon, annotations, and tools from the LORELEI Bengali Representative Language Pack.
For more information about LORELEI language resources, see https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2020-lorelei-language-packs.pdf.

2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  ./dtds/
  ./dtds/ltf.v1.5.dtd
  ./dtds/psm.v1.0.dtd
  ./dtds/sentence_alignment.v1.0.dtd
  ./dtds/cstrans_tab.v1.0.dtd
  ./dtds/laf.v1.2.dtd
  ./dtds/llf.v1.6.dtd

  ./docs/ -- various tables and listings (see section 9 below)
  ./docs/README.txt -- this file
  ./docs/cstrans_tab/ -- supplemental data regarding crowd-source translations
  ./docs/annotation_guidelines/ -- guidelines for all annotation tasks included in this corpus
  ./docs/grammatical_sketch/ -- grammatical sketch of Bengali

  ./tools/ -- see section 8 below for details about tools provided
  ./tools/ldclib/
  ./tools/ltf2txt/
  ./tools/sent_seg/
  ./tools/tokenization_parameters.v5.0.yaml
  ./tools/ben/

  ./data/monolingual_text/zipped/ -- zip-archive files containing monolingual "ltf" and "psm" data

  ./data/translation/
      found/{ben,eng,sentence_alignment} -- found parallel text with sentence alignments between the Bengali and English documents
      from_ben/{ben,eng}/ -- translations from Bengali to English
      from_eng/ -- translations from English to Bengali
          {elicitation,news,phrasebook}/ for each of three types of English data:
              {ben,eng}/
      for each language in each directory, "ltf" and "psm" directories contain corresponding data files

  ./data/annotation/ -- see section 5 below for details about annotation
  ./data/annotation/entity/
  ./data/annotation/sem_annotation/
  ./data/annotation/situation_frame/
  ./data/annotation/twitter_tokenization/

  ./data/lexicon/

2.2 File Name Conventions

There are 93 *.ltf.zip files in the monolingual_text/zipped directory, together with the same number of *.psm.zip files. Each {ltf,psm}.zip file pair contains an equal number of corresponding data files.
The "file-ID" portion of each zip file name corresponds to common substrings in the file names of all the data files contained in that archive. For example:

  ./data/monolingual_text/zipped/BEN_NW_70R002.ltf.zip contains:

    ltf/BEN_NW_000093_20050929_70R002FVS.ltf.xml
    ltf/BEN_NW_000093_20051019_70R002PDA.ltf.xml
    ...

  ./data/monolingual_text/zipped/BEN_NW_70R002.psm.zip contains:

    psm/BEN_NW_000093_20050929_70R002FVS.psm.xml
    psm/BEN_NW_000093_20051019_70R002PDA.psm.xml
    ...

The file names assigned to individual documents within the zip archive files provide the following information about the document:

  Language   3-letter abbrev.
  Genre      2-letter abbrev.
  Source     6-digit numeric ID assigned to data provider
  Date       8-digit numeric: YYYYMMDD (year, month, day)
  Global-ID  9-character alphanumeric assigned to this document

Those five fields are joined by underscore characters, yielding a 32-character file-ID; three portions of the document file-ID are used to set the name of the zip file that holds the document: the Language and Genre fields, and the first 6 characters of the Global-ID.

The 2-letter codes used for genre are as follows:

  NW -- news
  SN -- social network (Twitter)
  WL -- web-log

3.0 Content Summary

3.1 Monolingual Text

  Genre     #Docs        #Words
  NW      172,318    36,725,034
  SN        8,931        98,018
  WL       96,695   117,695,470
  Total   277,944   154,518,522

Note that the SN (Twitter) data cannot be distributed directly by LDC, due to the Twitter Terms of Use. The file "docs/twitter_info.tab" (described in Section 8.2 below) provides the necessary information for users to fetch the particular tweets directly from Twitter.
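
As an illustration of the file-naming convention in section 2.2, the following Python sketch splits a document file-ID into its five fields and derives the name stem of the zip archive that holds it. It is not part of the released tools; the function names and the exact character classes in the pattern (upper-case letters for Language and Genre, upper-case letters and digits for the Global-ID) are assumptions inferred from the examples in this README.

```python
import re

# Pattern for the 32-character file-ID described in section 2.2.
# The character classes are assumptions based on the README's examples.
FILE_ID_RE = re.compile(
    r"^(?P<lang>[A-Z]{3})_"        # Language: 3-letter abbreviation
    r"(?P<genre>[A-Z]{2})_"        # Genre: 2-letter abbreviation
    r"(?P<source>\d{6})_"          # Source: 6-digit data provider ID
    r"(?P<date>\d{8})_"            # Date: YYYYMMDD
    r"(?P<global_id>[A-Z0-9]{9})$" # Global-ID: 9 alphanumeric characters
)


def parse_file_id(file_id):
    """Return the five file-ID fields as a dict, or raise ValueError."""
    match = FILE_ID_RE.match(file_id)
    if match is None:
        raise ValueError(f"not a recognized file-ID: {file_id!r}")
    return match.groupdict()


def zip_stem(file_id):
    """Build the zip name stem: Language, Genre, first 6 chars of Global-ID."""
    fields = parse_file_id(file_id)
    return f"{fields['lang']}_{fields['genre']}_{fields['global_id'][:6]}"


if __name__ == "__main__":
    example = "BEN_NW_000093_20050929_70R002FVS"
    print(parse_file_id(example))
    print(zip_stem(example))  # BEN_NW_70R002
```

For the example document above, the derived stem matches the archive names BEN_NW_70R002.ltf.zip and BEN_NW_70R002.psm.zip shown in section 2.2.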

3.2 Parallel Text

  Type     Genre   #Docs     #Segs      #Words
  Found    NW      4,761   137,340   1,948,814
  Found    WL         62     1,757      23,050
  FromEng  EL          2     3,723      18,665
  FromEng  NW        196     3,977      77,788
  ToEng    NW        111     2,700      33,012
  ToEng    WL        479    30,433     324,497
  Total            5,611   179,930   2,425,826

Crowd-Source Translation Versions:

  Version  #Docs
  _A         122
  _B         118
  _C          98
  _D          61

3.3 Annotation

  AnnotType       Genre  #Docs   #Segs   #Words
  ---
  SimpleSemantic  NW        64   1,022   13,583
  SimpleSemantic  SN       225     225    2,281
  SimpleSemantic  WL        26     696    8,087
  ---
  SituationFrame  NW        64   1,022   13,583
  SituationFrame  SN       225     225    2,281
  SituationFrame  WL        25     676    7,857
  ---
  EntityFull      NW        29     495    6,493
  EntityFull      SN       132     132    1,302
  EntityFull      WL        10     255    2,621
  ---
  EntitySimp      NW       203   3,714   66,135
  EntitySimp      SN       862     862    8,808
  EntitySimp      WL        32     851    9,642
  ---
  EntityLinking   NW        29     495    6,493
  EntityLinking   SN       132     132    1,302
  EntityLinking   WL        10     255    2,621
  ---

4.0 Data Collection and Parallel Text Creation

Both monolingual text collection and parallel text creation involve a combination of manual and automatic methods. These methods are described in the sections below.

4.1 Monolingual Text Collection

Data is identified for collection by native speaker "data scouts," who search the web for suitable sources, designating individual documents that are in the target language and discuss the topics of interest to the LORELEI program (humanitarian aid and disaster relief). Each document selected for inclusion in the corpus is then harvested, along with the entire website when suitable. Thus the monolingual text collection contains some documents which have been manually selected and/or reviewed and many others which have been automatically harvested and were not subject to manual review.

4.2 Parallel Text Creation

Parallel text for LORELEI was created using three different methods, and each LORELEI language may have parallel text from one or all of these methods.
In addition to translation from each of the LORELEI languages to English, each language pack contains a "core" set of English documents that were translated into each of the LORELEI Representative Languages. These documents consist of news documents, a phrasebook of conversational sentences, and an elicitation corpus of sentences designed to elicit a variety of grammatical structures.

All translations are aligned at the sentence level. For professional and crowdsourced translation, the segments align one-to-one between the source and target language (i.e. segment 1 in the English aligns with segment 1 in the source language). For found parallel text, automatic alignment is performed and a separate alignment file provides information about how the segments in the source and translation are aligned.

Professionally translated data has one translation for each source document, while crowdsourced translations have up to four translations for each source document, designated by A, B, C, or D appended to the file name on the multiple translation versions.

5.0 Annotation

Five types of annotation are present in this corpus. Simple Named Entity tags names of persons, organizations, geopolitical entities, and locations (including facilities), while Full Entity also tags nominal and pronominal mentions of entities. Entity Discovery and Linking provides cross-document coreference of named entities via linking to an external knowledge base (the knowledge base used for LORELEI is released separately as LDC2020T10). Simple Semantic Annotation provides light semantic role labeling, capturing acts and states along with their arguments. Situation Frame annotation labels the presence of needs and issues related to emergent incidents such as natural disasters (e.g. food need, civil unrest), along with information such as location, urgency, and entities involved in resolving the needs.

Details about each of these annotation tasks can be found in docs/annotation_guidelines/.

6.0 Data Processing and Character Normalization for LORELEI

Most of the content has been harvested from various web sources using an automated system that is driven by manual scouting for relevant material. Some content may have been harvested manually, or by means of ad-hoc scripted methods for sources with unusual attributes.

All harvested content was initially converted from its original HTML form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure.

The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language.

7.0 Overview of XML Data Structures

7.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way, and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element.

For example, in a discussion-forum or web-log page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post.
After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.), and the character offsets of the text span comprising its content in the corresponding rsd.txt file.

7.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and the tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are only unique within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data. To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and (we hope) less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags). To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning).
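
The relationship between TOKEN offsets and the rsd.txt character stream can be sketched in Python as follows. This is an illustrative sketch only, not one of the released tools. It assumes that the offset attributes are named "start_char" and "end_char" and that "end_char" is inclusive; neither is stated in this section, so check the ltf DTD in the "dtds" directory before relying on them. The sample document and its "id" values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented miniature LTF document; the attribute names start_char/end_char
# and the inclusive end offset are assumptions, not taken from the DTD.
SAMPLE_LTF = """<LCTL_TEXT>
  <DOC id="EXAMPLE_DOC">
    <TEXT>
      <SEG id="segment-0" start_char="0" end_char="11">
        <TOKEN id="token-0-0" start_char="0" end_char="4">Hello</TOKEN>
        <TOKEN id="token-0-1" start_char="6" end_char="10">world</TOKEN>
        <TOKEN id="token-0-2" start_char="11" end_char="11">.</TOKEN>
      </SEG>
    </TEXT>
  </DOC>
</LCTL_TEXT>"""

# Stand-in for the content of the corresponding rsd.txt file.
SAMPLE_RSD = "Hello world."


def mismatched_tokens(ltf_xml, rsd_text):
    """Return the ids of TOKEN elements whose text differs from the
    rsd.txt slice addressed by their character offsets."""
    root = ET.fromstring(ltf_xml)
    bad = []
    for token in root.iter("TOKEN"):
        start = int(token.get("start_char"))
        end = int(token.get("end_char"))
        # end is treated as inclusive, hence the +1 on the slice bound.
        if rsd_text[start:end + 1] != (token.text or ""):
            bad.append(token.get("id"))
    return bad


if __name__ == "__main__":
    print(mismatched_tokens(SAMPLE_LTF, SAMPLE_RSD))  # []
```

An empty result means every token string agrees with the raw text at its stated offsets; a non-empty list names the tokens to inspect.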

7.3 CSTRANS_TAB.xml -- Crowd-source Translation Tables

The "./docs/cstrans_tab/" directory contains one "*.cstrans_tab.xml" file for each English source file that was submitted to translation via crowd sourcing. Each file contains a DOC element (with "id" and "lang" attributes), which in turn contains a "SEG" element for each "SEG" in the corresponding English ltf.xml file.

Each "SEG" element may either be an empty tag (if no usable translations were submitted for the given segment), or contain one or more "TR" elements, each of which is an alternative translation for the given source segment. In either case, the "SEG" tag has an "id" attribute (unique within the given xml file, matching the SEG "id" value in ltf.xml), and an "ntrs" attribute (whose value is the number of "TR" elements present). For example:

  ...
  ...

The attributes in the "TR" elements are as follows:

 - translatorid -- an alphanumeric string unique to each contributor; note that each translation "version" (_A, _B, etc) is likely to contain segments from different translators

 - avg_gold_ter -- may be floating-point numeric or "Unk"; it represents the "term error rate" relative to a "gold-standard" manual translation (lower value == better match)

 - score -- may be floating-point numeric or "None"

 - mt_ter -- is always floating-point numeric; it represents the "machine translation error rate" relative to a "google-translate" reference (lower value == better match)

 - nonwhitesp and odd_ch -- are always integer numerics: the count of non-whitespace characters in the string, and the count of characters that are "not in the expected language" (this can include emoticons, non-printing characters, and characters in foreign scripts).

7.4 LAF.xml -- Logical Annotation Format Data

The "laf.xml" data format provides a generic structure for presenting annotations on the text content of a given ltf.xml file; see the associated DTD file in the "dtds" directory.
Note that each type of annotation (simple named entity, full entity, semantic structure, NP chunking) uses the basic XML elements of LAF in different ways.

NB: For Twitter data, the LDC interprets the Twitter Terms of Use to mean that no original text content from tweets may be redistributed as part of an LDC corpus. Therefore, the EXTENT elements of *_SN_000370_*.laf.xml files are presented here with underscore characters ('_') in place of all non-white-space characters in annotated strings. In order to get the actual text content for these strings, users must download and process each tweet into plain-text format (using the software provided in the "tools" directory or equivalent), and use the character-offset information in the EXTENT tag to acquire the annotated string. A correct result from this process can only be assured if the user's plain-text file for the given tweet has an MD5 signature that matches the one given in the corresponding ltf.xml file.

In order to ensure that users can match the tokenization present in the annotated version of any Twitter data, a version of the ltf.xml files for annotated tweets with underscore characters ('_') in place of all non-white-space characters is provided in the data/annotation/twitter_tokenization/ directory.

7.5 LLF.xml -- LORELEI Lexicon Format Data

The "llf.xml" data format is a simple structure for presenting citation-form words (headwords or lemmas) in Bengali, together with Part-Of-Speech (POS) labels and English glosses.

Each ENTRY element contains a unique combination of LEMMA value (citation form in native orthography) and POS value, together with one or more GLOSS elements. Each ENTRY has a unique ID, which is included as part of the unique ID assigned to each GLOSS.

For Bengali, the data/lexicon directory also contains a tab-delimited plain-text table file of supplemental lexical data; each row of this table has four columns, whose names are given in the first line of the file:

  1. lemma_id -- numeric portion of the associated ENTRY ID in llf.xml
  2. gloss_id -- numeric portion of the associated GLOSS ID in llf.xml
  3. tag -- closed set of category labels
  4. value -- value assigned to the tag for the given ENTRY/GLOSS

7.6 Situation Frame Annotation Tables

Situation frame annotation consists of three parts, each presented as a separate tab-delimited file: entities, needs, and issues. The details of each table are described below.

Entities, mentions, need frames, and issue frames all have IDs that follow a standard schema consisting of a prefix designating the type of ID ('Ent' for entities, 'Men' for mentions, and 'Frame' for both need and issue frames), an alphanumeric string identifying the annotation "kit", and a numeric string uniquely identifying the specific entity, mention, or frame within the document.

7.6.1 Mentions

The grouping of entity mentions into "selectable entities" for situation frame annotation is provided in the mentions/ subdirectory. The table has 8 columns with the following headers and descriptions:

  column 1: doc_id -- doc ID of source file for the annotation
  column 2: entity_id -- unique identifier for each grouped entity
  column 3: mention_id -- unique identifier for each entity mention
  column 4: entity_type -- one of PER, ORG, GPE, LOC
  column 5: mention_status -- 'representative' or 'extra'; representative mentions are the ones which have been chosen by the annotator as the representative name for that entity. Each entity has exactly one representative mention.
  column 6: start_char -- character offset for the start of the mention
  column 7: end_char -- character offset for the end of the mention
  column 8: mention_text -- mention string

7.6.2 Needs

Annotation of need frames is provided in the needs/ subdirectory. Each row in the table represents a need frame in the annotated document.
The table has 13 columns with the following headers and descriptions:

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of source file for the annotation
  column 3: frame_id -- unique identifier for each frame
  column 4: frame_type -- 'need'
  column 5: need_type -- exactly one of 'evac' (evacuation), 'food' (food supply), 'search' (search/rescue), 'utils' (utilities, energy, or sanitation), 'infra' (infrastructure), 'med' (medical assistance), 'shelter' (shelter), or 'water' (water supply)
  column 6: place_id -- entity ID of the LOC or GPE entity identified as the place associated with the need frame; only one place value per need frame, must match one of the entity IDs in the corresponding ent_output.tsv or be 'none' (indicating no place was named)
  column 7: proxy_status -- 'True' or 'False'
  column 8: need_status -- 'current', 'future' (future only), or 'past' (past only)
  column 9: urgency_status -- 'True' (urgent) or 'False' (not urgent)
  column 10: resolution_status -- 'sufficient' or 'insufficient' (insufficient / unknown sufficiency)
  column 11: reported_by -- entity ID of one or more entities reporting the need; multiple values are comma-separated, must match entity IDs in the corresponding ent_output.tsv or be 'none'
  column 12: resolved_by -- entity ID of one or more entities resolving the need; multiple values are comma-separated, must match entity IDs in the corresponding ent_output.tsv or be 'none'
  column 13: description -- string of text entered by the annotator as memory aid during annotation, no requirements for content or language, may be 'none'

7.6.3 Issues

Annotation of issue frames is provided in the issues/ subdirectory. Each row in the table represents an issue frame in the annotated document.
The table has 9 columns with the following headers and descriptions:

  column 1: user_id -- user ID of the annotator
  column 2: doc_id -- doc ID of source file for the annotation
  column 3: frame_id -- unique identifier for each frame
  column 4: frame_type -- 'issue'
  column 5: issue_type -- exactly one of 'regimechange' (regime change), 'crimeviolence' (civil unrest or widespread crime), or 'terrorism' (terrorism or other extreme violence)
  column 6: place_id -- entity ID of the LOC or GPE entity identified as the place associated with the issue frame; only one place value per issue frame, must match one of the entity IDs in the corresponding ent_output.tsv or be 'none'
  column 7: proxy_status -- 'True' or 'False'
  column 8: issue_status -- 'current' or 'not_current'
  column 9: description -- string of text entered by the annotator as memory aid during annotation, no requirements for content or language, may be 'none'

7.7 EDL Table

The "./data/annotation/entity/" directory contains the file "ben_edl.tab", which has an initial "header" line of column names followed by data rows with 8 columns per row. The following shows the column headings and a sample value for each column:

  column 1: system_run_id   LDC
  column 2: mention_id      Men-BEN_NW_000093_20060304_70R002FWJ-31
  column 3: mention_text    জিগাতলা
  column 4: extents         BEN_NW_000093_20060304_70R002FWJ:646-652
  column 5: kb_id           7701348
  column 6: entity_type     LOC
  column 7: mention_type    NAM
  column 8: confidence      1.0

When column 5 is fully numeric, it refers to a numbered entity in the Reference Knowledge Base (distributed separately as LDC2020T10). Note that a given mention may be ambiguous as to the particular KB element it represents; in this case, two or more numeric KB_ID values will appear in column 5, separated by the vertical-bar character (|).

When column 5 consists of "NIL" plus digits, it refers to an entity that is not present in the Knowledge Base, but this label is used consistently for all mentions of the particular entity.
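
The three forms that column 5 can take, and the "docid:start-end" layout of column 4, can be handled along the following lines. This is a minimal Python sketch rather than a released tool: the function names are invented here, the value "NIL00001" is a made-up example of the "NIL" plus digits form, and the assumption that rows are tab-separated is inferred from the ".tab" file extension.

```python
def parse_extent(extent):
    """Split an extents value such as 'DOCID:646-652' into
    (doc_id, start_offset, end_offset)."""
    doc_id, span = extent.rsplit(":", 1)
    start, end = (int(part) for part in span.split("-"))
    return doc_id, start, end


def classify_kb_id(kb_id):
    """Classify column 5 as a knowledge-base link or a NIL cluster label.

    Returns ('kb', [ids...]) for one or more numeric KB IDs separated
    by '|', or ('nil', [label]) for a 'NIL' plus digits label.
    """
    if kb_id.startswith("NIL") and kb_id[3:].isdigit():
        return "nil", [kb_id]
    candidates = kb_id.split("|")
    if all(candidate.isdigit() for candidate in candidates):
        return "kb", candidates
    raise ValueError(f"unrecognized kb_id value: {kb_id!r}")


def parse_edl_row(line):
    """Parse one data row (assumed tab-separated) into a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 8:
        raise ValueError(f"expected 8 columns, found {len(columns)}")
    keys = ["system_run_id", "mention_id", "mention_text", "extents",
            "kb_id", "entity_type", "mention_type", "confidence"]
    return dict(zip(keys, columns))


if __name__ == "__main__":
    # Sample row assembled from the values shown in section 7.7.
    sample = "\t".join([
        "LDC", "Men-BEN_NW_000093_20060304_70R002FWJ-31", "জিগাতলা",
        "BEN_NW_000093_20060304_70R002FWJ:646-652", "7701348",
        "LOC", "NAM", "1.0",
    ])
    row = parse_edl_row(sample)
    print(parse_extent(row["extents"]))
    print(classify_kb_id(row["kb_id"]))
    print(classify_kb_id("NIL00001"))  # hypothetical NIL label
```

A mention with more than one candidate in the 'kb' result is an ambiguous link, as described above.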

8.0 Software tools included in this release

8.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data.

In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or you can simply view the script content directly by whatever means to read the documentation. Also, running either script without any command-line arguments will cause it to display a one-line synopsis of its usage, and then exit.

  ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

8.2 ldclib -- general text conditioning, twitter harvesting

The "bin/" subdirectory of this package contains three executable scripts (written in Ruby):

  create_rsd.rb -- convert general xml or plain-text formats to "raw source data" (rsd.txt), by removing markup tags and applying sentence segmentation
  token_parse.rb -- convert rsd.txt format into ltf.xml
  get_tweet_by_id.rb -- download and condition Twitter data

Due to the Twitter Terms of Use, the text content of individual tweets cannot be redistributed by the LDC.
As a result, users must download the tweet contents directly from Twitter and condition/normalize the text in a manner equivalent to what was done by the LDC, in order to reproduce the Bengali raw text that was used by LDC for annotation (to be released separately). The twitter-processing software provided in the tools/ directory enables users to perform this normalization and ensure that the user's version of the tweet matches the version used by LDC, by verifying that the md5sum of the user-downloaded and processed tweet matches the md5sum provided in the twitter_info.tab file. Users must have a developer account with Twitter in order to download tweets, and the tool does not replace or circumvent the Twitter API for downloading tweets.

The ./docs/twitter_info.tab file provides the twitter download id for each tweet, along with the LORELEI file name assigned to that tweet and the md5sum of the processed text from the tweet.

The file "README.md" in this directory provides details on how to install and use the source code in this directory in order to condition text data that the user downloads directly from Twitter and produce both the normalized raw text and the segmented, tokenized LTF.xml output.

All LDC-developed supporting files (models, configuration files, library modules, etc.) are included, either in the "lib" subdirectory (next to "bin"), or else in the parent ("tools") directory.

Please refer to the README.md file that accompanies this software package.

8.3 sent_seg -- apply sentence segmentation to raw text

The Python tools in this directory are used as part of the conditioning done by "create_rsd.rb" in the "ldclib" package. Please refer to the README.rst file included with the package.

8.4 ne_tagger -- Named-Entity tagger for Bengali

Please refer to the ./tools/ben/ne-tagger/README.rst file for information about usage and performance.

8.5 transliterator -- provides a Romanization for Bengali text

Please refer to the README files in ./tools/ben/transliterator/ for information about usage and performance.

9.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release) contains six files documenting various characteristics of the source data:

  char_tally.{lng}.tab - contains tab-separated columns: doc uid, number of non-whitespace characters, number of non-whitespace characters in the expected script, and number of anomalous (non-printing) characters for each document in the release

  source_codes.txt - contains tab-separated columns: genre, source code, source name, and base url for each source in the release

  twitter_info.tab - contains tab-separated columns: doc uid, tweet id, normalized md5 of the tweet text, and tweet author id for all tweets in the release

  urls.tab - contains tab-separated columns: doc uid and url. Note that the url column is empty for documents from older releases for which the url is not available; they are included here so that the uid column can serve as a document list for the package.

  crowdsource_yield.tab - provides information about the number of untranslated segments in different versions of the crowdsource translation for each file; crowdsource translation did not always yield a complete translation of each segment of the document, thus different versions (A, B, C, D) of a crowdsource file may have different numbers of translated segments

  annotated_crowdsource_coverage.tab - provides information about the number of translated/untranslated segments in any files selected for annotation for which the translation was performed via crowdsourcing

In addition, the grammatical sketch, annotation guidelines, and cstrans_tab contents described in earlier sections of this README are found in this directory.

10.0 KNOWN ISSUES

10.1 Professional translations from REFLEX treated as "found parallel"

The "./data/translation/found/" file inventory includes 556 file pairs that were drawn from the REFLEX Bengali language pack, where they had been presented as the "translation/from_ben" component of that corpus. In many of these file pairs, the Bengali and English files have different segment counts, and in at least some of the remaining pairs, it is likely that segment boundaries differ significantly (e.g. a single segment in one side of the pair corresponds to two or more segments in the other side, and the combined result of multiple discrepancies throughout the given document coincidentally yields the same total number of segments in both sides).

All 556 file pairs in this set are therefore treated as "found parallel", in order to apply the standard procedures for doing automatic segment-level alignment. This set of 556 file pairs is mixed in with a larger quantity of more recently harvested data from web sources known to contain parallel English/Bengali content. Users can tell the difference between REFLEX and more recent data by means of the final 9-character field in each data file name -- in all REFLEX files, this field begins with "_70R0" (as opposed to more recent data, where the field begins with "_G0" or "_H0").

11.0 Acknowledgements

The authors would like to acknowledge the following contributors to this corpus: Brian Gainor, Ann Bies, Justin Mott, Neil Kuster, University of Maryland Applied Research Laboratory for Intelligence and Security (ARLIS), formerly UMD Center for Advanced Study of Language (CASL), and our team of Bengali annotators.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

12.0 Copyright

Portions © 2017 Aaj Media Inc., © 2016 ABP News, © 2002-2007, 2009-2010 Agence France Presse, © 2000 American Broadcasting Company, © 2016-2017 Ananda Bazar, © 2015-2017 Bangladesh Pratidin, © 2011, 2016 Bangla News 24, © 2005-2006, 2016-2017 BBC, © 2016 Bhorekagoj, © 2000 Cable News Network, LP, LLP, © 2008 Central News Agency (Taiwan), © 2015-2017 Chandpur Kantho, © 2006, 2015-2017 China Radio International, © 1989 Dow Jones & Company, Inc., © 2017 Eibela.com, © 2007-2012, 2016-2017 Global Voices, © 2016-2017 Kaler Kantho, © 2005 Los Angeles Times - Washington Post News Service, Inc., © 2000 National Broadcasting Company, Inc., © 1999, 2005, 2006, 2010 New York Times, © 2000 Public Radio International, © 2012-2017 Satkhiranews24, © 2003, 2005-2008, 2010 The Associated Press, © 2016 The Daily Janakantha, © 2016-2017 The Daily Nayadiganta, © 2003, 2005-2008 Xinhua News Agency, © 2017, 2020 Trustees of the University of Pennsylvania

13.0 CONTACTS

Stephanie Strassel - LORELEI PI