README FILE FOR LDC CATALOG ID: LDC2025T08

TITLE: LoReHLT Uzbek Representative Language Pack

AUTHORS: Jennifer Tracey, Stephanie Strassel, Dave Graff, Jonathan Wright, Song Chen, Neville Ryant, Seth Kulick, Kira Griffitt, Dana Delgado, Michael Arrigo

1.0 Introduction

This corpus provides the complete set of monolingual and parallel text, lexicon, annotations, and tools comprising the LoReHLT Uzbek Representative Language Pack. It was developed by the Linguistic Data Consortium, and consists of over 47 million words of monolingual text in Uzbek, over 886,000 words of which have been translated into English. It also includes over 100,000 Uzbek words translated from English text, plus about 563,000 words for which existing parallel text in English was found on the internet. Over 151,000 words received simple named entity annotation, and over 28,000 words received full entity annotation (including nominals and pronouns); varying subsets also underwent noun-phrase chunking, morphological alignment, and simple semantic annotation. Details about the volume of data for each annotation type are listed in section 3.3 below.

LoReHLT (Low Resource Human Language Technology) was a companion project of the DARPA LORELEI Program (Low Resource Languages for Emergent Incidents), which was concerned with building Human Language Technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. The present package is the result of a pilot effort preceding the main LORELEI collection project; as such, its overall structure has much in common with other LORELEI language packs, but there are also some notable differences (mainly involving file name patterns and the types of annotation done).

Linguistic resources for LORELEI include Representative Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages are selected to provide broad typological coverage, while Incident Languages are selected to evaluate system performance on a language whose identity is disclosed at the start of the evaluation, and for which no training data has been provided.
For more information about LORELEI language resources, see:
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2020-lorelei-language-packs.pdf

2.0 Corpus organization

2.1 Directory Structure

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

./dtds/
./dtds/laf.v1.2.dtd
./dtds/llf.v1.6.dtd
./dtds/ltf.v1.5.dtd
./dtds/psm.v1.0.dtd
./docs/ -- contains this README, plus various tables and listings (see section 9 below)
./docs/annotation_guidelines/ -- guidelines for all annotation tasks included in this corpus
./docs/grammatical_sketch/ -- grammatical sketch of Uzbek
./tools/ -- see section 8 below for details about tools provided
./tools/ltf2txt/
./tools/ne_tagger/
./tools/pos_tagger/
./tools/sentence_segmenter/
./tools/tokenizer_analyzer/
./tools/twitter_processing/
./data/monolingual_text/zipped/ -- zip archives of ltf and psm files
./data/translation/
./data/translation/from_uzb/{uzb,eng}/ -- translations from Uzbek to English
./data/translation/from_eng/ -- translations from English to Uzbek
./data/translation/from_eng/{elicitation,news,phrasebook}/ -- one directory for each of three types of English data
./data/translation/from_eng/{elicitation,news,phrasebook}/{uzb,eng}/ -- one directory for each language; in each, "ltf" and "psm" directories contain the corresponding data files
./data/annotation/ -- see section 5 below for details about annotation
./data/annotation/entity/{simple,full}/
./data/annotation/morph_alignment/
./data/annotation/np_chunking/
./data/annotation/sem_annotation/
./data/annotation/twitter_tokenization/
./data/audio/ -- 299 FLAC and 163 mp4 audio files (YouTube, broadcast news sources)
./data/lexicon/ -- llf.xml lexicon

2.2 File Name Conventions

The file names assigned to individual documents in this corpus provide the following information about the document -- note that the LoReHLT naming conventions differ from those of the later LORELEI packages:

Genre      2-letter abbreviation
Source     3-letter label assigned to the data provider
Language   3-letter abbreviation
Index#     6-digit numeric assigned to this document
Date       8-digit numeric: YYYYMMDD (year, month, day)

Those five fields are joined by underscore characters, yielding a 26-character file-ID. (For a sketch of parsing these file-IDs programmatically, see the example at the end of section 3.1 below.) The 2-letter codes used for genre are as follows:

BN -- broadcast news (found in audio data only)
DF -- discussion forum
NW -- news
RF -- reference (e.g. Wikipedia)
SN -- social network (Twitter text, YouTube audio)
WL -- web-log

In the "./data/monolingual_text/zipped/" directory, all documents in a given genre have been placed together into a single zip archive, as follows:

DF_ALL_UZB.{ltf,psm}.zip
NW_ALL_UZB.{ltf,psm}.zip
RF_WKP_UZB.{ltf,psm}.zip (all "reference" docs are from Wikipedia)
WL_ALL_UZB.{ltf,psm}.zip

The number of data files per zip archive ranges from 12 (WL_ALL) to 123,799 (RF_WKP).

3.0 Content Summary

3.1 Monolingual Text

Genre     #Docs      #Words
DF       12,886   20,725,547
NW       49,893   17,919,194
RF      123,799    8,739,709
SN        9,783       90,422
WL           12        2,382

Note that the SN (Twitter) data cannot be distributed directly by LDC, due to the Twitter Terms of Use. The file "docs/twitter_info.tab" (described in section 8.2 below) provides the necessary information for users to fetch the particular tweets directly from Twitter. LTF files for all other genres are stored in ./data/monolingual_text/zipped/.
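As a concrete illustration of the naming scheme and the zipped LTF layout, the minimal Python sketch below lists the ltf.xml entries in one genre-level archive and splits each file-ID into its five fields. This is illustrative only, not one of the packaged tools; the regular expression (including the optional "-NN" sub-index used for the translated SN files, see section 3.2) is an assumption based on the conventions described above.

import re
import zipfile

# GENRE_SOURCE_LANG_INDEX_DATE, plus an optional "-NN" sub-index that is
# used only for the translated SN (Twitter) files described in section 3.2.
FILE_ID = re.compile(
    r'(?P<genre>[A-Z]{2})_(?P<source>[A-Z0-9]{3})_(?P<lang>[A-Z]{3})_'
    r'(?P<index>\d{6})_(?P<date>\d{8})(?:-(?P<sub>\d{2}))?'
)

with zipfile.ZipFile('data/monolingual_text/zipped/NW_ALL_UZB.ltf.zip') as zf:
    for name in zf.namelist():
        if name.endswith('.ltf.xml'):
            m = FILE_ID.search(name)
            if m:
                print(m.group('genre'), m.group('source'), m.group('date'))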
3.2 Parallel Text

Type      Genre   #Docs    #Words
Found     NW      1,955   563,385
FromEng   EL          3    29,077
FromEng   NW        198    71,174
ToEng     DF        501   226,986 *
ToEng     NW      1,218   583,417
ToEng     SN      8,308    74,085
ToEng     WL         12     2,382

* Note: Four of the DF documents (listed below) were truncated before being submitted for translation into English; the data/translation/from_uzb/uzb/ directory contains the truncated versions of these ltf and psm files, while data/monolingual_text/zipped/DF_ALL_UZB.*.zip contain the corresponding original (not truncated) versions:

DF_ISL_UZB_165220_20140900
DF_ZIY_UZB_165227_20140900
DF_OZO_UZB_165246_20140900
DF_OKR_UZB_165229_20140900

Again, because LDC cannot distribute original Twitter data, we present only the English translations for 8,308 Tweets: SN_TWT_UZB_*.ltf.xml files exist under "./data/translation/from_uzb/eng/" only.

Note that the SN file inventory was originally organized in groups, such that each group was assigned a distinct 6-digit index number for the 4th field of the file name, and held up to 30 Tweets. In order to present each translated Tweet as a separate data file, we have appended an additional 2-digit index number at the end of each file name -- e.g.:

SN_TWT_UZB_190841_20140900-00.eng.ltf.xml
SN_TWT_UZB_190841_20140900-01.eng.ltf.xml
...
SN_TWT_UZB_190841_20140900-28.eng.ltf.xml
SN_TWT_UZB_190842_20140900-00.eng.ltf.xml
...
SN_TWT_UZB_191130_20140900-28.eng.ltf.xml

Each full tweet is presented as the sole element in one ltf.xml file. There are no paragraph or sentence boundaries in twitter text, so there are no SN_TWT_*.psm.xml files.

3.3 Annotation

AnnotationType    Genre   #Docs   #Words
EntityFull        DF         14    9,271
EntityFull        NW         47   17,963
EntityFull        SN        122    1,370
EntitySimple      DF         39   26,764
EntitySimple      NW        316  121,152
EntitySimple      SN        330    3,834
MorphAlign        DF         12    2,684
MorphAlign        NW         19    5,493
NPChunking        DF         12    3,927
NPChunking        NW         30    8,614
NPChunking        SN         55      641
SimpleSemantic    DF         17    6,634
SimpleSemantic    EL          1      298 *
SimpleSemantic    NW         42   13,964

* Note: For semantic annotation, a sample of just 70 segments was selected from the full content (2600 segments) of BOLT_Elicitation.uzb (translated from English into Uzbek); the word count shown here is for the sample.

4.0 Data Collection and Parallel Text Creation

Both monolingual text collection and parallel text creation involve a combination of manual and automatic methods. These methods are described in the sections below.

4.1 Monolingual Text Collection

Data is identified for collection by native speaker "data scouts," who search the web for suitable sources, designating individual documents that are in the target language and discuss the topics of interest to the LORELEI program (humanitarian aid and disaster relief). Each document selected for inclusion in the corpus is then harvested, along with the entire website when suitable. Thus the monolingual text collection contains some documents which have been manually selected and/or reviewed, and many others which have been automatically harvested and were not subject to manual review.

4.2 Parallel Text Creation

Parallel text for LORELEI was created using three different methods, and each LORELEI language may have parallel text from one or all of these methods. In addition to translation from each of the LORELEI languages to English, each language pack contains a "core" set of English documents that were translated into each of the LORELEI Representative Languages.
These documents consist of news documents, a phrasebook of conversational sentences, and an elicitation corpus of sentences designed to elicit a variety of grammatical structures.

All translations are aligned at the sentence level. For professional and crowdsourced translation, the segments align one-to-one between the source and target language (i.e. segment 1 in the English aligns with segment 1 in the source language). For found parallel text, automatic alignment is performed and a separate alignment file provides information about how the segments in the source and translation are aligned. Professionally translated data has one translation for each source document, while crowdsourced translations have up to four translations for each source document, designated by A, B, C, or D appended to the file names of the multiple translation versions.

5.0 Annotation

Five types of annotation are present in this corpus:

- Simple Named Entity tags names of persons, organizations, geopolitical entities, and locations (including facilities).

- Full Entity also tags nominal and pronominal mentions of entities.

- Noun Phrase Chunking identifies the positions and extents of noun phrases.

- Simple Semantic Annotation provides light semantic role labeling, capturing acts and states along with their arguments.

- Morphological Alignment provides parallel text (in ltf.xml format) with detailed morphological labeling on word tokens, plus a corresponding set of alignment listings that identify (where possible) the relations of Uzbek and English morphemes. See docs/uzb_morph_alignment_README.txt for details.

Results of the first four annotation types are stored in LAF XML format (see section 7.3 below), with annotations for one document in each XML file. Details about each of these annotation tasks can be found in docs/annotation_guidelines/.

SPECIAL NOTE ABOUT ANNOTATIONS ON TWITTER DATA:

The LDC cannot redistribute text data from Twitter, and this includes files containing annotation. Where LAF XML and annotation table files for other sources contain the original strings of text, annotations of Twitter data instead contain strings in which every non-white-space character has been replaced by an underscore ("_").

Software is included in this release that enables users to download a given list of Tweets (assuming the Tweets are still available online), and to apply the same conditioning and reformatting that was done by LDC prior to annotation -- see section 8.2 below for more details on the software. In order to confirm that your own download and conditioning yields results that match those of the LDC, we provide a set of LTF XML files (one for each annotated Tweet) in which the text content has been modified by replacing each non-white-space character with an underscore ("_"), so that character offsets are preserved for word tokens and spans of annotations. These "placeholder" LTF XML files are in data/annotation/twitter_tokenization/.

6.0 Data Processing and Character Normalization for LORELEI

Most of the content has been harvested from various web sources using an automated system that is driven by manual scouting for relevant material. Some content may have been harvested manually, or by means of ad-hoc scripted methods for sources with unusual attributes. All harvested content was initially converted from its original HTML form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure.
The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language. 7.0 Overview of XML Data Structures 7.1 PSM.xml -- Primary Source Markup Data The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way, and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element. For example, in a discussion-forum or web-log page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post. After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.), and the character offsets of the text span comprising its content in the corresponding rsd.txt file. 7.2 LTF.xml -- Logical Text Format Data The "ltf.xml" data format is derived from rsd.txt, and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and the tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are only unique within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token. The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data. To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and (we hope) less often on the side of asserting false sentence breaks. The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags). To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning). Software is included to convert ltf.xml files to "raw source data" plain text files ("rsd.txt") -- see section 8.1 below. The character offsets used in LTF and LAF xml, and in other types of annotation data, are based on the "rsd.txt" files, which contain just the text that is visible to a person reading the original source, with normalized white-space characters (including line breaks), but without markup of any kind. 
7.3 LAF.xml -- Logical Annotation Format Data

The "laf.xml" data format provides a generic structure for presenting annotations on the text content of a given ltf.xml file; see the associated DTD file in the "dtds" directory. Note that each type of annotation (simple named entity, full entity, simple semantic annotation) uses the basic XML elements of LAF in different ways.

7.4 LLF.xml -- LORELEI Lexicon Format Data

The "llf.xml" data format is a simple structure for presenting citation-form words (headwords or lemmas) in Uzbek, together with Part-Of-Speech (POS) labels and English glosses. Each ENTRY element contains a unique combination of LEMMA value (citation form in native orthography) and POS value, together with one or more GLOSS elements. Each ENTRY has a unique ID, which is included as part of the unique ID assigned to each GLOSS.

8.0 Software tools included in this release

Each of the software components summarized below contains its own README file or other documentation, which should be consulted for more detailed usage information. Note that the versions of software provided here are consistent with the original package release to LORELEI project participants in 2015; in later LORELEI releases, the software was updated and reorganized to handle various changes in corpus handling and design (e.g. to use a different file name format). This software is being provided in hopes that it will be informative, but without warranty as to its usability.

8.1 "ltf2txt" (source code written in Perl)

A data file in ltf.xml format (as described above) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data. In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or you can simply view the script content directly to read the documentation. Also, running either script without any command-line arguments will cause it to display a one-line synopsis of its usage, and then exit.

ltf2ma.perl     -- convert ltf.xml files to ma_tkn.txt (morph-alignment)
ltf2rsd.perl    -- convert ltf.xml files to rsd.txt (raw-source-data)
ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

Special note about Twitter data: as explained in section 5 above, this corpus includes "scrubbed" versions of the LTF XML files for individual Tweets, where the original text characters (except for spaces) are replaced by underscores (in data/annotation/twitter_tokenization/), in order to comply with the Twitter Terms of Use. Running "ltf2rsd.perl" directly on these "scrubbed" files will yield warnings about MD5 mismatches; this is to be expected, because the MD5 value stored in each Twitter LTF XML file is based on the original text.
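For readers who want to replicate that checksum comparison outside of Perl, a minimal Python sketch of the same MD5 check follows. It assumes the reference checksum is stored in a DOC attribute named raw_text_md5; the actual attribute names are defined in ./dtds/ltf.v1.5.dtd and should be verified there.

import hashlib
import xml.etree.ElementTree as ET

def rsd_matches(ltf_path, rsd_path):
    # Minimal sketch of the validation performed by ltf2rsd.perl: hash a
    # reconstructed rsd.txt file and compare it to the reference checksum
    # stored on the DOC element. The attribute name "raw_text_md5" is an
    # assumption; see ./dtds/ltf.v1.5.dtd for the authoritative definition.
    doc = next(ET.parse(ltf_path).iter('DOC'))
    ref_md5 = doc.get('raw_text_md5')
    with open(rsd_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest() == ref_md5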
After using the "ldclib" software (described in the next section) to download and condition Twitter data, the resulting LTF XML files should have both the original text and the matching MD5 values; that process also creates the corresponding rsd.txt files. 8.2 twitter_processing This directory contains a README file, and executable script written in Ruby, and supporting files (Gemfile and a lib/ directory). Refer to the README file for details on using these scripts. Due to the Twitter Terms of Use, the text content of individual tweets cannot be redistributed by the LDC. As a result, users must download the tweet contents directly from Twitter. The twitter-processing software provided in the tools/ directory enables users to perform the same normalization applied by LDC and ensure that the user's version of the tweet matches the version used by LDC, by verifying that the md5sum of the user-downloaded and processed tweet matches the md5sum provided in the twitter_info.tab file. Users must have a developer account with Twitter in order to download tweets, and the tool does not replace or circumvent the Twitter API for downloading tweets. The ./docs/twitter_info.tab file provides the twitter download id for each tweet, along with the LORELEI file name assigned to that tweet, the numeric ID of the tweet author, and the md5sum of the processed text from the tweet. 8.3 sentence_segmenter -- apply sentence segmentation to raw text The Python and Ruby scripts in this directory are used to apply sentence boundary detection to text. Please refer to the README.txt file included with the package. 8.4 ne_tagger -- Named-Entity tagger for Uzbek Please refer to the tools/ne_tagger/README.txt file for information about usage and performance. 8.5 morph_analyzer There are three README files in this directory to explain the usage of the software: tools/morph_analyzer/README.txt tools/morph_analyzer/engine/README.txt tools/morph_analyzer/foma/README.txt 8.6 pos_tagger This contains source code in Python for doing part-of-speech tagging on text data, using ltf.xml as input. Please refer the README.txt file in this directory for information about usage and performance. 8.7 encoding This directory has executable scripts written in Ruby, along with a configuration file, to handle conversions between Latin and Cyrillic character sets, with special attention to Uzbek-specific rules involving apostrophe-like orthographic marks. Please refer to the tools/encoding/README.txt file for information about usage. 9.0 Documentation included in this release The ./docs folder (relative to the root directory of this release) contains the following: audio_info.tab - lists the 462 *.flac and *.mp4 files in ./data/audio/, showing their channel count, sample rate, duration, and topic(s). {BOLT,LCTL}_elicitation_template.txt - list the 2600 and 3126 English phrases, respectively, used to create the elicitation portion of the English-to-Uzbek translation data. Each file is organized as a sequence of blank-line-separated "paragraphs", with each paragraph containing the segment-ID (i.e. "segment-0" .. "segment-2599"), the English sentence to be translated into the target language, and supplemental context information (if any) to guide the translation. 
The Uzbek corpus differs from other LORELEI Representative language packs in having two distinct elicitation templates rather than one; this stems from the fact that the current Uzbek corpus, developed under the DARPA "Broad Operational Language Translation" (BOLT) program, was built on top of a closely related collection that had been done previously under an earlier program called "Less Commonly Taught Languages" (LCTL). The two templates happen to have 312 English phrases in common, and a few dozen of these phrases are presented with two or more distinct "context:" values in one or both lists, but because the two lists were translated independently, years apart and by different translators, the corresponding translations into Uzbek are generally not fully identical for a given English phrase and context.

source_codes.txt - a four-column table listing the distinct 3-letter codes that identify data sources: values from the 2nd field of data file names (e.g. "VOA") appear in the 2nd column of this table. For each source, the first column shows the genre (some sources yielded data in multiple genres, and so those source codes appear in multiple rows); the 3rd and 4th columns contain the full name and base URL of the source. (Base URLs were not retained for WL sources.)

urls.tab - a two-column table relating document file-IDs to the web URLs that were used to download them. Because data collection for this package was done as a pilot project for LORELEI, document URLs were retained only for NW sources.

uzb_morph_alignment_README.txt - explains the creation, guidelines and usage of the morphological alignment annotation, which was done on 29 Uzbek data files and their English translations.

uzb_morph_analysis_files.txt - lists the names of the 36 files that underwent manual morphological analysis and part-of-speech tagging, with humans correcting automatic analysis. The morphological and POS annotations are included as attributes in the LTF data format as described above. Full details on the tagset used are found in the annotation_guidelines directory. Note that there may be discrepancies between this tagset and the categories described in the grammatical sketch.

twitter_info.tab - contains tab-separated columns: doc uid, tweet id, normalized md5 of the tweet text, and tweet author id for all tweets designated for use in this language pack.

In addition, the grammatical sketch and annotation guidelines mentioned in earlier sections of this README are found in this directory.

10.0 Acknowledgements

The authors would like to acknowledge the following contributors to this corpus: Brian Gainor, Ann Bies, Justin Mott, Neil Kuster, University of Maryland Applied Research Laboratory for Intelligence and Security (ARLIS), formerly UMD Center for Advanced Study of Language (CASL), and our team of Uzbek annotators.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
12.0 Copyright

Portions © 2005 12us.com, © 2012 21Asr.uz, © 2005 sof-olam.6te.net, © 2013 ajoyib.net, © 2014 albuxority.com, © 2013 amuziyo.com, © 2014 anon.uz, © 2012, 2014 ARXIV, © 2013 biznes.daily.uz, © 2014 BePuL.NeT, © 2013 bil.uz, © 2013 bizstrener.uz, © 2013, 2014 AKIpress News Agency, © 2013, 2014 championat.asia, © 2013 diyormedia.uz, © 2014 darakchi.uz, © 2009, 2011 Daryo, © 2013 Distlik Bayrogi, © 2010 econews.uz, © 2014 DMP under MDPE, © 2012 Facebook, © 2007, 2011 Ferghana News Agency, Moscow, © 2014 Gooper.uz, © 2004-2006 Harakat, © 2014 Uzbek Huquq, © 2011 Huquq, © 2012 Human Rights Society of Uzbekistan, © 2014 Huquq Burch, © 2012 intiqom.uz, © 2009 Islambio.com, © 2006 islom.uz, © 2010 jamiyatgzt.uz, © 2012 kamolon.uz, © 2014 Kokand, © 2011-2014 Kun.uz, © 2013 LUKOIL Uzbekistan Operating Company LLC, © 2004, 2006 Marifat, © 2014 megauz.uz, © 2013 Medislam, © 2009 Centre of Hydrometeorological Service at Cabinet Ministers of the Republic of Uzbekistan (Uzhydromed), © 2014 Mulkdor.com, © 2010 Mohiyat, © 2014 Mp3lar.com, © 2014 Muloqot, © 2012 muslimaat.uz, © 2013 Vatanparvar, © 2001, 2012 Ozbekiston Elektron Ommaviy Axborot Vositalari Milliy Assotsiatsiyasi, © 2011 Navoiy Press, © 2014 news24.uz, © 2014 Oila Davrasida, © 2013 Odnoklassniki, © 2013 Olam Asia, © 2009 oriftolib.uz, © 2014 pressnews.uz, © 2013 Qadriyat.uz, © 2012 Qashqadaryogz, © 2014 Karachik, © 2014 Questpedia, © 2014 quvnoq.com, © 2011 Rambler, © 2014 Sadolar.net, © 2011, 2013 Uzfunfactory & Sayyod Media Group, © 2013 The GEF Small Grants Program, © 2014 shejot.com, © 2014 Shamsutdinovs Business Group, © 2012 Software.uz, © 2014 Soglik.Uz, © 2014 Soyabon Group, © 2014 Sports.uz, © 2014 TDPU, © 2009, 2010 Termiz Okshomi, © 2008 Tashkentskaya Pravda, © 2014 Tarona.net, © 2014 Takewap Group, © 2012 Uzbegim, © 2014 Uzclub.Net, © 2013 Embassy of the Republic of Uzbekistan to the United Kingdom of Great Britain and Northern Ireland, © 2013 Uzbek.Fm, © 2012 Uzbekislam.com, © 2014 United Nations Development Programme, © 2012 UZBnews, © 2014 uz24.uz, © 2007-2012 UzA, © 2012 Uzbaby.uz, © 2011 UzCinema, © 2012 CDMEP, © 2009 usfayl.com, © 2010, 2013 uzhurriyat.com, © 2011, 2012 uskinozal.com, © 2013 UzLider.Mobi, © 2007, 2011 UZNEWS.NET, © 2014 Uzbekistan news - UzReport.uz, © 2010 Public Health of Uzbekistan, © 2011, 2014 us-world.ru, © 2012, 2014 Vatandosh, Inc., © 2013 viloyat-arm.uz, © 2014 mirjahon.weebly.com, © 2012 www.welcomebackuz.com, © 2013, 2014 Qulnoma, © 2011 xabar.org, © 2011, 2014 xayol.uz, © 2012 xorazamtibbiyoti.com, © 2014 xs.uz, © 2012 Yangi Dunya, © 2013 MoDISaNyntymak, © 2013 zamondosh.uz, © 2014 www.zamonaviy.uz, © 2002-2007, 2009-2010 Agence France Presse, © 2000 American Broadcasting Company, © 2000 Cable News Network LP, LLLP, © 2008 Central News Agency (Taiwan), © 1989 Dow Jones & Company, Inc., © 2005 Los Angeles Times - Washington Post News Service, Inc., © 2000 National Broadcasting Company, Inc., © 1999, 2005, 2006, 2010 New York Times, © 2000 Public Radio International, © 2003, 2005-2008, 2010 The Associated Press, © 2003, 2005-2008 Xinhua News Agency, © 2014 Trustees of the University of Pennsylvania

13.0 CONTACTS

Dana Delgado - LORELEI Project Manager