
LoReHLT Uzbek Representative Language Pack

Item Name:	LoReHLT Uzbek Representative Language Pack
Author(s):	Jennifer Tracey, Stephanie Strassel, David Graff, Jonathan Wright, Song Chen, Neville Ryant, Seth Kulick, Dana Delgado, Michael Arrigo
LDC Catalog No.:	LDC2025T08
ISLRN:	370-274-581-227-7
DOI:	https://doi.org/10.35111/t5qx-jc85
Release Date:	July 15, 2025
Member Year(s):	2025
DCMI Type(s):	Software, Sound, Text
Sample Type:	flac, mp4
Sample Rate:	16000, 44100
Data Source(s):	broadcast news, discussion forum, newsgroups, newswire, web collection, weblogs
Project(s):	BOLT, LORELEI
Application(s):	cross-language transfer, entity extraction, information extraction, machine translation
Language(s):	English, Uzbek
Language ID(s):	eng, uzb
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2025T08 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Tracey, Jennifer, et al. LoReHLT Uzbek Representative Language Pack LDC2025T08. Web Download. Philadelphia: Linguistic Data Consortium, 2025.
Related Works:	hasPart: LDC2026T06 LORELEI Multiway Translated Text; isSimilarWith: LDC2020T11 LORELEI Oromo Incident Language Pack, LDC2020T24 LORELEI Ukrainian Representative Language Pack, LDC2020T22 LORELEI Tigrinya Incident Language Pack, LDC2021T02 LORELEI Akan Representative Language Pack, LDC2022T01 LORELEI Kinyarwanda Incident Language Pack, LDC2022T03 LORELEI Wolof Representative Language Pack, LDC2023T07 LORELEI Indonesian Representative Language Pack, LDC2023T02 LORELEI Tagalog Representative Language Pack, LDC2023T03 LORELEI Tamil Representative Language Pack, LDC2023T06 LORELEI Zulu Representative Language Pack, LDC2023T01 LORELEI Swahili Representative Language Pack, LDC2024T01 LORELEI Farsi Representative Language Pack, LDC2024T10 LORELEI Yoruba Representative Language Pack, LDC2024T03 LoReHLT Hausa Representative Language Pack, LDC2025T01 LORELEI Hungarian Representative Language Pack, LDC2025T12 LORELEI Hindi Representative Language Pack, LDC2026T01 LORELEI Russian Representative Language Pack, LDC2026T03 LORELEI Somali Representative Language Pack

Introduction

LoReHLT Uzbek Representative Language Pack consists of Uzbek monolingual text, Uzbek-English parallel text, annotations, audio recordings, supplemental resources and related software tools developed by the Linguistic Data Consortium for LoReHLT, a companion project of the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

Data

Uzbek is spoken across Central Asia and is the official language of Uzbekistan.

This release is the result of a pilot effort preceding the LORELEI program. Text data was collected in the following genres: news, discussion forum, reference, social network, and weblogs. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods. Also collected were broadcast news recordings and amateur web audio recordings related to disaster events covered in the text data.

Data volumes are as follows:

47 million words of Uzbek monolingual text, over 886,000 of which were translated into English
563,000 words of found Uzbek-English parallel text
100,000 Uzbek words translated from English text
6.41 hours of Uzbek audio recordings (broadcast news, amateur web recordings)

Approximately 151,000 words were annotated for named entities, and over 28,000 words received full entity annotation, including nominals and pronouns. Noun-phrase chunking was applied to more than 13,000 words, and over 20,890 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings.

Lexical resources and software tools are also included in this release. The tools recreate original source data from the processed XML material, condition text data that users download from Twitter, apply sentence segmentation to raw text, and support named entity tagging.

Monolingual and parallel text are presented in XML with associated DTDs. Annotation data is presented as tab-delimited files or XML. All text is UTF-8 encoded. The audio recordings are presented in FLAC-compressed MS-WAV and .mp4 formats.
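
As a rough illustration of working with these formats, the Python sketch below reads tab-delimited rows and pulls text out of XML elements using only the standard library. It is a minimal sketch, not the tooling shipped with the release: the function names, the `SEG` element name, and the sample column layout are all assumptions made for the example. The actual element names and columns are defined by the DTDs and documentation included in the corpus.

```python
import csv
import io
import xml.etree.ElementTree as ET


def read_tab_rows(stream):
    """Yield each non-empty row of a tab-delimited text stream as a list of fields."""
    # QUOTE_NONE keeps quote characters as literal data instead of parsing them.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # csv yields [] for blank lines; skip those
            yield row


def collect_element_text(xml_source, tag):
    """Return the stripped text of every element named `tag` in an XML string."""
    root = ET.fromstring(xml_source)
    return [(el.text or "").strip() for el in root.iter(tag)]


if __name__ == "__main__":
    # Invented stand-in data (hypothetical columns and element names).
    # Real files would be opened with open(path, encoding="utf-8", newline="").
    sample_tab = "doc1\t0\t8\tToshkent\tGPE\n"
    print(list(read_tab_rows(io.StringIO(sample_tab))))

    sample_xml = "<DOC id='doc1'><SEG>Toshkent shahri</SEG></DOC>"
    print(collect_element_text(sample_xml, "SEG"))
```

Opening files explicitly with `encoding="utf-8"` avoids depending on the platform default encoding, which matters for Uzbek text containing non-ASCII characters.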

Sponsorship

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

Samples

Please view these samples:

Updates

No updates at this time.

Copyright

Portions © 2005 12us.com, © 2012 21Asr.uz, © 2002-2007, 2009-2010 Agence France Presse, © 2013 ajoyib.net, © 2013, 2014 AKIpress News Agency, © 2014 albuxority.com, © 2000 American Broadcasting Company, © 2013 amuziyo.com, © 2014 anon.uz, © 2012, 2014 ARXIV, © 2014 BePuL.NeT, © 2013 bil.uz, © 2013 biznes.daily.uz, © 2013 bizstrener.uz, © 2000 Cable News Network LP, LLLP, © 2012 CDMEP, © 2008 Central News Agency (Taiwan), © 2009 Centre of Hydrometerological Service at Cabinet Ministers of the Republic of Uzbekistan (Uzhydromed), © 2013, 2014 championat.asia, © 2014 darakchi.uz, © 2009, 2011 Daryo, © 2013 Distlik Bayrogi, © 2013 diyormedia.uz, © 2014 DMP under DPE, © 1989 Dow Jones & Company, Inc., © 2010 econews.uz, © 2013 Embassy of the Republic Uzbekistan to the United Kingdom of Great Britain and Northern Ireland, © 2007, 2011 Ferghana News Agency, Moscow, © 2007-2010, 2012-2014 Google LLC, © 2014 Gooper.uz, © 2004-2006 Harakat, © 2012 Human Rights Society of Uzbekistan, © 2011 Huquq, © 2014 Huquq Burch, © 2012 intiqom.uz, © 2009 Islambio.com, © 2006 islom.uz, © 2010 jamiyatgzt.uz, © 2012 kamolon.uz, © 2014 Karachik, © 2014 Kokand, © 2011-2014 Kun.uz, © 2005 Los Angeles Times - Washington Post News Service, Inc., © 2013 LUKOIL Uzbekistan Operating Company LLC, © 2004, 2006 Marifat, © 2013 Medislam, © 2014 megauz.uz, © 2014 mirjahon.weebly.com, © 2013 MoDISaNyntymak, © 2010 Mohiyat, © 2014 Mp3lar.com, © 2014 Mulkdor.com, © 2014 Muloqot, © 2012 muslimaat.uz, © 2000 National Broadcasting Company, Inc., © 2014 National Television and Radio Company of Uzbekistan, © 2011 Navoiy Press, © 2014 news24.uz, © 1999, 2005, 2006, 2010 New York Times, © 2013 Odnoklassniki, © 2014 Oila Davrasida, © 2013 Olam Asia, © 2009 oriftolib.uz, © 2001, 2012 Ozbekiston Elektron Ommaviy Axborot Vositalari Milliy Assotsiatsiyasi, © 2014 pressnews.uz, © 2010 Public Health of Uzbekistan, © 2000 Public Radio International, © 2013 Qadriyat.uz, © 2012 Qashqadaryogz, © 2014 Questpedia, © 2013, 2014 Qulnoma, © 2014 quvnoq.com, © 2011 Rambler, © 2014 Sadolar.net, © 2014 Shamsutdinovs Business Group, © 2014 shejot.com, © 2005 sof-olam.6te.net, © 2012 Software.uz, © 2014 Soglik.Uz, © 2014 Soyabon Group, © 2014 Sports.uz, © 2014 Takewap Group, © 2014 Tarona.net, © 2008 Tashkentskaya Pravda, © 2014 TDPU, © 2009, 2010 Termiz Okshomi, © 2003, 2005-2008, 2010 The Associated Press, © 2013 The GEF Small Grants Program, © 2009 usfayl.com, © 2011, 2012 uskinozal.com, © 2011, 2014 us-world.ru, © 2014 uz24.uz, © 2007-2012 UzA, © 2012 Uzbaby.uz, © 2012 Uzbegim, © 2013 Uzbek.Fm, © 2014 Uzbek Huquq, © 2012 Uzbekislam.com, © 2014 Uzbekistan news- UzReport.uz, © 2012 UZBnews, © 2014 Uzclub.Net, © 2011 UzCinema, © 2011, 2013 Uzfunfactory & Sayyod Media Group, © 2010, 2013 uzhurriyat.com, © 2013 UzLider.Mobi, © 2007, 2011 UZNEWS.NET, © 2012, 2014 Vatandosh, Inc., © 2013 Vatanparvar, © 2013 viloyat-arm.uz, © 2012 www.welcomebackuz.com, © 2014 www.zamonaviy.uz, © 2011 xabar.org, © 2011, 2014 xayol.uz, © 2003, 2005-2008 Xinhua News Agency, © 2012 xorazamtibbiyoti.com, © 2014 xs.uz, © 2012 Yangi Dunya, © 2013 zamondosh.uz, © 2014, 2025 Trustees of the University of Pennsylvania