MATERIAL Bulgarian-English Language Pack

Item Name:	MATERIAL Bulgarian-English Language Pack
Author(s):	Aric Bills, Judith Bishop, Anne Boyle, Sarra Chouder, Nathaniel Clair, Tom Conners, Cassian Corey, Kristina Cronin, Eyal Dubinski, Corinna Ellis, Paul Gibby, Simon Hammond, Guia Hidalgo, Alice Kaiser-Schatzlein, Dagmara Kalnins, Michael Kazi, Julie Lam, Rosie Lazar, Hanh Le, Nicolas Malyska, Olivia Medel, Jennifer Melot, Alyssa Mensch, Alex Moore, Michelle Morrison, Shelley Paget, Alston Raymer, Fred Richardson, Hristina Ridgway, Annette Roberts, Carl Rubino, Kenneth Saw, Sinney Shen, Stephanie Soh, Jonathan Taylor, Brian Thompson, Audrey Tong, Richard Tong, Mariana Williams, Julie Yelle, Jennifer Yu, Yoanna Zavora, Ilya Zavorin
LDC Catalog No.:	LDC2024S07
ISLRN:	450-346-825-481-3
DOI:	https://doi.org/10.35111/fs0v-4606
Release Date:	July 15, 2024
Member Year(s):	2024
DCMI Type(s):	Sound, Text
Sample Type:	alaw
Sample Rate:	8000
Data Source(s):	telephone conversations
Application(s):	information retrieval, speech recognition
Language(s):	Bulgarian, English
Language ID(s):	bul, eng
License(s):	MATERIAL Bulgarian-English Agreement (For-Profit), MATERIAL Bulgarian-English Agreement (Non-Member), MATERIAL Bulgarian-English Agreement (Not-For-Profit)
Online Documentation:	LDC2024S07 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Bills, Aric, et al. MATERIAL Bulgarian-English Language Pack LDC2024S07. Web Download. Philadelphia: Linguistic Data Consortium, 2024.
Related Works:	isSimilarWith: LDC2024S10 MATERIAL Somali-English Language Pack, LDC2024S13 MATERIAL Farsi-English Language Pack, LDC2025S01 MATERIAL Georgian-English Language Pack, LDC2025S03 MATERIAL Kazakh-English Language Pack, LDC2026S01 MATERIAL Swahili-English Language Pack, LDC2026S05 MATERIAL Tagalog-English Language Pack

Introduction

MATERIAL Bulgarian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 78 hours of Bulgarian conversational telephone speech, transcripts, English translations, annotations and queries.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

Data

The Bulgarian speech in this release represents the Western and Eastern dialects. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts cover approximately 40% of the speech files, and approximately 10% of the speech files were translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.

The language pack also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to query search terms.

Speech data is presented as either two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. All text data is UTF-8 encoded.

Samples

Please view the following samples:

Updates

None at this time.

Copyright

Portions © 2024 U.S. Government, © 2024 Trustees of the University of Pennsylvania

The U.S. Government acquired this data from Appen, which assigned the copyright in the data to the U.S. Government.