MATERIAL Kazakh-English Language Pack

Item Name: MATERIAL Kazakh-English Language Pack
Author(s): Gulnar Bekkozhanova, Aric Bills, Sarra Chouder, Vanessa Jaralve, Cassian Corey, Eyal Dubinski, Corinna Ellis, Paul Gibby, Michael Kazi, Julie Lam, Hanh Le, Nicolas Malyska, Giorgia Marcucci, Sarah Marvi, Sara McConnell, Jennifer Melot, Alyssa Mensch, Michelle Morrison, Shelley Paget, Katerina Ramizo, Frederick Richardson, Annette Roberts, Carl Rubino, Gulnar Sarseke, Zharas Taubayev
LDC Catalog No.: LDC2025S03
ISLRN: 798-646-667-992-4
DOI: https://doi.org/10.35111/k4ey-kj75
Release Date: April 15, 2025
Member Year(s): 2025
DCMI Type(s): Sound, Text
Sample Type: alaw
Sample Rate: 8000
Data Source(s): telephone conversations
Application(s): information retrieval, speech recognition
Language(s): English, Kazakh
Language ID(s): eng, kaz
License(s): MATERIAL Kazakh-English Agreement (For-Profit)
MATERIAL Kazakh-English Agreement (Non-Member)
MATERIAL Kazakh-English Agreement (Not-For-Profit)
Online Documentation: LDC2025S03 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Bekkozhanova, Gulnar, et al. MATERIAL Kazakh-English Language Pack LDC2025S03. Web Download. Philadelphia: Linguistic Data Consortium, 2025.

Introduction

MATERIAL Kazakh-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations and queries.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

Data

The Kazakh speech in this release represents that spoken in the Northern and Southern dialect regions of Kazakhstan. Speakers were 18 years of age or older. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts cover approximately 17% of the speech data, and all of the transcribed speech was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.

MATERIAL Kazakh-English Language Pack also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to the query search terms.

Speech data is presented mostly as two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. Some WAV files are 48 kHz PCM. All text data is UTF-8 encoded.
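Because the WAV files mix 8 kHz A-law and 48 kHz PCM audio, it can help to check each file's header before processing. The following is a minimal Python sketch, not part of the release: it reads the RIFF "fmt " chunk directly, since Python's standard wave module generally accepts only PCM data. The function names read_wav_format and make_header are hypothetical, and the headers used in the demonstration are synthetic stand-ins, not files from the corpus.

```python
import io
import struct

# Standard RIFF/WAVE format tags (1 = PCM, 6 = A-law, 7 = mu-law).
WAVE_FORMAT_NAMES = {1: "PCM", 6: "A-law", 7: "mu-law"}


def read_wav_format(stream):
    """Return (codec name, channel count, sample rate) from a RIFF/WAVE stream."""
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            tag, channels, rate = struct.unpack("<HHI", fmt[:8])
            name = WAVE_FORMAT_NAMES.get(tag, f"unknown ({tag})")
            return name, channels, rate
        # Skip other chunks; chunk bodies are padded to an even length.
        stream.seek(chunk_size + (chunk_size & 1), 1)


def make_header(tag, channels, rate, bits):
    """Build a synthetic WAV header with an empty data chunk (demo only)."""
    block_align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate,
                      rate * block_align, block_align, bits)
    body = (b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
            + b"data" + struct.pack("<I", 0))
    return b"RIFF" + struct.pack("<I", len(body)) + body


# Synthetic examples: a two-channel 8 kHz A-law header and a 48 kHz PCM header.
# The channel count of the PCM example is an assumption for illustration.
alaw_info = read_wav_format(io.BytesIO(make_header(6, 2, 8000, 8)))
pcm_info = read_wav_format(io.BytesIO(make_header(1, 1, 48000, 16)))
print(alaw_info)
print(pcm_info)
```

For a file on disk, the same function can be called on a handle opened in binary mode, for example open(path, "rb"). SPHERE files use a different, text-based header and are not handled by this sketch.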

Updates

None at this time.
