MATERIAL Swahili-English Language Pack

Item Name: MATERIAL Swahili-English Language Pack
Author(s): Nicola Amott, Aric Bills, Judith Bishop, Anne Boyle, Sarra Chouder, Nathaniel Clair, Tom Conners, Cassian Corey, Eyal Dubinski, Corinna Ellis, Paul Gibby, Simon Hammond, Luke Hartwig, Maxime Hubert, Vivian Lusweti, Tina Semiti Magembe, Valerie Novak, Maureen Oluoch, Cynthia Onyango, Bella Yahuma, Julie Yelle, Jennifer Yu, Ilya Zavorin
LDC Catalog No.: LDC2026S01
ISLRN: 740-824-029-409-4
DOI: https://doi.org/10.35111/h4s6-3y31
Release Date: January 15, 2026
Member Year(s): 2026
DCMI Type(s): Sound, Text
Sample Type: alaw
Sample Rate: 8000
Data Source(s): telephone conversations
Application(s): information retrieval, speech recognition
Language(s): English, Swahili
Language ID(s): eng, swa
License(s): MATERIAL Swahili-English Agreement (For-Profit), MATERIAL Swahili-English Agreement (Non-Member), MATERIAL Swahili-English Agreement (Not-For-Profit)
Online Documentation: LDC2026S01 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Amott, Nicola, et al. MATERIAL Swahili-English Language Pack LDC2026S01. Web Download. Philadelphia: Linguistic Data Consortium, 2026.

Introduction

MATERIAL Swahili-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 112 hours of Swahili conversational telephone speech, transcripts, English translations, annotations and queries.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

Data

The Swahili speech in this release represents the variety spoken in the Nairobi dialect region of Kenya. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 69 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle.

Transcripts cover approximately 30% of the speech data, and approximately 3% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.

The language pack also includes domain annotations, English queries, and relevance annotations for those queries.

Speech data is presented as either two-channel WAV or single-channel SPHERE files, predominantly in 8 kHz A-law format. Some files are single-channel at 48 kHz. All text data is UTF-8 encoded.
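
For readers unfamiliar with A-law audio, the following minimal sketch shows how 8-bit A-law samples expand to 16-bit linear PCM. It assumes the samples follow standard ITU-T G.711 A-law coding and that the bytes passed in are raw sample data with any WAV or SPHERE header already removed. The function names alaw_to_linear and decode_alaw are illustrative and are not part of this release or its documentation.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa, shifted into position
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 0x008            # first segment: half-step offset only
    else:
        magnitude += 0x108            # add the implicit leading bit plus half-step
        magnitude <<= segment - 1     # scale by the segment
    # In A-law, a set sign bit (after the XOR) marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(raw: bytes) -> list[int]:
    """Decode a buffer of raw A-law bytes (assumed header-free) to linear samples."""
    return [alaw_to_linear(b) for b in raw]
```

For two-channel files, samples would typically be interleaved, so the decoded list would need to be split by channel. That layout is an assumption here and should be checked against the documentation accompanying the release.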

Updates

No updates at this time.
