MATERIAL Somali-English Language Pack

Item Name: MATERIAL Somali-English Language Pack
Author(s): Zeinab Abdi, Zahra Ali, Aric Bills, Judith Bishop, Anne Boyle, Sarra Chouder, Nathaniel Clair, Tom Conners, Cassian Corey, Eyal Dubinski, Corinna Ellis, Jess Fernando, Paul Gibby, Farah H Abdi, Simon Hammond, Maxime Hubert, Alice Kaiser-Schatzlein, Michael Kazi, Julie Lam, Rosie Lazar, Hanh Le, Michael Levot, Nicolas Malyska, Jennifer Melot, Alyssa Mensch, Abdulkadir Arale Omar, Shelley Paget, Frederick Richardson, Carl Rubino, Bern Samko, Gregory Sanders, Stephanie Soh, Tania E. Strahan, Jonathan Taylor, Brian Thompson, Audrey Tong, Richard Tong, Julie Yelle, Jennifer Yu, Ilya Zavorin
LDC Catalog No.: LDC2024S10
ISLRN: 462-281-226-328-3
DOI: https://doi.org/10.35111/5550-f323
Release Date: September 16, 2024
Member Year(s): 2024
DCMI Type(s): Sound, Text
Sample Type: alaw
Sample Rate: 8000
Data Source(s): telephone conversations
Application(s): information retrieval, speech recognition
Language(s): Somali, English
Language ID(s): som, eng
License(s): MATERIAL Somali-English Agreement (For-Profit)
MATERIAL Somali-English Agreement (Non-Member)
MATERIAL Somali-English Agreement (Not-For-Profit)
Online Documentation: LDC2024S10 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Abdi, Zeinab, et al. MATERIAL Somali-English Language Pack LDC2024S10. Web Download. Philadelphia: Linguistic Data Consortium, 2024.

Introduction

MATERIAL Somali-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 80 hours of Somali conversational telephone speech, transcripts, English translations, annotations and queries.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

Data

The Somali speech in this release represents the varieties spoken in the Northern and Benaadir dialect regions of Somalia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle.

Transcripts cover approximately 10% of the speech data, and approximately 4% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.

MATERIAL Somali-English Language Pack also includes domain annotations, English queries, and their relevance annotations. Annotators marked transcripts by domain (e.g., lifestyle, business-and-commerce, sports, education), by query type (simple, conceptual, hybrid), and by their relevance to query search terms.

Speech data is presented as either two-channel WAV or single-channel SPHERE files, predominantly in 8 kHz A-law format, with some WAV files at a sample rate of 48 kHz. All text data is UTF-8 encoded.
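
For readers unfamiliar with A-law, the following is a minimal sketch of how 8-bit A-law samples expand to 16-bit linear PCM under the standard ITU-T G.711 rule. It assumes the raw A-law payload bytes have already been extracted from the WAV or SPHERE container; the function names alaw_to_linear and decode_alaw are illustrative and are not part of this release or its documentation.

```python
def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    byte ^= 0x55                      # undo the even-bit inversion applied by the encoder
    magnitude = (byte & 0x0F) << 4    # 4-bit mantissa, shifted into position
    segment = (byte & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8                # linear region: half-step offset only
    else:
        magnitude += 0x108            # add implicit leading bit plus half-step
        magnitude <<= segment - 1     # scale by the segment
    # In A-law, a set sign bit (after inversion) marks a positive sample.
    return magnitude if byte & 0x80 else -magnitude


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a raw A-law byte string (hypothetical helper) to linear samples."""
    return [alaw_to_linear(b) for b in payload]


if __name__ == "__main__":
    # Smallest and largest positive codes, for illustration.
    print(decode_alaw(bytes([0xD5, 0xAA])))
```

For two-channel files, samples from the two channels are typically interleaved, so the decoded list would need to be split by channel before further processing; the documentation accompanying this release is the authority on the actual file layout.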

Updates

None at this time.
