MATERIAL Tagalog-English Language Pack
| Item Name: | MATERIAL Tagalog-English Language Pack |
| Author(s): | Aric Bills, Judith Bishop, Anne Boyle, Sarra Chouder, Nathaniel Clair, Tom Conners, Cassian Corey, Eyal Dubinski, Corinna Ellis, Paul Gibby, Simon Hammond, Maxime Hubert, Vanessa Jaralve, Alice Kaiser-Schatzlein, Michael Kazi, Julie Lam, Rosie Lazar, Hanh Le, Michael Levot, Nicolas Malyska, Jennifer Melot, Alyssa Mensch, Valerie Novak, Shelley Paget, Fred Richardson |
| LDC Catalog No.: | LDC2026S05 |
| ISLRN: | 601-036-215-592-1 |
| DOI: | https://doi.org/10.35111/krnv-3035 |
| Release Date: | April 15, 2026 |
| Member Year(s): | 2026 |
| DCMI Type(s): | Sound, Text |
| Sample Type: | alaw |
| Sample Rate: | 8000 |
| Data Source(s): | telephone conversations |
| Application(s): | information retrieval, speech recognition |
| Language(s): | Tagalog, English |
| Language ID(s): | tgl, eng |
| License(s): | MATERIAL Tagalog-English Agreement (For-Profit), MATERIAL Tagalog-English Agreement (Non-Member), MATERIAL Tagalog-English Agreement (Not-For-Profit) |
| Online Documentation: | LDC2026S05 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Bills, Aric, et al. MATERIAL Tagalog-English Language Pack LDC2026S05. Web Download. Philadelphia: Linguistic Data Consortium, 2026. |
Introduction
MATERIAL Tagalog-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations and queries.
The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.
Data
The Tagalog speech in this release represents that spoken in the North, Central and South dialect regions in the Philippines. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 62 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
Transcripts cover approximately 30% of the speech data, and approximately 2% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.
MATERIAL Tagalog-English Language Pack also includes domain annotations, English queries, and relevance annotations for those queries. Annotators marked transcripts by domain (e.g., lifestyle, business-and-commerce, sports, education), by query type (simple, conceptual, hybrid), and by their relevance to query search terms.
Speech data is presented as either two-channel WAV or single-channel SPHERE files, predominantly in 8 kHz A-law format. Some WAV files are 48 kHz and single-channel. All text data is UTF-8 encoded.
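Because the WAV files mix encodings, channel counts, and sample rates, it can help to inspect each file's header before processing. The following is a minimal Python sketch, not part of the release documentation. It reads the standard RIFF/WAVE `fmt ` chunk directly, since Python's built-in `wave` module accepts only PCM-encoded files. It also decodes a single G.711 A-law byte to a 16-bit linear sample. The function names `read_wav_format` and `alaw_to_pcm16` are hypothetical, and SPHERE files are not handled.

```python
import io
import struct

# Format tags defined by the RIFF/WAVE specification.
FORMAT_NAMES = {1: "pcm", 6: "alaw", 7: "mulaw"}


def read_wav_format(stream):
    """Return (encoding, channels, sample_rate) from a RIFF/WAVE header."""
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            tag, channels, rate = struct.unpack("<HHI", fmt[:8])
            return FORMAT_NAMES.get(tag, f"tag {tag}"), channels, rate
        # Skip any other chunk; chunk bodies are padded to an even length.
        stream.seek(chunk_size + (chunk_size & 1), 1)


def alaw_to_pcm16(code):
    """Decode one G.711 A-law byte (0-255) to a signed 16-bit sample."""
    code ^= 0x55                      # undo the even-bit inversion
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude
```

To check a file on disk, open it in binary mode and pass the file object, for instance `read_wav_format(open(path, "rb"))`, where `path` is a placeholder for a WAV file in the release. For the predominant format described above, the expected result would be `("alaw", 2, 8000)`.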
Updates
No updates at this time.