MATERIAL Farsi-English Language Pack

Item Name: MATERIAL Farsi-English Language Pack
Author(s): Aric Bills, Sarra Chouder, Cassian Corey, Marjan Davoodian, Eyal Dubinski, Corinna Ellis, Reza Farnam, Paul Gibby, Luke Hartwig, Dagmara Kalnins, Michael Kazi, Julie Lam, Hanh Le, Nicolas Malyska, Sarah Marvi, Sara McConnell, Jennifer Melot, Alyssa Mensch, Alex Moore, Michelle Morrison, Shelley Paget, Frederick Richardson, Annette Roberts, Carl Rubino, Marjan Sadeghi Moaddel, Bern Samko, Kenneth Saw, Pradeepti Sen, Rosanna Smith, Jonathan Taylor, Brian Thompson, Audrey Tong, Richard Tong, Andrew Weller, Sasha Wilmoth, Jennifer Yu, Ilya Zavorin
LDC Catalog No.: LDC2024S13
ISLRN: 202-347-751-598-9
DOI: https://doi.org/10.35111/7dhe-8213
Release Date: December 16, 2024
Member Year(s): 2024
DCMI Type(s): Sound, Text
Sample Type: alaw
Sample Rate: 8000
Data Source(s): telephone conversations
Application(s): information retrieval, speech recognition
Language(s): English, Persian
Language ID(s): eng, fas
License(s): MATERIAL Farsi-English Agreement (For-Profit), MATERIAL Farsi-English Agreement (Non-Member), MATERIAL Farsi-English Agreement (Not-For-Profit)
Online Documentation: LDC2024S13 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Bills, Aric, et al. MATERIAL Farsi-English Language Pack LDC2024S13. Web Download. Philadelphia: Linguistic Data Consortium, 2024.

Introduction

MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations and queries.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

Data

The Farsi speech in this release represents that spoken in the Greater Tehran, Central/Southwest, Northeast, and Northwest dialect regions of Iran, as well as a standard formal dialect in use throughout the country. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts cover approximately a third of the speech data, and approximately 3% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.
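As a rough illustration of what these proportions imply, the sketch below converts them to hours. It assumes the approximate figures quoted in this description (61 hours, one third, 3%) can be taken at face value; the results are estimates only, not official counts, and the constant and function names are made up for this example.

```python
# Approximate figures quoted in the description; treated here as exact
# purely for illustration.
TOTAL_HOURS = 61.0
TRANSCRIBED_FRACTION = 1 / 3
TRANSLATED_FRACTION = 0.03


def estimated_hours(total_hours: float, fraction: float) -> float:
    """Return the share of the total duration covered by a fraction."""
    return total_hours * fraction


transcribed_hours = estimated_hours(TOTAL_HOURS, TRANSCRIBED_FRACTION)
translated_hours = estimated_hours(TOTAL_HOURS, TRANSLATED_FRACTION)

print(f"Transcribed: about {transcribed_hours:.1f} hours")
print(f"Translated:  about {translated_hours:.1f} hours")
```

Under these assumptions, roughly 20 hours of speech are transcribed and just under 2 hours are translated.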

MATERIAL Farsi-English Language Pack also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to the query search terms.

Speech data is presented as either two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. All text data is UTF-8 encoded.
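For readers unfamiliar with A-law, the following minimal sketch shows how 8-bit A-law samples can be expanded to 16-bit linear PCM using the standard ITU-T G.711 decoding rule, and how an interleaved two-channel stream can be split by channel. It assumes the raw sample bytes have already been extracted from the file container; header parsing is not shown, and the function names are illustrative rather than part of this release. Note that Python's standard `wave` module reads only PCM-encoded WAV files, so a third-party audio library or a custom header parser would be needed to obtain the payload.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a 16-bit linear PCM value (G.711)."""
    code ^= 0x55                      # undo the even-bit inversion
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(payload: bytes, channels: int = 2) -> list[list[int]]:
    """Decode interleaved A-law bytes into one sample list per channel."""
    samples = [alaw_to_linear(b) for b in payload]
    return [samples[ch::channels] for ch in range(channels)]


if __name__ == "__main__":
    # Four hypothetical bytes: two frames of a two-channel stream.
    left, right = decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]), channels=2)
    print(left, right)
```

For the single-channel files, the same function can be called with `channels=1`. At 8000 samples per second and one byte per sample, each channel occupies 8000 bytes per second of audio.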

Updates

None at this time.