Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)

Item Name: Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
Author(s): Mohamed Maamouri, Tim Buckwalter, Hubert Jin
LDC Catalog No.: LDC2005S14
ISBN: 1-58563-342-9
ISLRN: 546-803-428-857-5
DOI: https://doi.org/10.35111/a75r-qp57
Release Date: June 15, 2005
Member Year(s): 2005
DCMI Type(s): Sound, Text
Sample Rate: 8000 Hz
Data Source(s): telephone conversations
Project(s): EARS, GALE
Language(s): North Levantine Arabic, South Levantine Arabic
Language ID(s): apc, ajp
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2005S14 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Maamouri, Mohamed, Tim Buckwalter, and Hubert Jin. Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) LDC2005S14. Web Download. Philadelphia: Linguistic Data Consortium, 2005.

Introduction

Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) was developed by the Linguistic Data Consortium (LDC) and contains approximately 138 hours of conversational telephone speech in Levantine Arabic and the associated transcripts.

Data

This release contains 901 calls in total. The majority of speakers in this corpus are Lebanese. The data is similar to the training data in Set 3: Arabic CTS Levantine Fisher Training Data Set 3, Speech (LDC2005S07) and Arabic CTS Levantine Fisher Training Data Set 3, Transcripts. The following table breaks down dialect and gender distribution for all 901 calls; the counts total 1,802, which corresponds to two speakers per call:

Dialect       Number of Calls   Females   Males
Jordanian                 171        71     100
Lebanese                 1373       511     862
Palestinian               229        71     158
Syrian                     29        12      17
Totals                   1802       665    1137

All calls are 2-channel u-law SPHERE files with a sample rate of 8 kHz. All transcripts are in UTF-8 format. The corpus also includes a word list with frequencies of occurrence. The list maps every word, in its pronunciation spelling, to its corresponding canonical form, and gives its raw frequency (the number of times it appears in the corpus) and its source document frequency (the number of documents in the corpus in which it appears).
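
The two frequency measures in the word list can be illustrated with a minimal sketch. This is not the procedure LDC used to build the list. The function name word_frequencies, the whitespace tokenization, and the toy document ids and tokens are all assumptions made for illustration; real transcripts in this corpus may carry timestamps or speaker labels that would need to be stripped first.

```python
from collections import Counter


def word_frequencies(documents):
    """Return (raw_freq, doc_freq) for a mapping of document id -> text.

    raw_freq[w] is the number of times w appears across all documents.
    doc_freq[w] is the number of documents in which w appears at least once.
    """
    raw_freq = Counter()
    doc_freq = Counter()
    for text in documents.values():
        tokens = text.split()         # assumption: whitespace-separated tokens
        raw_freq.update(tokens)       # count every occurrence
        doc_freq.update(set(tokens))  # count each word once per document
    return raw_freq, doc_freq


# Hypothetical toy documents; these are placeholders, not corpus data.
docs = {
    "doc_a": "alpha beta alpha",
    "doc_b": "beta gamma",
}
raw, df = word_frequencies(docs)
```

In this toy input, "alpha" has a raw frequency of 2 but a document frequency of 1, while "beta" has a raw frequency of 2 and a document frequency of 2, which is the distinction the word list records.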

Samples

For an example of the data in this corpus, please view this audio sample (SPH) and transcript sample (TXT).

Updates

None at this time.
