
Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)

Item Name:	Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
Author(s):	Mohamed Maamouri, Tim Buckwalter, Hubert Jin
LDC Catalog No.:	LDC2005S14
ISBN:	1-58563-342-9
ISLRN:	546-803-428-857-5
DOI:	https://doi.org/10.35111/a75r-qp57
Release Date:	June 15, 2005
Member Year(s):	2005
DCMI Type(s):	Sound, Text
Sample Rate:	8000 Hz
Data Source(s):	telephone conversations
Project(s):	EARS, GALE
Language(s):	North Levantine Arabic, South Levantine Arabic
Language ID(s):	apc, ajp
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2005S14 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Maamouri, Mohamed, Tim Buckwalter, and Hubert Jin. Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) LDC2005S14. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:	isContinuationOf: LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech; LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts. hasContinuation: LDC2006S29 Levantine Arabic QT Training Data Set 5, Speech; LDC2006T07 Levantine Arabic QT Training Data Set 5, Transcripts.

Introduction

Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) was developed by the Linguistic Data Consortium (LDC) and contains approximately 138 hours of conversational telephone speech in Levantine Arabic and the associated transcripts.

Data

This release contains 901 calls in total. The majority of speakers in this corpus are Lebanese. The data is similar to the training data in Set 3: Arabic CTS Levantine Fisher Training Data Set 3, Speech (LDC2005S07) and Arabic CTS Levantine Fisher Training Data Set 3, Transcripts (LDC2005T03). The table below breaks down the dialect and gender distribution for all 901 calls. The counts total 1,802, which is two speakers per call, so each row appears to count speakers (call sides) rather than calls:

Dialect	Number of Speakers	Females	Males
Jordanian	171	71	100
Lebanese	1373	511	862
Palestinian	229	71	158
Syrian	29	12	17
Totals	1802	665	1137
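
The arithmetic behind the table can be checked with a short sketch. The figures are copied from the rows above; the reading of the counts as speakers (two per call) is an inference from the totals rather than something the catalog entry states, and the names `rows`, `column_totals`, and `shares` are illustrative only.

```python
# Rows copied from the table above: dialect -> (total, females, males).
rows = {
    "Jordanian": (171, 71, 100),
    "Lebanese": (1373, 511, 862),
    "Palestinian": (229, 71, 158),
    "Syrian": (29, 12, 17),
}
CALLS = 901  # number of calls stated for this release


def column_totals(table):
    """Sum each column (total, females, males) across all dialects."""
    return tuple(sum(column) for column in zip(*table.values()))


def row_is_consistent(total, females, males):
    """A row is consistent when females + males equals its total."""
    return females + males == total


totals = column_totals(rows)

# Share of each dialect, as a percentage of all counted speakers.
shares = {
    dialect: round(100 * total / totals[0], 1)
    for dialect, (total, _, _) in rows.items()
}

print(totals)                   # (1802, 665, 1137)
print(totals[0] == 2 * CALLS)   # True: two counted speakers per call
```

Every row satisfies females + males = total, and the Lebanese share comes to roughly three quarters of the counted speakers, which matches the statement that most speakers are Lebanese.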

All the calls are 2-channel u-law SPHERE files with a sample rate of 8 kHz. All the transcripts are in UTF-8 format. The corpus also includes a word list with frequency of occurrences. The list shows every word in its pronunciation spelling mapped to its corresponding canonical form, along with its raw frequency (the number of times it appears in the corpus) and its source document frequency (the number of documents in the corpus in which it appears).
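
The difference between raw frequency and source document frequency can be illustrated with a minimal sketch. It assumes each transcript has already been split into a list of word tokens; the toy documents, the placeholder tokens (`w1`, `w2`, `w3`), and the function name `word_frequencies` are hypothetical and do not reflect the layout of the corpus files or of the distributed word list.

```python
from collections import Counter


def word_frequencies(documents):
    """Return (raw_frequency, document_frequency) counters.

    documents maps a document ID to its list of word tokens.
    """
    raw = Counter()
    doc = Counter()
    for tokens in documents.values():
        raw.update(tokens)       # every occurrence counts
        doc.update(set(tokens))  # each document counts at most once per word
    return raw, doc


# Hypothetical toy data, not taken from the corpus.
toy_documents = {
    "call_a": ["w1", "w2", "w1", "w3"],
    "call_b": ["w1", "w3", "w3"],
    "call_c": ["w2"],
}

raw, doc = word_frequencies(toy_documents)
for word in sorted(raw):
    print(word, raw[word], doc[word])
```

In this toy data `w1` occurs three times but in only two documents, so its raw frequency is 3 and its document frequency is 2.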

Samples

For an example of the data in this corpus, please view this audio sample (SPH) and transcript sample (TXT).

Updates

None at this time.
