Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
|Item Name:||Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)|
|Author(s):||Mohamed Maamouri, Tim Buckwalter, Hubert Jin|
|LDC Catalog No.:||LDC2005S14|
|Release Date:||June 15, 2005|
|DCMI Type(s):||Sound, Text|
|Data Source(s):||telephone conversations|
|Language(s):||North Levantine Arabic, South Levantine Arabic|
|Language ID(s):||apc, ajp|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2005S14 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Maamouri, Mohamed, Tim Buckwalter, and Hubert Jin. Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) LDC2005S14. Web Download. Philadelphia: Linguistic Data Consortium, 2005.|
Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) was developed by the Linguistic Data Consortium (LDC) and contains approximately 138 hours of conversational telephone speech in Levantine Arabic and the associated transcripts.
This release contains 901 calls in total. The majority of speakers in this corpus are Lebanese. The data is similar to the Set 3 training data: Arabic CTS Levantine Fisher Training Data Set 3, Speech (LDC2005S07) and Arabic CTS Levantine Fisher Training Data Set 3, Transcripts. The dialect and gender distribution for all 901 calls is as follows:
|Dialect||Number of Calls||Females||Males|
All the calls are 2-channel ulaw SPHERE files with a sample rate of 8 kHz. All the transcripts are in UTF-8 format. The corpus also includes a word list with frequencies of occurrence. The list shows every word in its pronunciation spelling mapped to its corresponding canonical form, together with its raw frequency (the number of times it appears in the corpus) and its source document frequency (the number of documents in the corpus in which it appears).
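The difference between the two frequency counts can be illustrated with a minimal Python sketch. This is not LDC tooling, and it does not read the corpus word list, whose file layout is not described here. The function name `word_frequencies`, the document IDs, and the placeholder tokens are all hypothetical, and the sketch assumes the transcripts have already been split into word tokens.

```python
from collections import Counter

def word_frequencies(documents):
    """Return (raw_freq, doc_freq) for a dict of document ID -> token list.

    raw_freq counts every occurrence of a word across all documents;
    doc_freq counts the number of documents that contain the word.
    """
    raw_freq = Counter()
    doc_freq = Counter()
    for tokens in documents.values():
        raw_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts at most once per word
    return raw_freq, doc_freq

# Hypothetical toy input: two "documents" with placeholder tokens.
documents = {
    "call_001": ["w1", "w2", "w1"],
    "call_002": ["w1", "w3"],
}

raw_freq, doc_freq = word_frequencies(documents)
print(raw_freq["w1"], doc_freq["w1"])  # raw frequency 3, document frequency 2
```

In this toy input, `w1` occurs three times overall but in only two documents, so its raw frequency is 3 and its source document frequency is 2.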