Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
Item Name: | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) |
Author(s): | Mohamed Maamouri, Tim Buckwalter, Hubert Jin |
LDC Catalog No.: | LDC2005S14 |
ISBN: | 1-58563-342-9 |
ISLRN: | 546-803-428-857-5 |
DOI: | https://doi.org/10.35111/a75r-qp57 |
Release Date: | June 15, 2005 |
Member Year(s): | 2005 |
DCMI Type(s): | Sound, Text |
Sample Rate: | 8000 Hz |
Data Source(s): | telephone conversations |
Project(s): | EARS, GALE |
Language(s): | North Levantine Arabic, South Levantine Arabic |
Language ID(s): | apc, ajp |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2005S14 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Maamouri, Mohamed, Tim Buckwalter, and Hubert Jin. Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) LDC2005S14. Web Download. Philadelphia: Linguistic Data Consortium, 2005. |
Related Works: | View |
Introduction
Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) was developed by the Linguistic Data Consortium (LDC) and contains approximately 138 hours of conversational telephone speech in Levantine Arabic and the associated transcripts.
Data
This release contains 901 calls in total. The majority of speakers in this corpus are Lebanese. The data is similar to the training data in Set 3: Arabic CTS Levantine Fisher Training Data Set 3, Speech (LDC2005S07) and Arabic CTS Levantine Fisher Training Data Set 3, Transcripts. The table below breaks down dialect and gender across all 901 calls. Its counts sum to 1802, twice the number of calls, which suggests that each of the two sides of a call is counted separately:
Dialect | Number of Call Sides | Females | Males |
---|---|---|---|
Jordanian | 171 | 71 | 100 |
Lebanese | 1373 | 511 | 862 |
Palestinian | 229 | 71 | 158 |
Syrian | 29 | 12 | 17 |
Totals | 1802 | 665 | 1137 |
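The totals in the table can be checked with a few lines of Python. This is only a sanity-check sketch: the per-dialect figures are copied from the table, and the reading that each row counts call sides (two per call) is an inference from 1802 = 2 × 901, not something the corpus documentation states here.

```python
# Per-dialect counts copied from the table above: (females, males).
counts = {
    "Jordanian": (71, 100),
    "Lebanese": (511, 862),
    "Palestinian": (71, 158),
    "Syrian": (12, 17),
}

CALLS = 901  # number of calls stated in the Data section

females = sum(f for f, _ in counts.values())
males = sum(m for _, m in counts.values())
total_sides = females + males

# Assumption: every call contributes exactly two sides (one per channel).
sides_per_call = total_sides / CALLS

print(females, males, total_sides, sides_per_call)
```

Each row's female and male counts also add up to that row's total, so the table is internally consistent under this reading.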
All the calls are 2-channel u-law SPHERE files with a sample rate of 8 kHz. All the transcripts are in UTF-8 format. The corpus also includes a word list with frequencies of occurrence. The list maps every word, in its pronunciation spelling, to its corresponding canonical form, and gives both its raw frequency (the number of times it appears in the corpus) and its source document frequency (the number of documents in which it appears).
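The difference between raw frequency and source document frequency can be illustrated with a small sketch. The function name `word_frequencies`, the document IDs, and the placeholder tokens are hypothetical; the actual layout of the word list and the transcript files is not described here, so this only shows how the two counts are defined, not how LDC produced them.

```python
from collections import Counter

def word_frequencies(documents):
    """Return (raw_frequency, document_frequency) for tokenized documents.

    `documents` maps a document ID to its list of tokens.
    """
    raw = Counter()  # total occurrences across the whole corpus
    doc = Counter()  # number of documents containing the word
    for tokens in documents.values():
        raw.update(tokens)
        doc.update(set(tokens))  # count each word once per document
    return raw, doc

# Hypothetical toy corpus with placeholder tokens.
docs = {
    "call_a": ["w1", "w2", "w1"],
    "call_b": ["w1", "w3"],
}
raw, doc = word_frequencies(docs)
print(raw["w1"], doc["w1"])
```

In this toy corpus `w1` occurs three times overall but in only two documents, so its raw frequency is 3 and its document frequency is 2.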
Samples
For an example of the data in this corpus, please view this audio sample (SPH) and transcript sample (TXT).
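Before decoding an SPH file, it can help to inspect its NIST SPHERE header, which is plain ASCII. The sketch below is a minimal parser, not an official tool. The function name `parse_sphere_header` is hypothetical, and the demo header is synthetic: it was built to match the properties stated above (2 channels, 8 kHz, u-law), so real files from the corpus will likely carry additional fields.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("not a NIST SPHERE header")
    # The second line holds the header size in bytes (commonly 1024).
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) != 3:
            continue
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N bytes
            fields[name] = value
    return fields

# Synthetic header for illustration only (not taken from the corpus).
sample = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024, b" ")

header = parse_sphere_header(sample)
print(header)
```

For a real file, passing the first 1024 bytes read in binary mode should be enough when the header has the usual size.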
Updates
None at this time.