ARL Urdu Speech Database, Training Data

Item Name:	ARL Urdu Speech Database, Training Data
Author(s):	Appen Pty Ltd
LDC Catalog No.:	LDC2007S03
ISBN:	1-58563-412-3
ISLRN:	513-040-223-174-0
DOI:	https://doi.org/10.35111/6z57-s580
Release Date:	February 20, 2007
Member Year(s):	2007
DCMI Type(s):	Sound, Text
Sample Type:	pcm
Sample Rate:	22050
Data Source(s):	microphone speech
Language(s):	Urdu
Language ID(s):	urd
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2007S03 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Appen Pty Ltd. ARL Urdu Speech Database, Training Data LDC2007S03. Web Download. Philadelphia: Linguistic Data Consortium, 2007.
Related Works:	relatesTo LDC2010S07 Asian Spoken Language Sampler

Introduction

ARL Urdu Speech Database, Training Data is a collection of recorded speech with transcripts from 200 adult native Urdu speakers from Pakistan and Northern India. It was developed in 2006 by Appen Pty Ltd, Sydney, Australia. The U.S. Army Research Laboratory (ARL) provided this corpus to the Linguistic Data Consortium for distribution.

Urdu is an Indo-Aryan language spoken throughout South Asia that developed under the Mughal Empire and Delhi Sultanate between 1200 AD and 1800 AD. It has Persian, Turkish and Arabic influences, but is in fact a dialect of Hindustani. The word "Urdu" refers to the standardized register of Hindustani, but there are many non-standard idiolects as well. Urdu is the twentieth most spoken language in the world. It is the native language of over 60 million people, the official language of Pakistan, and one of India's national languages. Urdu is also spoken in Afghanistan.

The distribution of speaker dialects in the corpus is as follows:

Accent	Number of Speakers
South Sindh	29
North Sindh	30
South Punjab	27
North Punjab	29
Capital Area	29
North West Regions	30
Baluchistan	26
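
As a quick consistency check, the per-accent counts in the table can be tallied and compared with the 200 speakers stated in the introduction. The following is a minimal sketch; the dictionary name and the percentage rounding are illustrative choices, not part of the corpus documentation.

```python
# Speaker counts copied from the accent table above.
speakers_by_accent = {
    "South Sindh": 29,
    "North Sindh": 30,
    "South Punjab": 27,
    "North Punjab": 29,
    "Capital Area": 29,
    "North West Regions": 30,
    "Baluchistan": 26,
}

# Total number of speakers across all accent groups.
total_speakers = sum(speakers_by_accent.values())

# Share of each accent group, as a percentage rounded to one decimal place.
accent_shares = {
    accent: round(100 * count / total_speakers, 1)
    for accent, count in speakers_by_accent.items()
}

if __name__ == "__main__":
    print("Total speakers:", total_speakers)
    for accent, share in accent_shares.items():
        print(f"{accent}: {share}%")
```

The total matches the 200 speakers given in the introduction, and each accent group accounts for between 13% and 15% of the speakers.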

The database is divided into two parts: a training set containing approximately 80% of the data and a test set comprising the remaining 20%. This release consists of approximately 80% of the complete dataset (training and test combined).

Data

Each speaker was presented with 400 prompts to read: sentences, place names, and person names. Two microphones set at different distances from the speaker were used for the recordings. The recorded speech was stored in raw format files, with the headers stored in separate directories.
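
Because the audio is stored as headerless raw files, most tools need the sample format supplied explicitly. The sketch below wraps raw PCM bytes in a WAV container using Python's standard `wave` module. Only the 22050 Hz sample rate comes from the metadata above; the 16-bit sample width, single channel, and little-endian byte order are assumptions that should be checked against the separate header files and the corpus documentation. The function names are hypothetical.

```python
import io
import wave

SAMPLE_RATE = 22050  # from the catalog metadata (Sample Rate: 22050)
SAMPLE_WIDTH = 2     # ASSUMPTION: 16-bit samples (2 bytes); verify in the header files
CHANNELS = 1         # ASSUMPTION: one channel per raw file


def raw_pcm_to_wav(raw_bytes, sample_rate=SAMPLE_RATE,
                   sample_width=SAMPLE_WIDTH, channels=CHANNELS):
    """Wrap headerless PCM bytes in a WAV container and return the WAV bytes.

    WAV stores samples little-endian, so big-endian raw data would need a
    byte swap first (not done here).
    """
    frame_size = sample_width * channels
    # Drop any trailing partial frame so the data length is frame-aligned.
    usable = len(raw_bytes) - (len(raw_bytes) % frame_size)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(raw_bytes[:usable])
    return buffer.getvalue()


def duration_seconds(raw_bytes, sample_rate=SAMPLE_RATE,
                     sample_width=SAMPLE_WIDTH, channels=CHANNELS):
    """Return the duration of headerless PCM audio in seconds."""
    frames = len(raw_bytes) // (sample_width * channels)
    return frames / sample_rate
```

For a file on disk, the bytes read with `open(path, "rb").read()` can be passed to `raw_pcm_to_wav` and the result written out with a `.wav` extension. If the converted audio sounds like noise, the assumed sample width or byte order is the first thing to revisit.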

Each utterance was transcribed in the corresponding label file for each recording. The transcriptions were encoded in UTF-8. Punctuation was omitted and numbers were written out in full.

Update

Earlier versions were missing the content list file. This is now available as part of the complete download file.

09/14/18 - The test data for this corpus was originally held back and is now available as part of the download. New downloads after the indicated date will contain the full corpus.

Samples

For an example of the data in this corpus, please listen to the following audio sample (.wav format).
