
Spoken Digits in Hindi and Indian English

Item Name:	Spoken Digits in Hindi and Indian English
Author(s):	Basabdatta Sen Bhattacharya, Aiswarya Subramanian, Purbayan Chatterjee, Sounak Dey
LDC Catalog No.:	LDC2022S03
ISBN:	1-58563-986-9
ISLRN:	452-404-795-171-3
DOI:	https://doi.org/10.35111/5way-1446
Release Date:	February 15, 2022
Member Year(s):	2022
DCMI Type(s):	Sound
Data Source(s):	field recordings, microphone conversation, web collection
Application(s):	language identification, machine translation, speech recognition
Language(s):	English, Hindi
Language ID(s):	eng, hin
License(s):	Spoken Digits in Hindi and Indian English Agreement
Online Documentation:	LDC2022S03 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Bhattacharya, Basabdatta Sen, et al. Spoken Digits in Hindi and Indian English LDC2022S03. Web Download. Philadelphia: Linguistic Data Consortium, 2022.
Related Works:	relatesTo LDC2023S07 LDC Spoken Language Sampler - Sixth Release

Introduction

Spoken Digits in Hindi and Indian English was developed by the Birla Institute of Technology and Science, Pilani. It contains approximately two hours of speech consisting of spoken digits from one to ten in Hindi and English, with regional accents from across India.

Data

The speech data was collected in three ways: in person, using a recorder app on a mobile handset; through one-to-one online communication over social apps; and from social media sites. Each audio file represents a single spoken digit in either Hindi or Indian English. Background noise was mostly retained, although some data was recorded in a noise-free environment or cleaned after recording to remove abrupt noises such as car horns.

The audio data is organized by number, language and gender. The gender breakdown for speakers is 17% female, 27% male, and 56% unspecified.
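As a rough illustration of working with this organization, the sketch below tallies files per (number, language, gender) combination. The directory layout `<number>/<language>/<gender>/<file>.flac` and the function name `tally_by_label` are assumptions made for the example; the actual layout should be taken from the release documentation.

```python
from collections import Counter
from pathlib import PurePosixPath


def tally_by_label(paths):
    """Count FLAC files per (number, language, gender) label.

    ASSUMPTION: each file sits at <number>/<language>/<gender>/<file>.flac.
    This layout is hypothetical and is not confirmed by the catalog entry.
    """
    counts = Counter()
    for p in paths:
        parts = PurePosixPath(p).parts
        # Skip anything that is not a FLAC file nested at least three levels deep.
        if len(parts) < 4 or not p.endswith(".flac"):
            continue
        number, language, gender = parts[-4:-1]
        counts[(number, language, gender)] += 1
    return counts
```

A tally of this kind makes it easy to check how evenly the digits, languages, and gender labels are represented before training or evaluation.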

This release also includes a Google Colab notebook that can be used for basic processing, such as removing noise or unwanted spaces.

All audio data is presented as single-channel, 16-bit, 16 kHz FLAC-compressed linear PCM.
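The stated format can be checked without any audio library by reading the STREAMINFO block at the start of a FLAC file. The following is a minimal sketch using only the Python standard library. The function names `parse_flac_streaminfo` and `matches_release_format` are illustrative and not part of the release, and the parser assumes the file begins directly with the `fLaC` marker (no prepended ID3 tag).

```python
import struct


def parse_flac_streaminfo(data: bytes) -> dict:
    """Extract basic audio parameters from the first 42 bytes of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    block_type = data[4] & 0x7F               # low 7 bits: metadata block type
    block_len = int.from_bytes(data[5:8], "big")
    if block_type != 0 or block_len < 34 or len(data) < 8 + 34:
        raise ValueError("first metadata block is not a complete STREAMINFO")
    info = data[8:8 + 34]
    # Bytes 10..17 pack: sample rate (20 bits), channels - 1 (3 bits),
    # bits per sample - 1 (5 bits), total samples (36 bits).
    (packed,) = struct.unpack(">Q", info[10:18])
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_release_format(info: dict) -> bool:
    """True if the parameters match 16 kHz, single channel, 16-bit."""
    return (info["sample_rate"], info["channels"], info["bits_per_sample"]) == (16000, 1, 16)
```

In practice, one would open a file in binary mode, pass `f.read(42)` to `parse_flac_streaminfo`, and use `total_samples / sample_rate` to obtain the clip duration in seconds.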
