CSLU: Names Release 1.3

Item Name: CSLU: Names Release 1.3
Author(s): Yeshwant Muthusamy, Ronald Allan Cole, Beatrice Oshika
LDC Catalog No.: LDC2006S39
ISBN: 1-58563-394-1
ISLRN: 972-485-703-759-3
DOI: https://doi.org/10.35111/qyw6-w652
Release Date: July 21, 2006
Member Year(s): 2006
DCMI Type(s): Sound, Text
Sample Type: ulaw
Sample Rate: 8000
Data Source(s): telephone speech
Application(s): speech recognition
Language(s): English
Language ID(s): eng
License(s): CSLU Agreement
Online Documentation: LDC2006S39 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Muthusamy, Yeshwant, Ronald Cole, and Beatrice Oshika. CSLU: Names Release 1.3 LDC2006S39. Web Download. Philadelphia: Linguistic Data Consortium, 2006.

Introduction

CSLU: Names Release 1.3 was developed by the Center for Spoken Language Understanding (CSLU). It contains 24,245 files totalling over 6 hours of name utterances, covering both first and last names, recorded over the telephone from several thousand different speakers, along with transcripts.

A common problem in training and developing speech recognition systems is scarcity of data, especially for particular phonemic contexts. The CSLU is attempting to address this problem with the Names Corpus. Name utterances are "spontaneous" in that the subject is not reading from a word list.

Another area of active research is the development of name recognition systems, for which the Names Corpus is also a useful resource.

Data

The utterances in this corpus were taken from many other telephone speech data collections completed at the CSLU. In most data collections, the callers were asked to leave their name at some point. Callers also occasionally gave their name in the midst of another utterance; in those cases the name was extracted from the host utterance and added to the Names Corpus.

Each file in the Names Corpus has an orthographic transcription following the CSLU Labeling Conventions. To take advantage of the phonemic variability, many of the utterances have also been phonetically transcribed. Files were selected for phonetic transcription by a process that favored those suspected to contain phonetic contexts not yet transcribed.

There are three file formats used in this corpus:

  • The .wav file is a 16-bit, linearly encoded RIFF standard file format.
  • The .txt file is simply an ASCII text file representing the orthographic transcription.
  • The .phn file contains a time-aligned phonetic transcription.
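
As a rough illustration of working with the first two formats, the following Python sketch inspects a .wav header and reads the matching .txt transcription using only the standard library. The function names and the file paths in the usage note are hypothetical and not part of the corpus. The sketch assumes the audio is uncompressed PCM, as the description above suggests; Python's wave module does not read mu-law encoded files, so if a file is actually stored as ulaw (the sample type listed in the metadata), wave.open will raise wave.Error. The .phn layout is not described here, so no parser for it is attempted.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return basic header information for an uncompressed RIFF/WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def read_transcript(path):
    """Return the orthographic transcription as a stripped string."""
    # The .txt files are described as ASCII; unexpected bytes are replaced
    # rather than raising, which is a choice made for this sketch only.
    return Path(path).read_text(encoding="ascii", errors="replace").strip()
```

A call such as describe_wav("example.wav") followed by read_transcript("example.txt"), with placeholder file names, would give the header fields and the transcription text for one utterance.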

Release 1.3 of this corpus contains 24,245 files, all of which have been phonetically labeled. Approximately 40% of the possible bigram phonemic contexts, without regard to language constraints, are represented.
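
One way to read the bigram coverage figure is as the number of distinct adjacent phoneme pairs observed, divided by the number of pairs that could be formed from the phoneme inventory. The Python sketch below computes that ratio under stated assumptions: the function name is hypothetical, the denominator is taken to be the inventory size squared (every ordered pair, including a phoneme followed by itself), and the input is assumed to be already parsed into lists of phoneme labels. The corpus documentation may define the figure differently.

```python
def bigram_coverage(sequences, inventory):
    """Fraction of all ordered phoneme pairs that occur in the sequences.

    sequences: iterable of lists of phoneme labels, one list per utterance.
    inventory: iterable of all phoneme labels considered possible.
    """
    inventory = set(inventory)
    if not inventory:
        raise ValueError("inventory must not be empty")
    seen = set()
    for seq in sequences:
        # Pair each label with its successor within the same utterance.
        for left, right in zip(seq, seq[1:]):
            if left in inventory and right in inventory:
                seen.add((left, right))
    # Assumption: every ordered pair counts as possible, so N * N in total.
    return len(seen) / (len(inventory) ** 2)


# Toy example with a three-label inventory (not real corpus data).
toy = [["a", "b", "c"], ["b", "a"]]
print(bigram_coverage(toy, ["a", "b", "c"]))
```

In the toy example three of the nine ordered pairs occur, so the coverage is one third.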

Samples

For an example of the data in this publication, please listen to this audio sample (WAV) and view its transcription (TXT).

Updates

None at this time.
