CALLHOME American English Lexicon (PRONLEX)

Item Name: CALLHOME American English Lexicon (PRONLEX)
Author(s): Paul Kingsbury, Stephanie Strassel, Cynthia McLemore, Robert MacIntyre
LDC Catalog No.: LDC97L20
ISBN: 1-58563-110-8
ISLRN: 119-159-358-214-6
DOI: https://doi.org/10.35111/dw6k-n819
Member Year(s): 1994, 1995, 1996, 1997
DCMI Type(s): Text
Data Source(s): telephone conversations
Project(s): EARS, GALE, Hub5-LVCSR
Application(s): speech recognition
Language(s): English
Language ID(s): eng
License(s): CALLHOME Lexicon Agreement (Commercial)
CALLHOME Lexicon Agreement (Non-Commercial)
CALLHOME Lexicon Agreement (Non-Member)
Online Documentation: LDC97L20 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Kingsbury, Paul, et al. CALLHOME American English Lexicon (PRONLEX) LDC97L20. Web Download. Philadelphia: Linguistic Data Consortium, 1994.

Introduction

CALLHOME American English Lexicon (PRONLEX) was developed by the Linguistic Data Consortium (LDC) and contains 90,988 English words with citation-form pronunciations. The words in the lexicon were derived from Wall Street Journal text used in the continuous speech recognition publication series (CSR-1 WSJ0 Complete LDC93S6A), transcripts from the Switchboard telephone collection (LDC97S62), and transcripts representing unscripted telephone conversations between native American English speakers contained in CALLHOME American English Speech Transcripts (LDC97T14).

The CALLHOME series consists of telephone conversations, transcripts and lexicons developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.

Data

PRONLEX transcription is a phonemic transcription system designed to support speech recognition. It provides a consistent, simplified representation of how words are pronounced in standard American English. Rather than listing many pronunciation variants for each word, the lexicon gives a single systematic base form, from which variation can be generated later through rules or modeling. The transcription was created using a modified ARPABET phoneme set.

The lexicon contains three tab-separated fields: (1) word: the orthographic representation of the word; (2) pron: the transcribed citation-form pronunciations, using the modified ARPABET phoneme set; and (3) comments: an optional comment on the entry.
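
As an illustration of the three-field layout described above, the following Python sketch splits one lexicon line into its fields. It is a minimal example under stated assumptions, not LDC tooling: the names `LexiconEntry` and `parse_entry` are hypothetical, the sample pronunciation symbols are placeholders rather than real PRONLEX symbols, and the sketch assumes that an absent or empty third field means the entry has no comment.

```python
from typing import NamedTuple, Optional


class LexiconEntry(NamedTuple):
    """One lexicon line (hypothetical container, not part of the release)."""
    word: str               # orthographic form
    pron: str               # citation-form pronunciation field, kept verbatim
    comment: Optional[str]  # optional third field, None when absent or empty


def parse_entry(line: str) -> LexiconEntry:
    """Split a tab-separated lexicon line into word, pron, and comment."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected at least word and pron fields: {line!r}")
    word, pron = fields[0], fields[1]
    # Assumption: the comment column may be missing entirely or left empty.
    comment = fields[2] if len(fields) > 2 and fields[2] else None
    return LexiconEntry(word, pron, comment)
```

Keeping the pron field as an unparsed string avoids guessing how individual phoneme symbols are delimited; the release documentation should be consulted before splitting it further.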

The lexicon is presented as a tab-delimited (TSV) file encoded in UTF-8. This release also includes a pronunciation dictionary derived from the lexicon, in CMUdict format and likewise encoded in UTF-8.
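
The relationship between the TSV lexicon and a CMUdict-style dictionary can be sketched roughly as follows. This is an illustrative sketch only: the function names are hypothetical, the sample symbols are placeholders, and the output layout (word, two spaces, then space-separated symbols) follows the classic CMUdict convention and may not match the dictionary shipped with this release. It also assumes the pron field already holds whitespace-separated symbols.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO, Tuple


def read_lexicon(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (word, pron) pairs from a tab-delimited lexicon stream."""
    # QUOTE_NONE keeps quote characters in the data as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2 or not row[0]:
            continue  # skip blank or malformed lines
        yield row[0], row[1]  # any third (comment) field is ignored here


def to_cmudict_lines(entries: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Format entries as 'WORD  SYM SYM ...' (assumed CMUdict-style layout)."""
    for word, pron in entries:
        symbols = " ".join(pron.split())  # normalize internal whitespace
        yield f"{word}  {symbols}"


if __name__ == "__main__":
    # Placeholder data standing in for real lexicon lines.
    sample = io.StringIO("word\tP1 P2\toptional note\nother\tP3\n")
    for line in to_cmudict_lines(read_lexicon(sample)):
        print(line)
```

To read the actual file, the in-memory sample would be replaced with something like `open(path, encoding="utf-8", newline="")`, where `path` points at the lexicon file in the release.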

Corresponding transcripts (LDC97T14) and the telephone speech dataset (LDC97S42) are available separately.

Samples

Please view this sample page.

Updates

There are no updates at this time.
