CALLHOME American English Lexicon (PRONLEX)

Item Name:	CALLHOME American English Lexicon (PRONLEX)
Author(s):	Paul Kingsbury, Stephanie Strassel, Cynthia McLemore, Robert MacIntyre
LDC Catalog No.:	LDC97L20
ISBN:	1-58563-110-8
ISLRN:	119-159-358-214-6
DOI:	https://doi.org/10.35111/dw6k-n819
Member Year(s):	1994, 1995, 1996, 1997
DCMI Type(s):	Text
Data Source(s):	telephone conversations
Project(s):	EARS, GALE, Hub5-LVCSR
Application(s):	language documentation, parsing, phonology, speech recognition, speech synthesis
Language(s):	English
Language ID(s):	eng
License(s):	CALLHOME Lexicon Agreement (Commercial); CALLHOME Lexicon Agreement (Non-Commercial); CALLHOME Lexicon Agreement (Non-Member)
Online Documentation:	LDC97L20 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Kingsbury, Paul, et al. CALLHOME American English Lexicon (PRONLEX) LDC97L20. Web Download. Philadelphia: Linguistic Data Consortium, 1994.
Related Works:	hasVersion: LDC2026L05 CALLHOME American English Lexicon (PRONLEX) Second Edition; hasPart: LDC99L23 American English Spoken Lexicon; isOutcomeOf: LDC97T14 CALLHOME American English Transcripts, LDC2026S08 CALLHOME American English Second Edition; relatesTo: LDC97S42 CALLHOME American English Speech, LDC98L21 COMLEX English Syntax Lexicon

Introduction

CALLHOME American English Lexicon (PRONLEX) was developed by the Linguistic Data Consortium (LDC) and contains 90,988 English words with citation-form pronunciations. The words in the lexicon were derived from Wall Street Journal text used in the continuous speech recognition publication series (CSR-1 WSJ0 Complete LDC93S6A), transcripts from the Switchboard telephone collection (LDC97S62), and transcripts representing unscripted telephone conversations between native American English speakers contained in CALLHOME American English Speech Transcripts (LDC97T14).

The CALLHOME series consists of telephone conversations, transcripts and lexicons developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.

Data

PRONLEX transcription is a phonemic transcription system designed to support speech recognition. It provides a consistent, simplified representation of how words are pronounced in standard American English. Rather than listing many pronunciation variants for each word, it gives a single systematic base form from which variation can be generated later through rules or modeling. The transcription uses a modified ARPABET phoneme set.

The lexicon contains three tab-separated fields: (1) word: the orthographic representation of the word; (2) pron: the transcribed citation-form pronunciation, using a modified ARPABET phoneme set; and (3) comments: an optional comment on the entry.
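
As a rough illustration of the three-field layout described above, the following Python sketch splits one lexicon line into its fields. The names `LexiconEntry` and `parse_entry` are hypothetical, and the sample values in the usage note are placeholders rather than real entries. The pron field is kept as an unsplit string because the internal notation of the modified ARPABET transcription is not described here.

```python
from typing import NamedTuple, Optional


class LexiconEntry(NamedTuple):
    """One lexicon line (hypothetical container, not part of the release)."""
    word: str               # orthographic form
    pron: str               # citation-form pronunciation, left unparsed
    comment: Optional[str]  # optional third field, None when absent or empty


def parse_entry(line: str) -> LexiconEntry:
    """Split a tab-separated line into word, pron, and optional comment."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-separated fields, got {len(fields)}")
    word, pron = fields[0], fields[1]
    comment = fields[2] if len(fields) == 3 and fields[2] else None
    return LexiconEntry(word, pron, comment)
```

For example, `parse_entry("WORD\tPRON\tsome note")` yields an entry whose `comment` is `"some note"`, while a two-field line yields `comment=None`.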

The lexicon is presented as a tab-delimited (TSV) file encoded in UTF-8. This release also includes a pronunciation dictionary derived from the lexicon, in CMUdict format and also encoded in UTF-8.
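
A minimal sketch of reading the TSV file into a word-to-pronunciations mapping might look like the following. It assumes that the file has no header row and that a word with several pronunciations appears on several lines; neither is confirmed by this description. The function name `load_lexicon` and the sample rows are hypothetical placeholders.

```python
import io
from collections import defaultdict


def load_lexicon(stream):
    """Group pronunciations by word from a tab-delimited text stream.

    Assumes no header row and one pronunciation per line (both assumptions).
    The optional comment field is ignored here.
    """
    prons = defaultdict(list)
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"malformed line: {line!r}")
        prons[fields[0]].append(fields[1])
    return dict(prons)


# Placeholder rows standing in for real lexicon content.
sample = io.StringIO("word1\tP1 P2\nword1\tP1 P3\tvariant\nword2\tP4\n")
lexicon = load_lexicon(sample)
```

With the actual file, the stream would come from something like `open(path, encoding="utf-8")`, where `path` is wherever the TSV file was unpacked.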

Corresponding transcripts (LDC97T14) and the telephone speech dataset (LDC97S42) are available separately.
