CALLHOME American English Lexicon (PRONLEX)
| Item Name: | CALLHOME American English Lexicon (PRONLEX) |
| Author(s): | Paul Kingsbury, Stephanie Strassel, Cynthia McLemore, Robert MacIntyre |
| LDC Catalog No.: | LDC97L20 |
| ISBN: | 1-58563-110-8 |
| ISLRN: | 119-159-358-214-6 |
| DOI: | https://doi.org/10.35111/dw6k-n819 |
| Member Year(s): | 1994, 1995, 1996, 1997 |
| DCMI Type(s): | Text |
| Data Source(s): | telephone conversations |
| Project(s): | EARS, GALE, Hub5-LVCSR |
| Application(s): | speech recognition |
| Language(s): | English |
| Language ID(s): | eng |
| License(s): | CALLHOME Lexicon Agreement (Commercial); CALLHOME Lexicon Agreement (Non-Commercial); CALLHOME Lexicon Agreement (Non-Member) |
| Online Documentation: | LDC97L20 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Kingsbury, Paul, et al. CALLHOME American English Lexicon (PRONLEX) LDC97L20. Web Download. Philadelphia: Linguistic Data Consortium, 1994. |
Introduction
CALLHOME American English Lexicon (PRONLEX) was developed by the Linguistic Data Consortium (LDC) and contains 90,988 English words with citation-form pronunciations. The words in the lexicon were derived from Wall Street Journal text used in the continuous speech recognition publication series (CSR-1 WSJ0 Complete LDC93S6A), transcripts from the Switchboard telephone collection (LDC97S62), and transcripts representing unscripted telephone conversations between native American English speakers contained in CALLHOME American English Speech Transcripts (LDC97T14).
The CALLHOME series consists of telephone conversations, transcripts and lexicons developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.
Data
PRONLEX transcription is a phonemic transcription system designed to support speech recognition. It provides a consistent, simplified representation of how words are pronounced in standard American English. Rather than listing many pronunciation variants for each word, the lexicon gives a single systematic base form from which variation can be generated later through rules or modeling. The transcription was created using a modified ARPABET phoneme set.
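To illustrate how a single base form might be expanded by rule, the minimal sketch below applies optional rewrite rules to a phoneme sequence and collects every resulting variant. The rule shown (a `t` between vowels optionally realized as a flap `dx`), the vowel subset, and the function name `expand_variants` are illustrative assumptions. They are not taken from the PRONLEX distribution or its documentation.

```python
from itertools import product

# Illustrative subset of vowel symbols (an assumption, not the PRONLEX inventory).
VOWELS = {"aa", "ae", "ah", "ax", "er", "ih", "iy", "uw"}

# Hypothetical optional rule: base phoneme -> alternative realization.
OPTIONAL_RULES = {"t": "dx"}


def expand_variants(base):
    """Return every pronunciation obtained by optionally applying the rules.

    `base` is a list of phoneme symbols. A rule applies only when the
    phoneme sits between two vowels; the base form itself is always kept.
    """
    choices = []
    for i, phoneme in enumerate(base):
        between_vowels = (
            0 < i < len(base) - 1
            and base[i - 1] in VOWELS
            and base[i + 1] in VOWELS
        )
        if phoneme in OPTIONAL_RULES and between_vowels:
            # Keep the base phoneme and add the alternative as a second option.
            choices.append((phoneme, OPTIONAL_RULES[phoneme]))
        else:
            choices.append((phoneme,))
    # The Cartesian product enumerates all combinations of optional choices.
    return [list(variant) for variant in product(*choices)]
```

Under these assumptions, `expand_variants(["b", "ah", "t", "er"])` yields the base form plus one flapped variant, while a word with no matching context comes back unchanged.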
The lexicon contains three tab-separated fields: (1) word: the orthographic representation of the word; (2) pron: the transcribed citation-form pronunciations, using the modified ARPABET phoneme set; and (3) comments: an optional comment on the entry.
The lexicon is presented as a tab-delimited (TSV) file encoded in UTF-8. This release also includes a pronunciation dictionary derived from the lexicon, in CMUdict format and likewise encoded in UTF-8.
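A minimal sketch of reading the three-field TSV layout in Python follows. It assumes the file has no header row and that a word may appear on more than one line. Neither point is stated in this description, and the function name `load_lexicon` is hypothetical.

```python
import csv
from collections import defaultdict


def load_lexicon(path):
    """Map each word to a list of (pronunciation, comment) pairs.

    Assumes one entry per line with tab-separated fields:
    word, pron, and an optional comment.
    """
    entries = defaultdict(list)
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps apostrophes and quote marks in words literal.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            word, pron = row[0], row[1]
            comment = row[2] if len(row) > 2 else ""
            entries[word].append((pron, comment))
    return dict(entries)
```

Keeping the pronunciation as an unparsed string avoids guessing how phoneme symbols are delimited within the `pron` field. Check the corpus documentation before splitting it further.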
Corresponding transcripts (LDC97T14) and the telephone speech dataset (LDC97S42) are available separately.
Samples
Please view this sample page.
Updates
There are no updates at this time.