Names Corpus
Release 1.3
Center for Spoken Language Understanding
UPDATED: 23 August 2002

Overview
--------

A common problem in training and developing speech recognition systems
is scarcity of data, especially for particular phonemic contexts. The
Center for Spoken Language Understanding (CSLU) is attempting to address
this problem with the Names Corpus. The Names Corpus is a collection of
name utterances, both first and last names, from several thousand
different speakers over the telephone. Name utterances are "spontaneous"
in that the subject is not reading from a word list.

Another area of active research is the development of name recognition
systems. The Names Corpus is a useful resource for addressing this
problem.

The utterances in this corpus were taken from many other telephone
speech data collections that have been completed at the CSLU. In most
data collections, the callers were asked to leave their name at some
point. Also, the callers would occasionally leave their name in the
midst of another utterance. The names in these situations were
extracted from the host utterance and added to the Names Corpus.

Each file in the Names Corpus has an orthographic transcription
following the CSLU Labeling Conventions. Also, to take advantage of the
phonemic variability, many of the utterances have been phonetically
transcribed. Files were selected for phonetic transcription by a process
that favored files suspected to contain phonetic contexts that had not
yet been transcribed.

Release 1.3 of this corpus contains 24245 files. All of these have been
phonetically labeled. Approximately 40% of the possible bigram phonemic
contexts, without regard to language constraints, are represented.

Description
-----------

There is a large variability in the spelling of English names. In the
case of common names, plausible spellings were intuitive.
However, for the rarer names, we transcribed using an orthography that
resembled the pronunciation as closely as possible. We have not
attempted to standardize the name spellings. Over the whole corpus there
are about 10570 unique names. Because no standard spellings are used,
names such as "kerri" and "kerry" are counted as two separate tokens.
The corpus consists of about 6.3 hours of speech.

The following table gives a count of the number of files for each
utterance type.

Type            Number
----------------------
firstname         9727
lastname         11431
other1             151
other2              29
other3               2
other4               1
whole             2659

Recording Conditions
--------------------

Each subject called the CSLU data collection system by dialing a
toll-free number. Depending on which data collection the caller was
calling, the call was recorded over an analog line or a digital line.

The analog telephone line was connected to a Gradient Technologies box.
Data from incoming calls were recorded by the Gradient box. The sampling
rate was 8 kHz and the files were stored in 16-bit linear format on a
UNIX file system. Each utterance was recorded as a separate file.

The digital data were collected with the CSLU T1 digital data collection
system. The sampling rate was 8 kHz and the files were stored in 8-bit
mu-law format on a UNIX file system.

Subject Population
------------------

Subjects whose utterances are included in this corpus are respondents to
Usenet postings, radio advertisements, newspaper advertisements, and
interoffice memos.

Annotation
----------

All of the files included in this corpus have corresponding
non-time-aligned word-level transcriptions that comply with the
conventions in the CSLU Labeling Guide. In addition, conventions in the
Labeling Guide.
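
The bigram coverage figure quoted in the Overview (contexts counted
"without regard to language constraints") can be read as the number of
distinct ordered phoneme pairs observed, divided by the square of the
inventory size. The following is a minimal sketch of that calculation
under this reading, not the procedure actually used at CSLU. The
function name, the four-symbol inventory, and the toy transcriptions are
hypothetical and are not CSLU label symbols or corpus data.

```python
def bigram_coverage(transcriptions, inventory):
    """Return the fraction of all ordered phoneme pairs seen at least once.

    transcriptions: iterable of phoneme-label sequences, one per utterance.
    inventory: iterable of phoneme labels. Every ordered pair drawn from it
    counts as "possible", i.e. no language constraints are applied.
    """
    inventory = set(inventory)
    possible = len(inventory) ** 2
    if possible == 0:
        return 0.0
    seen = set()
    for phones in transcriptions:
        # Adjacent labels within one utterance form one bigram context.
        for left, right in zip(phones, phones[1:]):
            if left in inventory and right in inventory:
                seen.add((left, right))
    return len(seen) / possible


# Hypothetical toy data, for illustration only.
inventory = ["k", "E", "r", "i"]
transcriptions = [
    ["k", "E", "r", "i"],
    ["k", "E", "r", "i"],   # a repeated utterance adds no new contexts
    ["r", "i", "k"],
]
coverage = bigram_coverage(transcriptions, inventory)
print(f"{coverage:.2%}")    # 4 distinct bigrams of 16 possible
```

Bigrams are collected only within an utterance, never across file
boundaries, which matches a corpus in which each utterance is stored as
a separate file.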