CSLU: Numbers Version 1.3, Linguistic Data Consortium (LDC) catalog number LDC2009S01 and ISBN 1-58563-501-4, was created by the Center for Spoken Language Understanding (CSLU) at OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, Oregon. It is a collection of naturally produced numbers taken from utterances in various CSLU telephone speech data collections. The corpus consists of approximately fifteen hours of speech and includes isolated digit strings, continuous digit strings, and ordinal/cardinal numbers.
The numbers come from several sources, among them phone numbers, street addresses and zip codes, and were uttered by 12618 speakers in a total of 23902 files. In most of CSLU's telephone data collections, callers were asked for their phone number, birthdate or zip code. Callers would also occasionally leave numbers in the midst of another utterance. In those cases the numbers were extracted from the host utterance and added to the corpus.
Additional information about this publication is available from the corpus web page at CSLU.
The speech data was collected over analog and digital telephone lines. The analog data was recorded using a Gradient Technologies analog-to-digital conversion box; those files were recorded at 8 kHz with 16-bit samples and stored in a linear format. The digital data was recorded with the CSLU T1 digital data collection system; those files were sampled at 8 kHz with 8-bit samples and stored as μ-law (ulaw) files. All of the data in this release is linearly encoded as 16-bit samples in the standard RIFF file format.
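The conversion from 8-bit μ-law to 16-bit linear samples described above can be sketched as follows. This is a minimal illustration that assumes the standard G.711 μ-law expansion and Python's standard `wave` module; it is not the tool CSLU used to prepare the release. The function names and the three-byte demo input are hypothetical.

```python
import io
import struct
import wave

SAMPLE_RATE_HZ = 8000  # sampling rate stated for both collection paths


def ulaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 mu-law code to a signed 16-bit linear sample."""
    u = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    exponent = (u >> 4) & 0x07  # 3-bit segment number
    mantissa = u & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def write_linear_wav(ulaw_bytes: bytes, dest) -> None:
    """Write mu-law bytes as a mono, 16-bit, 8 kHz RIFF/WAVE stream.

    dest may be a file path or a seekable binary file object.
    """
    samples = [ulaw_to_linear(b) for b in ulaw_bytes]
    with wave.open(dest, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)               # 2 bytes = 16-bit samples
        out.setframerate(SAMPLE_RATE_HZ)
        out.writeframes(struct.pack("<%dh" % len(samples), *samples))


if __name__ == "__main__":
    # Hypothetical three-byte input, used only to exercise the functions.
    buffer = io.BytesIO()
    write_linear_wav(bytes([0xFF, 0x80, 0x00]), buffer)
    print(len(buffer.getvalue()), "bytes written")
```

Expanding to 16-bit linear samples doubles the storage per sample but gives every file in the release the same encoding, whichever telephone line type it was recorded on.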
Each file includes an orthographic transcription following the CSLU Labeling guidelines, which are included in the documentation for this publication. Many of the utterances have also been phonetically labeled.
Statistics: CSLU: Numbers Version 1.3 consists of approximately fifteen hours of speech. The following table gives a count of the number of files for each utterance type.
| Type | Number of files |
| phone | 2970 |
| street | 7079 |
| zipcode | 7076 |
| other | 6771 |
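As a rough worked example, the counts in the table above can be tallied to get each utterance type's share of the files and an approximate mean length per file. The sketch below uses only figures quoted in this description; the variable and function names are illustrative. Note that the four counts sum to 23896, slightly below the 23902 files quoted earlier, and the description does not account for the difference.

```python
# File counts per utterance type, copied from the table above.
FILE_COUNTS = {"phone": 2970, "street": 7079, "zipcode": 7076, "other": 6771}

STATED_FILES = 23902   # total file count quoted in the corpus description
APPROX_HOURS = 15      # "approximately fifteen hours" per the description


def summarize(counts):
    """Return the table total and each type's share of it, in percent."""
    total = sum(counts.values())
    shares = {name: 100.0 * n / total for name, n in counts.items()}
    return total, shares


total, shares = summarize(FILE_COUNTS)

# Only an estimate: the duration figure itself is approximate.
mean_seconds = APPROX_HOURS * 3600 / STATED_FILES

for name, share in shares.items():
    print(f"{name:8s} {FILE_COUNTS[name]:6d} {share:5.1f}%")
print("table total:", total)
print(f"rough mean length per file: {mean_seconds:.2f} s")
```

The mean of roughly two seconds per file is consistent with a corpus of short number utterances, though it should be treated as an order-of-magnitude figure only.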
For an example of the data contained in this corpus, please examine the audio files and labels for the following spoken sequences
Portions © 1998, 2000, 2002 Center for Spoken Language Understanding, Oregon Health & Science University, © 2009 Trustees of the University of Pennsylvania