Alphadigit Corpus
Release 1.3
Center for Spoken Language Understanding
UPDATED: 23 August 2002

Overview
--------

This release includes recorded utterances from 3025 different callers
and a transcription of each utterance. There are a total of 78044
speech files. All of the files included in this corpus have
corresponding non-time-aligned word-level transcriptions and
time-aligned phoneme-level transcriptions (automatic forced
alignment) that comply with the conventions in the CSLU Labeling
Guide.

Recording Conditions
--------------------

Each subject called the CSLU data collection system by dialing a
toll-free number. The data were recorded directly off of a digital
phone line without digital-to-analog or analog-to-digital conversion
at the recording end. The digital data were collected with the CSLU
T1 digital data collection system described in "Digital Data
Collection at CSLU" (please see our web site). The sampling rate was
8 kHz, and the files were stored in 8-bit mu-law format on a UNIX
file system. These files have been converted to the RIFF standard
file format, which is 16-bit linearly encoded.

Subject Population
------------------

Subjects whose utterances are included in this corpus are
respondents to Usenet postings. Respondents were required to fill
out a form on the World Wide Web and register for the data
collection. In response to their registration, a list of letters and
digits was emailed to them along with instructions on how to
participate.

File Naming Conventions
-----------------------

Each utterance is stored in an individual file whose name indicates
the corpus, the speaker, and the prompt. For example:

    AD-1.p22.wav

The first field ("AD") is the prefix indicating the corpus to which
this data belongs, the second field ("1") is a unique ID number for
the speaker, and the third field ("p22") indicates the prompt to
which the speaker was responding.
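The naming convention above can be split mechanically. The following is a minimal sketch in Python; the regular expression and field names are our own illustration, not part of the corpus documentation:

```python
import re

# Pattern for names like "AD-1.p22.wav": corpus prefix, speaker ID,
# prompt ID. The group names are illustrative, not official.
FILENAME_RE = re.compile(
    r"^(?P<corpus>[A-Z]+)-(?P<speaker>\d+)\.(?P<prompt>p\d+)\.wav$"
)

def parse_utterance_name(name):
    """Split a corpus filename into its corpus/speaker/prompt fields."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError("not a corpus filename: %r" % name)
    return m.groupdict()

print(parse_utterance_name("AD-1.p22.wav"))
# {'corpus': 'AD', 'speaker': '1', 'prompt': 'p22'}
```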
Protocol
--------

Each participant was given a list of six-character strings of digits
and letters to read over the phone ("a 2 b 4 5 g", for example). The
participants called the system and were prompted for each string.
1102 different strings were used throughout the course of the data
collection; see docs/lists.txt for the complete lists.

The lists were set up to balance for phonetic context between all
letter and digit pairs. Many of the letters and digits share
"phonetic context" on the left or right side. For example, "p" and
"3" both end in an "ee" sound, so they share the right context.
There were fourteen groups sharing right context and nineteen groups
sharing left context.

Shared right context:

     0  a, j, k
     1  b, c, d, e, g, p, t, v, z, 3
     2  f
     3  h
     4  i, y
     5  l
     6  m
     7  n, 1, 7, 9
     8  0, o
     9  q, u, w, 2
    10  r, 4
    11  x, 6, s
    12  5
    13  8

Shared left context:

     0  a, h, 8
     1  b
     2  c, 6, 7
     3  d, w
     4  e
     5  n, f, l, m, x, s
     6  g, j
     7  i, r
     8  k, q
     9  o
    10  p
    11  2, t
    12  u
    13  v
    14  y, 1
    15  0, z
    16  3
    17  4, 5
    18  9

After the context groups had been established, a list of strings was
chosen that provided even coverage of all the phone-context pairs
and a reasonably balanced number of each token. This long list of
strings was split into several smaller lists of 18-29 strings, and
these small lists were sent to participants as they registered.
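The context groups above lend themselves to a simple lookup. The following sketch (in Python, with group numbers copied from the tables in this document; the function names are our own) shows how one might test whether two tokens share a context:

```python
# The fourteen right-context and nineteen left-context groups from
# the tables above, in order; tokens within a group share a context.
RIGHT_CONTEXT_GROUPS = [
    "a j k", "b c d e g p t v z 3", "f", "h", "i y", "l", "m",
    "n 1 7 9", "0 o", "q u w 2", "r 4", "x 6 s", "5", "8",
]
LEFT_CONTEXT_GROUPS = [
    "a h 8", "b", "c 6 7", "d w", "e", "n f l m x s", "g j",
    "i r", "k q", "o", "p", "2 t", "u", "v", "y 1", "0 z",
    "3", "4 5", "9",
]

def group_of(token, groups):
    """Return the index of the group containing token, or None."""
    for i, members in enumerate(groups):
        if token in members.split():
            return i
    return None

def share_right_context(a, b):
    """True if two tokens fall in the same right-context group."""
    return group_of(a, RIGHT_CONTEXT_GROUPS) == group_of(b, RIGHT_CONTEXT_GROUPS)

# "p" and "3" both end in an "ee" sound, so they share right context:
print(share_right_context("p", "3"))   # True
print(share_right_context("p", "f"))   # False
```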