National Cellular Corpus
Release 2.3
Center for Spoken Language Understanding

UPDATED: 22 September 2002

This document describes the file naming conventions used for this distribution and gives a brief description of the various file formats used.

File Name Conventions
---------------------

A call is composed of the series of files recorded during a single recording session. Every call is identified by a unique call number, and each file in the call is further identified by an utterance type. The filename identifies the call number and the utterance type.

    NC000041.WAV

The first two capital letters, "NC", indicate the corpus, National Cellular. The next 5 digits are the call number. The last character before the extension indicates the utterance type. In the example above, the call number is 00004 and the utterance type is 1 (yes or no). The utterance types are shown in this list:

    A  background noise
    B  brand
    C  date
    D  date of birth
    E  digital or analog
    F  familiar license plate number
    G  familiar phone number
    H  where did you grow up
    I  handset or microphone (not in vehicle)
    J  last name
    K  location
    L  male or female
    M  native language
    N  phone2
    O  spell last name
    P  story1
    Q  story2
    R  story3
    S  story4
    T  story5
    U  story6
    V  story7
    W  story8
    X  story9
    Y  thanks
    Z  time
    0  week
    1  yes or no
    2  describe your environment
    3  describe the traffic
    4  how fast are you going
    5  handset or microphone

The extension "WAV" indicates that this is a speech file.

File Formats
------------

In older versions of this corpus, the speech files were stored as NIST wav files. NIST files have a 1024-byte header followed by the data stored as 8-bit mulaw. There is no additional compression. Since release 2.1, the data has been stored in 16-bit linearly encoded Windows wav (RIFF) format.

The data.txt file in the /docs directory is a list of all of the text transcriptions. Each file transcription is on a separate line. The first value on the line, separated from the rest of the line by a single space, is a call number, utterance type, and transcription type triplet. This triplet uniquely defines each file.
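The line layout described for data.txt can be read with a very small helper. The sketch below is only illustrative: it assumes nothing beyond what is stated above (an identifier, one space, then the transcription) and treats the identifier as an opaque string, because this document does not specify how the triplet is written. The function name and the sample line are hypothetical.

```python
def split_transcription_line(line):
    """Split one data.txt line into (identifier, transcription).

    The identifier is everything before the first space. Its internal
    layout (call number, utterance type, transcription type) is not
    specified in this document, so it is returned unparsed.
    """
    identifier, _, transcription = line.rstrip("\r\n").partition(" ")
    return identifier, transcription


if __name__ == "__main__":
    # Hypothetical line, made up for illustration; real identifiers will differ.
    sample = "EXAMPLE-ID yes it is\n"
    print(split_transcription_line(sample))
```

Splitting only on the first space keeps any spaces inside the transcription intact.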
FOR THIS VERSION OF THE CORPUS, THE CONTENTS OF THE DATA.TXT FILE HAVE BEEN EXTRACTED INDIVIDUALLY INTO THE /TRANS DIRECTORY. The /trans directory file structure exactly parallels the structure of the /speech directory. Each file in the /trans directory is in .txt format and contains a single line, in the data.txt format described in the previous paragraph, that identifies and transcribes the corresponding sound file.
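The naming convention and the parallel /speech and /trans layout can be combined in a short sketch. It decodes a speech filename into its call number and utterance type, then derives the path of the matching transcription file. Several details are assumptions rather than facts from this document: the function names, the example subdirectory "00", the exact spelling of the directory name "speech", and the lowercase ".txt" extension on transcription files.

```python
import re
from pathlib import PurePosixPath

# Utterance type codes, copied from the list in File Name Conventions.
UTTERANCE_TYPES = {
    "A": "background noise", "B": "brand", "C": "date",
    "D": "date of birth", "E": "digital or analog",
    "F": "familiar license plate number", "G": "familiar phone number",
    "H": "where did you grow up",
    "I": "handset or microphone (not in vehicle)",
    "J": "last name", "K": "location", "L": "male or female",
    "M": "native language", "N": "phone2", "O": "spell last name",
    "P": "story1", "Q": "story2", "R": "story3", "S": "story4",
    "T": "story5", "U": "story6", "V": "story7", "W": "story8",
    "X": "story9", "Y": "thanks", "Z": "time", "0": "week",
    "1": "yes or no", "2": "describe your environment",
    "3": "describe the traffic", "4": "how fast are you going",
    "5": "handset or microphone",
}

# "NC", five digits for the call number, one utterance type character.
_STEM_RE = re.compile(r"^NC(\d{5})([A-Z0-5])$")


def parse_speech_filename(name):
    """Return (call_number, type_code, type_description) for a speech file."""
    path = PurePosixPath(name)
    if path.suffix.upper() != ".WAV":
        raise ValueError(f"not a speech file: {name!r}")
    match = _STEM_RE.match(path.stem)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    call_number, code = match.groups()
    return call_number, code, UTTERANCE_TYPES[code]


def transcription_path(speech_path):
    """Map a path under speech/ to the parallel path under trans/.

    Assumes the directory component is spelled "speech" and that
    transcription files use a lowercase ".txt" extension.
    """
    path = PurePosixPath(speech_path)
    parts = ["trans" if part == "speech" else part for part in path.parts]
    return PurePosixPath(*parts).with_suffix(".txt")


if __name__ == "__main__":
    print(parse_speech_filename("NC000041.WAV"))
    # The subdirectory "00" is a made-up example.
    print(transcription_path("speech/00/NC000041.WAV"))
```

The call number is kept as a string so that leading zeros are preserved.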