National Cellular Corpus
Release 2.3
Center for Spoken Language Understanding
UPDATED: 22 September 2002


Directory Structure
-------------------

This document describes the directory structure of this release:

readme.txt   General information regarding the corpus.

docs/        The documentation directory. This directory contains
             further documentation for the National Cellular corpus.

labels/      Phonetic labeling directory. This directory contains
             time-aligned phoneme-level transcriptions (automatic
             forced alignment).

misc/        Miscellaneous directory, possibly containing software
             tools and scripts.

speech/      The speech directory contains the actual .wav files.
             There are many labeled subdirectories within the speech
             directory.

trans/       The transcriptions directory. This directory contains
             non-time-aligned word-level transcriptions for each of
             the speech files.

This corpus requires approximately 3.4 GB of disk space.

Visually, the directory structure looks something like this:

                         natcell
                            |
     -----------------------------------------------
     |          |        |        |        |       |
 readme.txt   /docs   /labels   /misc   /speech  /trans

The /speech directory contains the speech data. The files are divided
into sub-directories based on their call number. Files with call
number 0-9 are in sub-directory "0", files with call number 10-19 are
in sub-directory "1", etc.

The /trans directory contains the orthographic transcription of each
of the files. As with the speech files, the transcription files are
divided into sub-directories based on their call number. Files with
call number 0-9 are in sub-directory "0", files with call number 10-19
are in sub-directory "1", etc. (A file called data.txt, containing all
of the transcriptions, is located in the /docs directory.)


File Name Conventions
---------------------

A call is the series of files recorded during one recording session.
Every call is identified by a unique call number, and each file in the
call is further identified by an utterance type. The filename
identifies the call number and the utterance type.

    NC000041.WAV

The first two capital letters, "NC", indicate the corpus, National
Cellular. The next 5 digits are the call number. The last character
before the extension indicates the utterance type. In the example
above, the call number is 00004 and the utterance type is 1 (yes or
no). The utterance types are shown in this list:

    A   background noise
    B   brand
    C   date
    D   date of birth
    E   digital or analog
    F   familiar license plate number
    G   familiar phone number
    H   where did you grow up
    I   handset or microphone (not in vehicle)
    J   last name
    K   location
    L   male or female
    M   native language
    N   phone2
    O   spell last name
    P   story1
    Q   story2
    R   story3
    S   story4
    T   story5
    U   story6
    V   story7
    W   story8
    X   story9
    Y   thanks
    Z   time
    0   week
    1   yes or no
    2   describe your environment
    3   describe the traffic
    4   how fast are you going
    5   handset or microphone

The extension "WAV" indicates that this is a speech file.
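
The naming and sub-directory rules above can be sketched in a few lines
of Python. This is only an illustration, not a tool shipped with the
corpus: the function names (parse_name, subdirectory, speech_path) are
hypothetical, the rule "sub-directory = call number divided by 10,
rounded down" is inferred from the "0-9", "10-19", "etc." description,
and the exact case of file names on disk is assumed to match the
example NC000041.WAV.

```python
import re

# Utterance-type codes, copied from the list in this document.
UTTERANCE_TYPES = {
    "A": "background noise", "B": "brand", "C": "date",
    "D": "date of birth", "E": "digital or analog",
    "F": "familiar license plate number", "G": "familiar phone number",
    "H": "where did you grow up",
    "I": "handset or microphone (not in vehicle)",
    "J": "last name", "K": "location", "L": "male or female",
    "M": "native language", "N": "phone2", "O": "spell last name",
    "P": "story1", "Q": "story2", "R": "story3", "S": "story4",
    "T": "story5", "U": "story6", "V": "story7", "W": "story8",
    "X": "story9", "Y": "thanks", "Z": "time", "0": "week",
    "1": "yes or no", "2": "describe your environment",
    "3": "describe the traffic", "4": "how fast are you going",
    "5": "handset or microphone",
}

# "NC" + 5-digit call number + 1-character utterance type + ".WAV".
# Accepting lowercase names is an assumption made for convenience.
_NAME = re.compile(r"NC(\d{5})([A-Z0-9])\.WAV", re.IGNORECASE)


def parse_name(filename):
    """Return (call_number, type_code, type_description) for a file name."""
    match = _NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a National Cellular file name: {filename!r}")
    call_number = int(match.group(1))
    code = match.group(2).upper()
    # Codes not in the documented list are reported as "unknown".
    return call_number, code, UTTERANCE_TYPES.get(code, "unknown")


def subdirectory(call_number):
    """Calls 0-9 map to "0", 10-19 to "1"; assumed to continue as n // 10."""
    return str(call_number // 10)


def speech_path(filename):
    """Relative path of a speech file under the corpus root (assumed layout)."""
    call_number, _, _ = parse_name(filename)
    return f"speech/{subdirectory(call_number)}/{filename}"


print(parse_name("NC000041.WAV"))   # (4, '1', 'yes or no')
print(speech_path("NC000041.WAV"))  # speech/0/NC000041.WAV
```

The same subdirectory() rule should apply to the /trans directory, but
the transcription file extension is not stated in this document, so no
path helper is sketched for it.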