                      National Cellular Corpus
                            Release 2.3

              Center for Spoken Language Understanding


UPDATED: 22 September 2002


Directory Structure
-------------------
This document describes the directory structure of this
release and the conventions used to name its files.

The top level of the release contains the following:

  readme.txt	General information regarding the corpus.

  docs/		The documentation directory. This
		directory contains further documentation
		for the National Cellular corpus.

  labels/	Phonetic labeling directory. This directory
                contains time-aligned phoneme-level
                transcriptions produced by automatic
                forced alignment.

  misc/		Miscellaneous directory, possibly
		containing software tools and scripts.

  speech/	The speech directory. This directory
		contains the actual .wav files, divided
		into sub-directories by call number.

  trans/	The transcriptions directory. This
		directory contains non-time-aligned
		word-level transcriptions for each of
		the speech files.

This corpus requires approximately 3.4 GB of disk space.

Visually, the directory structure looks something like
this:

			  natcell
			     |
   --------------------------------------------------
   |           |        |        |        |         |
readme.txt   /docs   /labels   /misc   /speech   /trans

The /speech directory contains the speech data.  The files
are divided into sub-directories based on their call
number.  Files with call numbers 0-9 are in sub-directory
"0", files with call numbers 10-19 are in sub-directory
"1", etc.

The /trans directory contains the orthographic
transcription of each of the files.  As with the speech
files, the transcription files are divided into
sub-directories based on their call number.  Files with
call numbers 0-9 are in sub-directory "0", files with call
numbers 10-19 are in sub-directory "1", etc. (A file
called data.txt, containing all of the transcriptions, is
located in the /docs directory.)
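
The sub-directory rule above amounts to an integer division
of the call number by 10.  The following Python sketch
illustrates this.  The function names call_subdir and
speech_dir, and the root name "natcell" used in the example,
are assumptions made for illustration; they are not part of
the release.

```python
from pathlib import PurePosixPath


def call_subdir(call_number: int) -> str:
    """Return the sub-directory name for a call number.

    Follows the rule stated above: calls 0-9 map to "0",
    calls 10-19 map to "1", and so on.
    """
    if call_number < 0:
        raise ValueError("call number must be non-negative")
    return str(call_number // 10)


def speech_dir(root: str, call_number: int) -> PurePosixPath:
    """Build the speech sub-directory path under a corpus root.

    Hypothetical helper: the root location depends on where
    the corpus was installed.
    """
    return PurePosixPath(root) / "speech" / call_subdir(call_number)
```

For example, speech_dir("natcell", 4) gives natcell/speech/0.
The same call_subdir rule applies under /trans.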

File Name Conventions
---------------------
A call is composed of the series of files recorded during
each recording session. Every call is identified by a
unique call number, and each file in the call is further
identified by an utterance type.

The filename identifies the call number and the utterance
type.  For example:

     NC000041.WAV 

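The first two letters, "NC", indicate the corpus, National
Cellular.  The next 5 digits are the call number.  The last
character before the extension indicates the utterance type
and may be a letter or a digit.  In the example above, the
call number is 00004 and the utterance type is "1" (yes or
no).  The utterance types are shown in this list: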

  A  background noise
  B  brand
  C  date
  D  date of birth
  E  digital or analog
  F  familiar license plate number
  G  familiar phone number
  H  where did you grow up
  I  handset or microphone (not in vehicle)
  J  last name
  K  location
  L  male or female
  M  native language
  N  phone2
  O  spell last name
  P  story1
  Q  story2
  R  story3
  S  story4
  T  story5
  U  story6
  V  story7
  W  story8
  X  story9
  Y  thanks
  Z  time
  0  week
  1  yes or no
  2  describe your environment
  3  describe the traffic
  4  how fast are you going
  5  handset or microphone
 
The extension "WAV" indicates that this is a speech file.
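
As a minimal sketch of the naming convention, the Python
code below splits a file name into its call number and
utterance type.  The function name parse_filename and the
table name UTTERANCE_TYPES are assumptions made for
illustration.  Accepting lower-case names is also an
assumption, since the case of the names on disk may differ
from the example.

```python
import re

# Utterance-type codes, copied from the list above.
UTTERANCE_TYPES = {
    "A": "background noise", "B": "brand", "C": "date",
    "D": "date of birth", "E": "digital or analog",
    "F": "familiar license plate number",
    "G": "familiar phone number", "H": "where did you grow up",
    "I": "handset or microphone (not in vehicle)",
    "J": "last name", "K": "location", "L": "male or female",
    "M": "native language", "N": "phone2", "O": "spell last name",
    "P": "story1", "Q": "story2", "R": "story3", "S": "story4",
    "T": "story5", "U": "story6", "V": "story7", "W": "story8",
    "X": "story9", "Y": "thanks", "Z": "time", "0": "week",
    "1": "yes or no", "2": "describe your environment",
    "3": "describe the traffic", "4": "how fast are you going",
    "5": "handset or microphone",
}

# "NC", 5 digits of call number, 1 utterance-type character, ".WAV".
# Case-insensitive matching is an assumption, not stated in the release.
_NAME_RE = re.compile(r"^NC(\d{5})([A-Z0-9])\.WAV$", re.IGNORECASE)


def parse_filename(name: str) -> tuple[int, str, str]:
    """Return (call number, type code, type description) for a file name."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a National Cellular file name: {name!r}")
    code = match.group(2).upper()
    if code not in UTTERANCE_TYPES:
        raise ValueError(f"unknown utterance type {code!r} in {name!r}")
    return int(match.group(1)), code, UTTERANCE_TYPES[code]
```

For the example name, parse_filename("NC000041.WAV") returns
(4, "1", "yes or no").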