Kelsey Taussig, Project Leader

July 13, 1994

Project overview

Macrophone was an effort that produced a large corpus of telephone speech appropriate to the development of automatic voice-interactive telephone services. The corpus includes over 200,000 transcribed utterances from over 5000 speakers. All data was collected in 8-bit mulaw digital form directly from T1 telephone channels.

Summary of work

This project was divided into three phases: setup, collection, and file preparation.

Tasks for the setup phase included:

  1. Procure and configure requisite collection hardware and software.

    We purchased and installed the hardware required for data collection, which included a two-sided printer, two 2-GByte disks, an Exabyte drive, and Exabyte tapes. We also updated the collection software and the software that generated the prompt sheets.

  2. Finalize material design.

    The material design was finalized to the following specification:

    Of the 34 read utterances, we specified

    Of the 11 spontaneous responses, we specified

    Although comments were collected as part of Macrophone, they were not transcribed and are not part of the 200,000 delivered utterances.

    A separate document entitled Macrophone Materials contains a detailed description of the material and is included with this report.

  3. Print sheets for mailing

    In total, 22,000 unique sheets were printed for mailing. Twenty thousand (20,000) were printed originally, and then an additional 2000 were printed to compensate for a low response rate in the 18-28 year old category.

  4. Settle on a sample and schedule with the market research mailing firm.

    We received bids from Market Facts, NPD (formerly HTI), and NFO. We accepted NFO's bid of $24,000 for 20,000 mailings. We later made an additional mailing of 2000 targeted at 18-28 year olds at a cost of $3000.

    We specified the following respondent characteristics:

    Due to NFO's underestimation of the response rate of 10-18 year olds, we received many more calls from juveniles than desired. Those calls were selectively filtered out for a more desirable age distribution.

    Due to the higher household income specification, we received many fewer calls from 18-28 year olds than desired. In an effort to smooth the age distribution, we sent an additional mailing of 2000 to that age group.

Tasks for the collection phase included:
  1. Monitor and store 5,000 incoming calls.

    We collected 6700 calls from the original 20,000 mailings, at a 33% response rate.

    The additional mailing of 2000 to 18-28 year olds resulted in 310 calls, of which only 201 were in the target age group. We suspected that the 109 callers outside the targeted age group were parents calling in for their children. The response rate of 10% suggests that university postings or electronic bulletin boards might be better sources than panel houses for this particular age group.

  2. Hire and train temporary workers for file verification and transcription.

    We hired and trained 6 half-time transcribers. Their training instructions were distributed to Jack Godfrey, and are included at the end of the report.

  3. Transcribe demographic information from calls; write to headers and archive.

    Demographic information was transcribed for all of the collected calls.

    Demographic information included a gender decision made by the transcriber as well as responses to the following utterances:

    The sheet identifier (in the form of a 10-digit telephone number) was also transcribed at this time. Since each of the 22,000 sheets contained a unique set of read material, the sheet identifier was used to supply the default transcriptions for the read utterances for a particular sheet.

    SPHERE headers were written for all files. A typical file header is shown below:

    	birthday -s6 530317
    	speaking_mode -s4 read
    	caller_id -s8 11001248
    	non_native_speaker -s2 no
    	cordless_phone -s2 no
    	gender -s4 male
    	panel_number -s7 0137631
    	sheet_identifier_1 -s10 4454978797
    	recording_date -s6 930914
    	recording_time -s6 130719
    	database_id -s10 MACROPHONE
    	database_version -s3 1.0
    	microphone -s9 telephone
    	sample_rate -i 8000
    	sample_count -i 61504
    	channel_count -i 1
    	sample_n_bytes -i 1
    	sample_sig_bits -i 8
    	sample_byte_format -s6 mu-law
    	prompt_text -s26 Say the credit card number
    	transcription -s69 two three two four dash six six oh seven \
    		dash three three three three
    	response_category -s6 digits
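
    As an illustration of how the header fields shown above can be read, the following is a minimal Python sketch. It assumes only the layout visible in the sample (a field name, a type code of the form -sN or -i, and a value, with a trailing backslash marking a continued line). It works on the field lines as displayed in this report and does not attempt to read an actual SPHERE file from disk. The function name parse_header_fields is hypothetical.

```python
def parse_header_fields(text):
    """Parse 'name -TYPE value' lines, as displayed in the report, into a dict."""
    # Join continued lines: a trailing backslash means the value continues
    # on the next line (assumed here to be joined with a single space).
    logical_lines = []
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("\\"):
            pending += line[:-1].rstrip() + " "
            continue
        logical_lines.append(pending + line)
        pending = ""

    fields = {}
    for line in logical_lines:
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":
            fields[name] = int(value)      # integer field
        else:
            fields[name] = value           # -sN: string of declared length N
    return fields


# A subset of the header shown above.
SAMPLE = r"""
gender -s4 male
sample_rate -i 8000
sample_count -i 61504
prompt_text -s26 Say the credit card number
transcription -s69 two three two four dash six six oh seven \
    dash three three three three
response_category -s6 digits
"""

fields = parse_header_fields(SAMPLE)
# Utterance duration in seconds follows from sample_count / sample_rate.
duration = fields["sample_count"] / fields["sample_rate"]
print(fields["transcription"])
print(duration)
```

    With the continuation joined by a single space, the transcription string has the 69 characters declared by its -s69 type code, and the sample count and rate give an utterance of about 7.7 seconds.
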
  4. Prepare and package groups of utterance files for shipment.

    We delivered 204,160 utterances from 5005 callers. The transcription conventions were documented and delivered to the LDC and are included at the conclusion of this report.

    Utterances from an additional 292 callers were transcribed and later discarded due to either a low number of acceptable utterances or the age of the caller.

    In order to ensure the accuracy of the transcriptions, we used a two-step checking process. The first step was an automatic check that corrected spelling errors, typos, and other known problems. The second step was the manual verification of all utterances in categories where we expected the most transcription errors. A study of 28,000 utterances showed that slightly over 5% of all transcribed utterances contained transcription errors, of which 0.5% were spelling errors or typos. Since the verification task was not bid into the original contract, and we had neither the time nor the resources to verify all 200,000 utterances, we concentrated our verification effort on the utterance types that contained the most transcription errors.

    Utterance types and transcription error rates are listed below. A large share of the transcription errors for "names", "WSJ/TIMIT", and "place names" were for mispronunciations that the transcribers didn't catch.

    		type            % of utts with errors
    		----            ---------------------
    		names                    4.5
    		WSJ/TIMIT                2.7
    		place names              2.7
    		ATIS                     2.3
    		dates                    2.2
    		dollars                  1.8
    		panel ID                 1.8
    		personal                 1.7
    		numbers                  1.7
    		spelled words            1.5
    		credit card #            1.5
    		phone #                  1.1
    		city in state             .73
    		fractions                 .73
    		words                     .54
    		yes/no                    .44
    		time                     0
    Tables containing information about speakers and calls were created and distributed via ftp. A description of the tables follows.

    TABLE 1: Caller Demographics

    This gives the call number, the caller's sex, age, home state, income group, and education group. Sex and age were determined from the caller's responses to questions asked during data collection or, if those weren't available (or couldn't be determined), from panel house information. The home state, income group, and education group were all determined from panel house information, tracked with the panel ID number the callers gave. A significant number of callers did not give a valid panel ID number, so not all information could be listed for them. Question marks are used in any field where the information couldn't be determined from any source.

    Table entries are comma-separated lists:

    		Call #, Sex, Age, State, Income, Education
    Income Decoding table:
    		1 - Under 12,500
    		2 - 12,500 - 24,999
    		3 - 25,000 - 39,999
    		4 - 40,000 - 59,999
    		5 - 60,000 and Over
    Education Decoding table: (We list the higher of Female Head of Household's Education and Male Head of Household Education)
    	1 - Elementary: Less than 8 years
    	2 - Elementary: 8 years (graduate)
    	3 - High School: 1-3 years
    	4 - High School: 4 years (graduate)
    	5 - College: 1-3 years (attended college or Associate degree)
    	6 - College: (graduate)
    	7 - College: (postgraduate studies)
    	0 - No Answer
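
    The decoding tables above can be applied to a Table 1 entry as in the following Python sketch. The report does not show a sample Table 1 line, so the entry used here, including the call number and the "M" coding for sex, is purely hypothetical; only the field order, the '?' convention, and the income and education codes come from the description above.

```python
import csv

INCOME = {
    "1": "Under 12,500",
    "2": "12,500 - 24,999",
    "3": "25,000 - 39,999",
    "4": "40,000 - 59,999",
    "5": "60,000 and Over",
}

EDUCATION = {
    "1": "Elementary: Less than 8 years",
    "2": "Elementary: 8 years (graduate)",
    "3": "High School: 1-3 years",
    "4": "High School: 4 years (graduate)",
    "5": "College: 1-3 years (attended college or Associate degree)",
    "6": "College: (graduate)",
    "7": "College: (postgraduate studies)",
    "0": "No Answer",
}


def decode_caller(line):
    """Decode one 'Call #, Sex, Age, State, Income, Education' entry."""
    call, sex, age, state, income, education = (
        field.strip() for field in next(csv.reader([line]))
    )

    def known(value):
        # A question mark marks a field that could not be determined.
        return None if value == "?" else value

    return {
        "call": call,
        "sex": known(sex),
        "age": int(age) if known(age) else None,
        "state": known(state),
        "income": INCOME.get(income) if known(income) else None,
        "education": EDUCATION.get(education) if known(education) else None,
    }


# Hypothetical entry, for illustration only (not taken from the corpus).
record = decode_caller("99999999,M,40,VA,4,?")
print(record)
```
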
    TABLE 2: Profile of the Call

    This lists call number, date, time, incoming line number, whether it was on a cordless phone, and the number of good utterances from the call. The cordless phone field is filled in based on the caller's response to a question at the beginning of data collection. If the answer couldn't be determined, a '?' is listed.

    Table entries are comma-separated lists:

    	Call #, Date, Time, Line Number, Cordless?, # of Good Utts
    TABLE 3: Transcription and Utterance Profile

    This table lists the call number, the utterance number within the call, the utterance type, and transcription for the utterance (in quotes).

    Table entries are comma-separated lists:

    	Call #, Utt #, Utt type, Transcription
    	12000058,05,natural_number,"one thousand five hundred twenty"
    	12000058,07,time,"nine forty six a m eastern standard time"
    	12000058,08,date,"september twenty fifth nineteen forty six"
    	12000058,09,place,"richmond virginia"
    	12000058,10,digits,"zero five five five one six eight"
    	12000058,11,application_word,"divided by"
    	12000058,14,place,"des moines iowa"
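
    Because the transcription field is quoted, entries like those above can be read with a standard CSV parser. The following Python sketch parses the sample lines and tallies utterance types; the variable names are illustrative, and it assumes that the table file contains only data lines in the four-field form shown.

```python
import csv
from collections import Counter

SAMPLE = '''\
12000058,05,natural_number,"one thousand five hundred twenty"
12000058,07,time,"nine forty six a m eastern standard time"
12000058,08,date,"september twenty fifth nineteen forty six"
12000058,09,place,"richmond virginia"
12000058,10,digits,"zero five five five one six eight"
12000058,11,application_word,"divided by"
12000058,14,place,"des moines iowa"
'''

rows = []
for call, utt, utt_type, transcription in csv.reader(SAMPLE.splitlines()):
    rows.append(
        {
            "call": call,
            "utt": int(utt),          # utterance number within the call
            "type": utt_type,
            "transcription": transcription,
        }
    )

# Count how many utterances of each type are present in the sample.
type_counts = Counter(row["type"] for row in rows)
print(type_counts["place"])
```
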
    TABLE 4: Panel House Information

    This table contains the raw panel house information, and can be decoded according to information provided in the file "panlhous.doc". Sample entries are shown here:

    	0000125421290294120359115444211113115551242347   1178111831     \
    		  381           07
    	0000127420030295120457085555211107115551242252   097810880108832\
    	          362           07
    Each entry is 90 characters wide, and contains 50 distinct, fixed-width data fields. There are no explicit separators between fields; space characters represent (portions of) fields that have been left blank.

    TABLE 5: Summary Utterance Inventory

    This table was compiled by LDC from the Transcription and Utterance Profile (Table 3). Like the first two tables, it contains one entry per call, with the first field of each entry being the call number. The second field is a 44-character string which encodes the utterances that are present and absent for the call. Utterances that are present in the call are represented by an alphabetic character that indicates the response type for that utterance; missing utterances are represented by an underscore character. Following this string there are 16 numeric fields, representing the number of utterances present for each of the 15 response types, and the total number of utterances present in the call. All fields are separated by commas.

    The following sample lines from this table show a call in which all utterances are present (none are missing), and a call in which only 26 utterances are present (18 are missing):

    The following list shows, for each of the response types, the number of such responses that would be present in a complete call, and the letter used to represent each type in the second field of the table; the sequence of types in this list is identical to the sequence of numeric fields in the table:
    		6 w application_word
    		5 y yes/no
    		4 n natural_number
    		4 a dollar_amount
    		3 r name_at_address
    		3 o digits
    		3 p place
    		3 d date
    		3 T TIMIT
    		2 s spelled_word
    		2 t time
    		2 A ATIS
    		2 W WSJ
    		1 c name_at_agency
    		1 f fraction
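
    The relationship between the 44-character string and the 16 numeric fields can be sketched in Python as follows. The letters and per-call counts are taken from the list above; the example strings are hypothetical, since the actual order of letters within a call's string is not reproduced here, and the function name count_fields is an assumption made for illustration.

```python
from collections import Counter

# (letter, response type, number present in a complete call), in table order.
RESPONSE_TYPES = [
    ("w", "application_word", 6),
    ("y", "yes/no", 5),
    ("n", "natural_number", 4),
    ("a", "dollar_amount", 4),
    ("r", "name_at_address", 3),
    ("o", "digits", 3),
    ("p", "place", 3),
    ("d", "date", 3),
    ("T", "TIMIT", 3),
    ("s", "spelled_word", 2),
    ("t", "time", 2),
    ("A", "ATIS", 2),
    ("W", "WSJ", 2),
    ("c", "name_at_agency", 1),
    ("f", "fraction", 1),
]


def count_fields(presence):
    """Return the 15 per-type counts followed by the total, from the string."""
    if len(presence) != 44:
        raise ValueError("presence string must be 44 characters long")
    letters = {letter for letter, _, _ in RESPONSE_TYPES}
    unknown = set(presence) - letters - {"_"}
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    counts = Counter(presence)  # case-sensitive: 't' (time) differs from 'T' (TIMIT)
    per_type = [counts[letter] for letter, _, _ in RESPONSE_TYPES]
    return per_type + [sum(per_type)]


# Hypothetical strings: letters grouped by type, not in real utterance order.
complete = "".join(letter * n for letter, _, n in RESPONSE_TYPES)
partial = "_" * 6 + complete[6:]  # all six application words missing

print(count_fields(complete))
print(count_fields(partial))
```

    The per-call counts in the list sum to 44, which matches the length of the string for a complete call.
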
    For convenience, an additional file has been provided, called "uttsumry.hdr", that consists only of three lines containing properly aligned headings for each column of "uttsumry.tbl". To provide labels for each column of the table, simply prepend this ".hdr" file to the ".tbl" file.