Data transcription - USC MARKETPLACE

All pertinent USC MARKETPLACE speech files were transcribed using the general conventions described below. USC MARKETPLACE was originally transcribed by FDCH (Federal Document Clearing House Inc.). The transcripts were then enhanced for the linguistic research community by the LDC.

The LDC enhancement of the USC MARKETPLACE transcription was carried out on Sun workstations. The transcription was done using the emacs text editor, which was linked to the visual and auditory soundwave from the speech recording in an xwaves window. A program written at the LDC linked the xwaves signal to the emacs buffer so that a highlighted region of the soundwave could be brought into the emacs buffer as a timestamp via a simple keystroke. Similarly, it was possible to listen to any timemarked turn in the transcript, and view the aligned soundwave as well. Thus, a visual as well as auditory signal was transcribed.

The transcription conventions provided below serve as a guideline as to how the USC MARKETPLACE speech files were transcribed.

USC MARKETPLACE TRANSCRIPTION CONVENTIONS - General

What to transcribe:

    All speech files

Definition of turns:

    Separate turns are defined by the following criteria:

    (1) speaker change
    (2) within one speaker's stretch of speech, a long turn should be broken up in terms of what makes grammatical/semantic sense.
    (3) If there is an extra-long pause within a single speaker's turn, break the turn up into two turns.

Timestamps:

    Each speaker turn is marked with a unique timestamp (in seconds). The timestamps mark the beginning of each turn relative to the beginning of the recording. Each timestamp is precise to the 100th of a second. In USC MARKETPLACE, timestamps have the following formats.

    (1) : filler, transcribed.

        <> {breath} It's Friday, June seventh.

    (2) : not transcribed (typically commercials)

        <> {breath} It's Friday, May seventeenth.
    (3) : story

        <> {breath} There's a report that China is shopping for laid-off American aerospace engineers to help Beijing improve its military aircraft engines.

    (4) : change in speaker within a story

        <> You have a situation in which the US market is essentially completely open to Chinese goods.

    (5) : same speaker, broken up for semantic reasons

        The Chinese market is only open on a restricted basis to approved US goods and investments.

Special Conventions:

    Acronyms

        Acronyms pronounced like a word are written in all caps with no spaces, e.g.

            AIDS
            NARAL

        Acronyms pronounced like the individual letters are written in all caps with spaces between the letters:

            C I A
            H I V
            C E O

    Numbers

        All numbers are written out:

            twenty two
            nineteen ninety-five

    Interjections

        The most standard spelling is used:

            uh-huh
            mhm
            uh-oh
            okay
            jeez

    Punctuation

        Due to the nature of the USC MARKETPLACE speech files, punctuation was not added to the transcription.

Special symbols:

    Noises, conversational phenomena, foreign words, etc. are marked with special symbols. In the table below, "text" represents any word or descriptive phrase.

    {text}      sound made by the talker
                {laugh} {scream} {breath}

    ((text))    unintelligible; text is best guess at transcription
                ((airfield))

    (( ))       unintelligible; can't even guess text
                (( ))

    -text       partial word
    text-       -tion absolu-

    %text       This symbol flags non-lexemes, which are general hesitation sounds, e.g.
                %mm %uh

    &text       used to mark proper names and place names
                &John &Nassau &Space &Center

    [[NS]]      indicates non-transcribed elements (e.g. music) longer than three seconds within a turn

    < >         used to mark untranscribed speech in a foreign language; the language is indicated within the < > brackets

7. Quality control (QC) procedures

The transcripts were created in an iterative manner.
The first step was to transcribe and timestamp the appropriate portion of each conversation. Once this was completed, formatting and spelling were checked and corrected. Then a second pass was made over all of the transcripts, in which both content and formatting were checked once more. Throughout this process, small improvements were constantly made and re-checked for accuracy.

Syntax:

    To check the well-formedness of the bracketing, a program was written which goes over the transcripts and notes any apparent irregularities. This program was later adapted for on-line use by the transcribers while creating the transcripts.

Timestamps:

    To check the well-formedness of timestamps, a program was developed that checked for

    (1) overlapping timestamps,
    (2) start times that are greater than end times,
    (3) turns that are missing timestamps,
    (4) the proper formatting of a blank line before each timestamp,
    (5) the proper number of digits in each timestamp, and
    (6) the proper marking of the speaker id.

    This procedure was folded into the syntax checking procedure to be used on-line by the transcribers.

Content:

    To check that the properly spelled and formatted transcription actually matched the spoken signal, a second human pass was made over all of the transcripts. In many instances, three or more passes were made.
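
The syntax and timestamp checks described above could look roughly like the following Python sketch. It is an illustration only, not the program written at the LDC. The function names check_brackets and check_turn_times are hypothetical. The delimiter pairs come from the special-symbols table in this document. Because the literal timestamp syntax is not shown in this text, the sketch assumes each turn has already been parsed into a (start, end) pair in seconds, and it covers only timestamp checks (1) and (2).

```python
from typing import List, Tuple

# Delimiter pairs from the special-symbols table: ((text)), [[NS]], {text}, < >.
# Two-character delimiters are listed first so they are tried before any others.
PAIRS = [("((", "))"), ("[[", "]]"), ("{", "}"), ("<", ">")]


def check_brackets(line: str) -> List[str]:
    """Return a list of bracketing problems in one transcript line (empty if none)."""
    problems = []
    stack = []  # (expected closing delimiter, column of the opener)
    i = 0
    while i < len(line):
        for opener, closer in PAIRS:
            if line.startswith(opener, i):
                stack.append((closer, i))
                i += len(opener)
                break
            if line.startswith(closer, i):
                if stack and stack[-1][0] == closer:
                    stack.pop()
                else:
                    problems.append(f"unexpected '{closer}' at column {i}")
                i += len(closer)
                break
        else:
            i += 1  # ordinary character, not part of any delimiter
    for closer, col in stack:
        problems.append(f"missing '{closer}' for opener at column {col}")
    return problems


def check_turn_times(turns: List[Tuple[float, float]]) -> List[str]:
    """Flag start > end and overlap with the previous turn.

    Assumes (hypothetically) that turns are (start, end) pairs in seconds,
    listed in transcript order.
    """
    problems = []
    prev_end = None
    for n, (start, end) in enumerate(turns, 1):
        if start > end:
            problems.append(f"turn {n}: start {start:.2f} is after end {end:.2f}")
        if prev_end is not None and start < prev_end:
            problems.append(f"turn {n}: overlaps previous turn ending at {prev_end:.2f}")
        prev_end = end
    return problems


if __name__ == "__main__":
    print(check_brackets("{breath} It's Friday, June seventh."))
    print(check_brackets("((airfield"))
    print(check_turn_times([(0.0, 3.0), (2.0, 1.0)]))
```

Both functions return a list of messages rather than raising an error, so a transcriber could run them over a whole file and review every irregularity in one pass, in line with the "notes any apparent irregularities" behaviour described above.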