-------------------------------------------------------------
Description of the HUB-4 1997 Broadcast News Corpus,
CSR-VI Transcription Conventions
-------------------------------------------------------------
February, 1998

Project Leader:  Jennifer Alabiso

Programming:     David Graff
                 Robert McIntyre
                 Zhibiao Wu

Personnel:       Jennifer Alabiso
                 Nii Martey
                 Kara Rennert

Transcribers:    Stephanie Strassel
                 Chris DeVita
                 Ken Luguya
                 Bianca Torrez
                 Jon Cole
                 James Siegle
                 Larry Kowerski
                 Marcy Bruce

CONTENTS

  0. Introduction
  1. What to transcribe
  2. Information Organization
  3. Timestamps
  4. Orthography
  5. Punctuation
  6. Symbols
  7. Noises
  8. Other Conventions

---------------------------------------------------------------------
0) Introduction

This file describes the conventions employed by transcribers at the
LDC during the creation of the 1997 Broadcast News transcripts. The
following sections are structured in the form of instructions to the
transcribers, covering the issues that arise in the transcription
task.

PLEASE NOTE: the various examples of transcription practice that are
provided below represent a format for transcript files that has been
used internally by the LDC; the actual released version of the
transcript is noticeably different, consisting of a fully defined SGML
document structure. This released SGML format is documented
separately, comes with a DTD file, and is derived automatically from
the internal working file format that is shown here.

---------------------------------------------------------------------
1) What to transcribe?

The goal is to transcribe the entire news broadcast. You should first
divide the broadcast into sections. Of these sections, transcribe only
those that are reports, "sr" (section=report), including weather, or
filler material, "sf" (section=filler).
"Reports" are defined as story-specific news items.

"Filler" is defined as upcoming news items, introductory reporter
"chit chat".

**Items which should not be transcribed:

Commercials; material repeated between broadcasts; and anything too
"difficult" to understand. Generally, if it is necessary to listen to
a passage more than 4 times in order to understand anything, it is
probably too difficult to transcribe. The same applies to speech that
is obscured by heavy distortion or overwhelming background noise.

If any portion of the broadcast is skipped, you should provide a
time-stamp of the skipped speech portion (even if it is a minute
long). Use the notation "sn" (section non-transcribed) to designate
sections that fall into the categories above.

Furthermore, if the material is marked as "sn" because it is a repeat
of material found elsewhere in the transcripts, add the notation
[[repeat]] after the "sn". If you happen to know the other source for
the repeated material, include that information (file id, and
timestamp(s) if you know them) after the [[repeat]]:

    [[repeat]]
    [[repeat sv970613d at time 708.388 to 840.328]]

For the sections marked as "sn", you should not provide any
transcription.

If you have any questions about this, please consult your language
leader.

------------------------------------------------------------------------
2) Information organization

The hierarchy of a transcript has two levels:

    Section
        Turn

Broadcast speech:
  - Divided into sections
  - Sections are subdivided into turns (defined by speaker change)
  - Both section and turn boundaries coincide with a beginning and
    end breakpoint (timestamp)

Sections:
--------
Definition of section: In broadcast speech transcripts, there are
multiple sections, each of which corresponds to one of the following
three types:

    Report
    Filler    (for example, program introduction, chit-chat)
    Nontrans  (for example, commercials, long segments of pure music)

Turns:
-----
A new turn begins at each change of speaker.
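
As an illustration only, the two-level hierarchy described above could
be modeled as follows. This is a minimal sketch and not part of the LDC
tooling: the class names, field names, and the check_section helper are
hypothetical, and the only things taken from this document are the
"sr"/"sf"/"sn" section codes, the rule that non-transcribed sections
carry no transcription, and the requirement that sections and turns are
bounded by breakpoints.

```python
from dataclasses import dataclass, field
from typing import List

# Section codes as used in this document; the mapping itself is an
# illustrative assumption, not an LDC data structure.
SECTION_TYPES = {"sr": "report", "sf": "filler", "sn": "nontrans"}


@dataclass
class Turn:
    speaker: str   # e.g. "reporter_1" (hypothetical label format)
    start: float   # breakpoint, in seconds, where the turn begins
    end: float     # breakpoint, in seconds, where the turn ends
    text: str = ""


@dataclass
class Section:
    kind: str      # one of the keys of SECTION_TYPES
    start: float
    end: float
    turns: List[Turn] = field(default_factory=list)


def check_section(section: Section) -> List[str]:
    """Return a list of problems found in one section (empty if none)."""
    problems = []
    if section.kind not in SECTION_TYPES:
        problems.append(f"unknown section type: {section.kind}")
    if section.start >= section.end:
        problems.append("section start must precede section end")
    # Sections marked "sn" must not contain any transcription.
    if section.kind == "sn" and any(t.text.strip() for t in section.turns):
        problems.append("non-transcribed section contains transcription")
    # Every turn must lie between the section's two breakpoints.
    for turn in section.turns:
        if turn.start < section.start or turn.end > section.end:
            problems.append(f"turn by {turn.speaker} falls outside the section")
    return problems
```

A caller would build one Section per stretch of the broadcast and treat
any non-empty result from check_section as something to review by hand.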
--------------------------------------------------------------------------
3) Timestamps (or Breakpoints)

Breakpoints are places where the transcriber has inserted a timestamp
to delineate a portion of speech for ease of transcription. From the
point of view of the transcriber, the broadcasts are segmented by a
series of breakpoints, some of which mark turn boundaries, others of
which occur within a turn.

Breakpoints can be inserted wherever they seem convenient to the
transcriber. They should occur at the natural boundaries of speech,
such as pauses, breaths, etc. They should never occur in the middle of
a word, even in cases of overlapped speech.

A special subclass of breakpoints marks the beginning and end points
of overlapped speech; that is, periods of the recording where multiple
speakers are talking at once. One notation marks the beginning of the
overlapping-speech section and another marks the end. The end notation
should also include the number id of the speaker who ceased speaking
first. If both speakers stop at the same time, the proper notation
should be the next turn or section start.

------------------------------------------------------------------------
4) Orthography

We are following the general orthographic conventions (spelling) for
English. Words that usually take capital letters should be written
with capital letters; otherwise lowercase should be used. In addition,
we have a set of clearly defined symbols that should be used with
items such as proper names, acronyms, mispronounced words, and
non-lexemes (see below).

Capitalization: capitalization in our transcripts is used as an aid
for human comprehension of the text. You should follow the accepted
standard way to capitalize words, including words at the beginning of
a sentence, proper names, and so on.

    He took the car on Saturday.
    Jane was walking along Walnut Street when I met her.

Numerals: write out all numerals.
Only hyphenate numbers between twenty-one and ninety-nine:

    twenty-two
    nineteen ninety-five
    seven thousand two hundred seventy-five
    nineteen oh nine

Abbreviations: When abbreviations are used as part of a title, they
can remain as abbreviations:

    Mr. Brown
    Mrs. Jones
    Dr. Spock

However, when they are not used in this fashion, write them out in
full.

    I went to the junior league game.
    I'm going home to see the missus.
    I went to the doctor, and all he said was, don't worry, it's
    natural.
    Hey mister, please stop hitting me.

--------------------------------------------------------------------------
5) Punctuation

The following punctuation marks should be used in the transcripts.
The punctuation marks are primarily for ease of (human) reading. Use
only those punctuation marks indicated below.

  - periods "." should be added at the end of declarative sentences
  - question marks "?" should be added at the end of interrogative
    sentences
  - commas "," should be added between clauses

-------------------------------------------------------------------------
6) Symbols

Acronyms I: those that are pronounced as a single word should be
written in caps (no spaces) and preceded by a "@" symbol:

    @NATO
    @DARPA
    @AIDS

Acronyms II: acronyms that are normally written as a single word but
pronounced as a sequence of individual letters should be written in
all caps (no spaces) and preceded by a "~" symbol:

    ~FBI
    ~CEO
    ~YMCA

Individual letters: Individual letters that are pronounced as such
should be written in caps and preceded by a "~" symbol:

    I got an ~A on the test.
    his name is spelled ~S ~I ~M ~P ~S ~O ~N.

Proper names: both proper names and place names should be marked with
a "^" symbol. If you encounter a "proper name phrase", mark only those
words as proper names that are true proper names on their own.
Personal initials are treated as individual letters in our
transcripts.
Initials should be written in capital letters, be preceded by the "~"
symbol, and must not have a period after them unless this marks the
end of a sentence. If the spelling of the name is uncertain, use a
double caret (^^) to indicate this, and the spelling can be further
researched during the second pass.

    ^Homer ~L ^Simpson
    ^Beijing
    ^Sony
    ^Maria's Bar and Grill
    he calls himself ~J ~R ^Jones
    ^^Rafjanii ^Agrawal

Partial words: partial words are indicated with a dash (without any
spacing between the dash and the word):

    absolu-
    -tion

Mispronounced words: if a word is mispronounced (such as a slip of the
tongue), provide the correct spelling of the word, and place a "+"
symbol in front of the word:

    +probably
    +yesterday

Interjections: in each language, we have a set of standardized
spellings for interjections.

English interjections:

    mhm  uh-huh  uh-oh  okay  whoa  whew  yeah  jeeze

Non-lexemes: in addition to the interjections (which are considered to
be words), we also have a set of standardized spellings for hesitation
sounds that speakers make while speaking in each language. Every such
"non word" in the transcripts is marked with the "%" symbol.

English non-lexemes:

    %ach  %ah  %eee  %eh  %ew  %ha  %hee  %huh  %hm  %um  %uh  %oh

Idiosyncratic words: if a speaker uses a "made-up" word which is not
used by other speakers (although it may be understandable), place a
"*" symbol before the word. Consult your language leader in cases
where you are uncertain whether a word fits in this category.
Onomatopoeia fits into this category:

    *poodle-ish
    Do you dress like a *schlump yet?
    why she said *drr I don't know

---------------------------------------------------------------------
7) Noises

In order to account for sound phenomena such as distortion, coughs,
breaths, unintelligible speech, foreign words and phrases, etc., we
utilize a set of unique brackets.

{text}: sound made by the talker.
Use only those sounds described below:

    {laugh}
    {cough}
    {sneeze}
    {breath}
    {lipsmack}

----------------------------------------------------------------------
8) Other conventions

((text)): unintelligible speech. This is the transcriber's best guess.

    ((wonderful))
    Well, I ((thought)) that it was fine.
    And then she told me that I should ((just leave)).

(( )): unintelligible speech that you cannot even make a guess at
(with a single space between the parentheses). This should be isolated
from the rest of the text during second pass unless the occurrence is
for a very brief period of time.

    I went to the (( )) on my way over.

: this is used to indicate speech (one or more words) in another
language. In place of "language", write the name of the language, if
known. If the language is not known, treat the case the same as
unintelligible speech above, with (( )).

    And then I took all of the to my room.
    Oh, , he said.
    then there were a couple of (( )) which I tried on.

[[NS]]: non-transcribed area between breakpoints. (Or start of a
turn - see overlapping simultaneous speech.) Used when there is an
area within a turn that has no speech within it, for example a musical
interruption or extended background noise.

    The crowd was furious. [[NS]] Calm was soon restored by the
    arrival of the riot police.

Overlapping Speech: Overlapping speech is when a speaker is
interrupted by another speaker, at a roughly equal volume. Situations
in which a reporter is speaking over a political speech (recorded or
live) are not considered to be overlapping, unless the volume is very
high.

In situations where overlapping speech occurs, insert the breakpoint
at the beginning of the word in which the interruption started, in
other words, at the end of the last complete word.

i) In this situation, reporter1 is interrupted by speaker1. Reporter1
stops speaking, and speaker1 carries on.
    <> <> SPEAKER1: SPEAKER2: <>

ii) If the individual who interrupted is subsequently interrupted
themselves, after continuing to speak, indicate the overlap in the
same manner; they are now designated speaker1.

    <> <> SPEAKER1: SPEAKER2: <> SPEAKER1: SPEAKER2:

Simultaneous Overlapping Speech: When two speakers start to speak
simultaneously, create an initial turn to identify a speaker, and
insert the overlap.

    <> [[NS]] <> SPEAKER1: SPEAKER2:

Several speakers: In situations when you have several people speaking
at once, and it is very difficult to make them out, insert an <> <>

Speaker Identification: For broadcast speech the goal is to identify
speakers as precisely as possible. At the very least, each unique
speaker in a recording should have a unique identification. Further
information to be added includes speaker gender and proper name if
possible.

Gender: The possibilities are "male", "female", "child", "altered",
and "unison". "Unison" occurs in situations where two or more
individuals say the same thing at the same time.

Proper names: Whenever possible, include the proper name of the
speaker. Examples of proper names include Jacques_Cousteau,
William_Cohen, and Madeleine_Albright. [namesearch]

If a speaker is not identified within a recording, a unique numerical
index is to be used. For the convenience of transcribers, a broad
categorical identification can be used. The two categories currently
supported are Reporter and Speaker. Reporter refers to either the
anchor of the news broadcast, or the reporter on location giving the
story. Speaker, on the other hand, refers to anyone interviewed on
tape by the Reporter, when that person is not identified by name.

When identifying nameless speakers, keep in mind that the number
assigned to a voice is the crucial information, more so than the
category. Numbers must not overlap. Each successive anonymous speaker
should have a unique number, regardless of the category the speaker is
assigned to.
For example, the following sequence is entirely possible:

    reporter_1
    reporter_2
    spkr_3
    spkr_4
    spkr_5
    reporter_2  (assuming it is the same voice as the previous
                reporter_2)
    reporter_6  (a new reporter distinct from the two above)

Native, non-native, and altered: In English broadcasts, "native"
speakers are those who speak a standard North American dialect. These
are not marked. "Non-native" speakers are those with a foreign accent,
including British-English speakers. "Altered" is used to tag
deliberately altered voice patterns, for instance in the case of a
disguised informant's speech, or for machine generated speech.

Examples

    <> <> <> <> <>
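
The numbering rule for anonymous speakers described above lends itself
to a mechanical check. The following is a minimal sketch, not an LDC
tool: the function name check_speaker_labels and the regular expression
are hypothetical, and it assumes labels are written exactly as in the
example sequence (reporter_N or spkr_N). Named speakers such as
Jacques_Cousteau are ignored. It can only detect a number that is
reused under a different category; whether two labels really refer to
the same voice still has to be judged by the transcriber.

```python
import re
from typing import Dict, List

# Assumed label shape, based on the example sequence in this document:
# a category ("reporter" or "spkr"), an underscore, and a number.
LABEL = re.compile(r"^(reporter|spkr)_(\d+)$", re.IGNORECASE)


def check_speaker_labels(labels: List[str]) -> List[str]:
    """Report numbers that are shared by more than one category."""
    seen: Dict[int, str] = {}   # number -> category first seen with it
    problems = []
    for label in labels:
        match = LABEL.match(label)
        if match is None:
            continue  # named speaker or other label; not checked here
        category = match.group(1).lower()
        number = int(match.group(2))
        previous = seen.setdefault(number, category)
        if previous != category:
            problems.append(
                f"number {number} used for both {previous} and {category}"
            )
    return problems
```

Run over the turn labels of one recording, an empty result means no
number was shared between the Reporter and Speaker categories.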