-------------------------------------------------------------
	Description of the HUB-4 1997 Broadcast News Corpus, CSR-VI
			Transcription Conventions
	-------------------------------------------------------------

February, 1998

Project Leader:		Jennifer Alabiso

Programming:		David Graff
			Robert McIntyre
			Zhibiao Wu

Personnel:		Jennifer Alabiso
			Nii Martey
			Kara Rennert

Transcribers:		Stephanie Strassel
			Chris DeVita
			Ken Luguya
			Bianca Torrez
			Jon Cole
			James Siegle
			Larry Kowerski
			Marcy Bruce


CONTENTS
	0. Introduction
	1. What to transcribe
	2. Information Organization
	3. Timestamps
	4. Orthography
	5. Punctuation
	6. Symbols
	7. Noises
	8. Other Conventions
---------------------------------------------------------------------

0) Introduction

This file describes the conventions employed by transcribers at the
LDC during the creation of the 1997 Broadcast News transcripts.  The
following sections are structured in the form of instructions to the
transcribers, covering the issues that arise in the transcription
task.

PLEASE NOTE: the various examples of transcription practice that are
provided below represent a format for transcript files that has been
used internally by the LDC; the actual released version of the
transcript is noticeably different, consisting of a fully defined SGML
document structure.  This released SGML format is documented
separately, comes with a DTD file, and is derived automatically from
the internal working file format that is shown here.

---------------------------------------------------------------------

1) What to transcribe?


	The goal is to transcribe the entire news broadcast. You
should first divide the broadcast into sections. Of all of the
sections, you should transcribe only those that are reports, "sr"
(section=report), including weather, or filler material, "sf"
(section=filler).  "Reports" are defined as story-specific news items.
"Filler" is defined as upcoming news items and introductory reporter
"chit chat".

**Items which should not be transcribed:


	Commercials; material repeated between broadcasts; and
anything too "difficult" to understand. Generally, if it is necessary
to listen to a passage more than 4 times in order to understand
anything, it is probably too difficult to transcribe. The same applies
to speech that is obscured by heavy distortion or overwhelming
background noise.


	If any portion of the broadcast is skipped, you should
provide a time-stamp of the skipped speech portion (even if it is a
minute long). Use the notation "sn" (section non-transcribed) to
designate sections that fall into the categories above.

   <sn 323.08> 

	Furthermore, if the material is marked as "sn" because it is a
repeat of material found elsewhere in the transcripts, add the
notation [[repeat]] after the "sn". If you happen to know the other
source for the repeated material, include that information (file id,
and timestamp(s) if you know them) inside the brackets, after the word
"repeat":

		<sn 323.08> 
		[[repeat]] 
		<sn 156.997> 
		[[repeat sv970613d at time 708.388 to 840.328]] 

	For the sections marked as <sn> you should not provide any
transcription.
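
	As an illustration only, the <sn> and [[repeat]] notations above
could be read by a script along the following lines. This is a minimal
Python sketch, not the LDC's own tooling; the function name
parse_nontrans, the regular expressions, and the dictionary keys are
all assumptions made for the example.

```python
import re

# Hypothetical patterns for the working-file notation shown above.
SN_TAG = re.compile(r"<sn\s+(\d+(?:\.\d+)?)\s*>")
REPEAT_NOTE = re.compile(r"\[\[repeat(?:\s+(.*?))?\]\]")


def parse_nontrans(lines):
    """Collect <sn> sections and any [[repeat]] note that follows each one."""
    sections = []
    for raw in lines:
        line = raw.strip()
        tag = SN_TAG.fullmatch(line)
        if tag:
            sections.append({"start": float(tag.group(1)),
                             "repeat": False, "source": None})
            continue
        note = REPEAT_NOTE.fullmatch(line)
        if note and sections:
            sections[-1]["repeat"] = True
            # group(1) is None when the note names no source
            sections[-1]["source"] = note.group(1)
    return sections


example = [
    "<sn 323.08>",
    "[[repeat]]",
    "<sn 156.997>",
    "[[repeat sv970613d at time 708.388 to 840.328]]",
]
for section in parse_nontrans(example):
    print(section)
```

The source description is kept as free text, since the conventions do
not fix a stricter layout for it.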

	If you have any questions about this, please consult your
language leader.

------------------------------------------------------------------------

2) Information organization

The hierarchy of a transcript has two levels: 

Section 
Turn 

Broadcast speech:

- Divided into sections 
- Sections are subdivided into turns (defined by speaker change) 
- Both section and turn boundaries coincide with a beginning and end
  breakpoint (timestamp) 

	Sections:
	--------

Definition of section: 

In broadcast speech transcripts, there are multiple sections, each of which
corresponds to one of the following three types: 

	Report 
	Filler (for example, program introduction, chit-chat) 
	Nontrans (for example, commercials, long segments of pure music)      

	Turns:
	-----

A new turn begins whenever the speaker changes. 

--------------------------------------------------------------------------

3) Timestamps (or Breakpoints) 

	Breakpoints are places where the transcriber has inserted a
timestamp to delineate a portion of speech for the purposes of
ease-of-transcription. From the point of view of the transcriber, the
broadcasts are segmented into a series of breakpoints, some of which
mark turn boundaries, others of which occur within a turn.

	Breakpoints can be inserted wherever they seem convenient to
the transcriber. They should occur at the natural boundaries of
speech, such as pauses, breaths, etc. They should never occur in the
middle of a word, even in cases of overlapped speech.

	A special subclass of breakpoints marks the beginning and end
points of overlapped speech; that is, periods of the recording where
there are multiple speakers talking at once. Use the notation <o> to
mark the beginning of the overlapping-speech section and <e> to mark
the end. <e> should also include the number id of the speaker who
ceased speaking first (i.e., <e1> or <e2>). If both speakers stop at
the same time, no <e> is needed; the start of the next turn or section
closes the overlap.

------------------------------------------------------------------------

4) Orthography

	We are following the general orthographic conventions
(spelling) for English. Words that usually take capital letters should
be written with capital letters, otherwise lowercase should be used.

	In addition, we have a set of clearly defined symbols that
should be used with items such as proper names, acronyms,
mispronounced words, and non-lexemes (see below).

	Capitalization: capitalization in our transcripts is used as
an aid for human comprehension of the text. You should follow the
accepted standard way to capitalize words, including words at the
beginning of a sentence, proper names, and so on.

		He took the car on Saturday. 
		Jane was walking along Walnut Street when I met her. 
  
Numerals: write out all numerals as words. Hyphenate only numbers
between twenty-one and ninety-nine.

		twenty-two 
		nineteen ninety-five 
		seven thousand two hundred seventy-five 
		nineteen oh nine 
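
The hyphenation rule can be checked mechanically. The following is a
minimal Python sketch, assuming plain cardinal numbers from 0 to
999,999; the function name spell_number is hypothetical. It does not
cover spoken year forms such as "nineteen ninety-five" or "nineteen oh
nine", which presumably follow what the speaker actually said.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_number(n):
    """Spell out 0 <= n < 1,000,000, hyphenating only twenty-one to ninety-nine."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0 to 999,999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        # The only place a hyphen is ever produced.
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = f"{ONES[hundreds]} hundred"
        return head if rest == 0 else f"{head} {spell_number(rest)}"
    thousands, rest = divmod(n, 1000)
    head = f"{spell_number(thousands)} thousand"
    return head if rest == 0 else f"{head} {spell_number(rest)}"


print(spell_number(22))    # twenty-two
print(spell_number(7275))  # seven thousand two hundred seventy-five
```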
         
Abbreviations: When abbreviations are used as part of a title, they
can remain as abbreviations: 

		Mr. Brown 
		Mrs. Jones 
		Dr. Spock 

However, when they are not used in this fashion, write them out in full. 
         
		I went to the junior league game. 
		I'm going home to see the missus. 
		I went to the doctor, and all he said was, don't
		worry, it's natural.
		Hey mister, please stop hitting me. 
        
--------------------------------------------------------------------------

  
5) Punctuation

The following punctuation marks should be used in the transcripts. The
punctuation marks are primarily for ease of (human) reading. Use only those
punctuation marks indicated below.

	- periods "." should be added at the end of declarative
	  sentences
	- question marks "?" should be added at the end of
	  interrogative sentences
	- commas "," should be added between clauses

-------------------------------------------------------------------------  

                                       
6) Symbols

Acronyms I: those that are pronounced as a single word should be written in
caps (no spaces) and preceded by a "@" symbol:

		@NATO 
		@DARPA 
		@AIDS 
         
                                                         
Acronyms II: acronyms that are normally written as a single word but
pronounced as a sequence of individual letters should be written in
all caps (no spaces) and preceded by a "~" symbol: 

		~FBI 
		~CEO 
		~YMCA 

                                                          
Individual letters: Individual letters that are pronounced as such should be
written in caps and preceded by a "~" symbol:

		I got an ~A on the test. 
		his name is spelled ~S ~I ~M ~P ~S ~O ~N. 
         
         
Proper names: both proper names and place names should be marked
with a "^" symbol. If you encounter a "proper name phrase", mark only those
words as proper names that are true proper names on their own. Personal
initials are treated as individual letters in our transcripts. Initials
should be written in capital letters, be preceded by the "~" and must not
have a period after them unless this marks the end of a sentence.
	If the spelling of the name is uncertain, use a double caret (^^)
to indicate this; the spelling can be researched further during the
second pass.

		^Homer ~L ^Simpson 
		^Beijing 
		^Sony 
		^Maria's Bar and Grill 
		he calls himself ~J ~R ^Jones 
		^^Rafjanii ^Agrawal 
         
  
Partial words: partial words are indicated with a dash (without any spacing
between the dash and the word): 

		absolu- 
		-tion 
         
Mispronounced words: if a word is mispronounced (such as a slip of the
tongue), provide the correct spelling of the word, and place a "+"
symbol in front of the word: 

		+probably 
		+yesterday 
         
Interjections: in each language, we have a set of standardized spellings for
interjections. 
              English interjections 
		mhm 
		uh-huh 
		uh-oh 
		okay 
		whoa 
		whew 
		yeah 
		jeeze 
         
Non-lexemes: in addition to the interjections (which are considered to be
words), we also have a set of standardized spellings for hesitation sounds that
speakers make while speaking in each language. Every such "non word" in the
transcripts is marked with the "%" symbol. 
              English non-lexemes 
		%ach 
		%ah 
		%eee 
		%eh 
		%ew 
		%ha 
		%hee 
		%huh 
		%hm 
		%um 
		%uh 
		%oh 
         
Idiosyncratic words: if a speaker uses a "made-up" word which is not
used by other speakers (although it may be understandable), place a
"*" symbol before the word. Consult your language leader in cases
where you are uncertain whether a word fits in this
category. Onomatopoeia also fits into this category:

		*poodle-ish 
		Do you dress like a *schlump yet? 
		why she said *drr I don't know 
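
Since every symbol in this section is a prefix (or, for partial words,
a leading or trailing dash), a token can be classified by inspecting
its first characters. The following Python sketch illustrates the idea;
the function name classify_token and the label strings are hypothetical
and are not part of the conventions.

```python
# Prefix table; "^^" must be tested before "^".
PREFIXES = [
    ("^^", "proper name, spelling uncertain"),
    ("^", "proper name"),
    ("@", "acronym pronounced as a word"),
    ("~", "spoken letter or letter sequence"),
    ("+", "mispronounced word"),
    ("%", "non-lexeme"),
    ("*", "idiosyncratic word"),
]


def classify_token(token):
    """Return a label for one whitespace-delimited transcript token."""
    # Drop the three sanctioned punctuation marks from the end of the token.
    word = token.rstrip(".,?")
    for prefix, label in PREFIXES:
        if word.startswith(prefix):
            return label
    if word.startswith("-") or word.endswith("-"):
        return "partial word"
    return "plain word"


for tok in "^Homer ~L ^Simpson said %um absolu- *drr".split():
    print(tok, "->", classify_token(tok))
```

Interjections such as uh-huh carry their hyphen in the middle of the
word, so they are not mistaken for partial words.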
         
---------------------------------------------------------------------

7) Noises

	In order to account for sound phenomena such as distortion,
coughs, breaths, unintelligible speech, foreign words and phrases,
etc., we use a set of distinct bracket notations.

{text}: sound made by the talker. Use only those sounds described below: 

              {laugh} 
              {cough} 
              {sneeze} 
              {breath} 
              {lipsmack} 
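
Because the set of permitted sounds is closed, a transcript line can be
checked for labels outside it. A minimal Python sketch follows; the
function name unknown_noises is hypothetical.

```python
import re

# The five sounds permitted by the list above.
ALLOWED_NOISES = {"laugh", "cough", "sneeze", "breath", "lipsmack"}
NOISE = re.compile(r"\{([^{}]*)\}")


def unknown_noises(line):
    """Return the {...} labels in a line that are not in the permitted set."""
    return [label for label in NOISE.findall(line)
            if label not in ALLOWED_NOISES]


print(unknown_noises("Well {laugh} I {sigh} guess so {breath}"))  # ['sigh']
```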
         
----------------------------------------------------------------------

8) Other conventions

((text)): speech that is difficult to make out. The text inside the
parentheses is the transcriber's best guess.

              ((wonderful)) 
              Well, I ((thought)) that it was fine. 
              And then she told me that I should ((just leave)). 
         
                                                          
(( )): unintelligible speech that you cannot even make a guess at
(written with a single space between the parentheses).  During second
pass, this should be isolated from the rest of the text with its own
breakpoints, unless the occurrence is very brief.
         
		I went to the 
		<b 123.456> 
		(( )) 
		<b 127.890> 
		on my way over. 
               
<language text>: this is used to indicate speech (one or more words)
in another language. In place of "language", write the name of the
language, if known. If the language is not known, treat the passage as
unintelligible speech, as above, with (( )).
         
		And then I took all of the <German Sachen> to my room. 
		Oh, <Spanish gracias>, he said. 
		then there were a couple of (( )) which I tried on. 
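
The ((text)), (( )) and <language text> notations can be pulled out of
a line with regular expressions. The Python sketch below is one
possible approach, not an official tool. It assumes that language names
begin with a capital letter, as in the examples above, which is what
keeps lowercase tags such as <b 123.456> from being read as foreign
speech. The function name annotate and the result keys are
hypothetical.

```python
import re

GUESS = re.compile(r"\(\((.*?)\)\)")
# Assumption: a language name is a single capitalized word.
FOREIGN = re.compile(r"<([A-Z][A-Za-z]*) ([^<>]+)>")


def annotate(line):
    """Summarize unintelligible and foreign-language spans in one line."""
    contents = [text.strip() for text in GUESS.findall(line)]
    return {
        "guesses": [text for text in contents if text],
        "no_guess": sum(1 for text in contents if not text),
        "foreign": FOREIGN.findall(line),  # (language, text) pairs
    }


print(annotate("Oh, <Spanish gracias>, he said."))
print(annotate("And then she told me that I should ((just leave))."))
print(annotate("then there were a couple of (( )) which I tried on."))
```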

                                                          
[[NS]]: non-transcribed area between breakpoints (or at the start of a
turn; see simultaneous overlapping speech). Used when there is an area
within a turn that has no speech in it, e.g. a musical
interruption or extended background noise.
         
		<b 123.456> 
		The crowd was furious. 
		<b 124.567> 
		[[NS]] 
		<b 128.987> 
		Calm was soon restored by the arrival of the riot police. 
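
Since an [[NS]] stretch sits between two breakpoints, its length can be
read off the surrounding timestamps. The Python sketch below assumes
that each [[NS]] line runs from the preceding timestamp to the next
one; the function name non_speech_seconds and the tag pattern are
hypothetical.

```python
import re

# Any timestamped tag used in this document: sections, turns, breakpoints,
# and overlap markers.
BREAK = re.compile(r"<(?:sr|sf|sn|t|b|o|e\d*)\s+(\d+(?:\.\d+)?)\s*>")


def non_speech_seconds(lines):
    """Sum the time covered by [[NS]] stretches in a working file."""
    total = 0.0
    current = None    # most recent timestamp seen
    ns_start = None   # start of an [[NS]] stretch still waiting for its end
    for raw in lines:
        line = raw.strip()
        tag = BREAK.match(line)
        if tag:
            time = float(tag.group(1))
            if ns_start is not None:
                total += time - ns_start
                ns_start = None
            current = time
        elif line == "[[NS]]" and current is not None:
            ns_start = current
    return total


example = [
    "<b 123.456>",
    "The crowd was furious.",
    "<b 124.567>",
    "[[NS]]",
    "<b 128.987>",
    "Calm was soon restored by the arrival of the riot police.",
]
print(round(non_speech_seconds(example), 3))  # 4.42
```

An [[NS]] with no later timestamp is ignored, because its end is
unknown.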
                

Overlapping Speech: 

	Overlapping speech occurs when a speaker is interrupted by another
speaker at a roughly equal volume. Situations in which a reporter is
speaking over a political speech (recorded or live) are not considered
to be overlapping, unless the volume is very high.
         
                                                       
	In situations where overlapping speech occurs, insert the
breakpoint at the beginning of the word in which the interruption
started; in other words, at the end of the last complete word.

	i) In this situation, reporter1 is interrupted by
	speaker1. Reporter1 stops speaking, and speaker1 carries on. 

		<t 122.445> <<female, reporter1>> 

		<o 123.456> <<male, spkr1>> 
		SPEAKER1: 
		SPEAKER2: 
		<e1 124.002>

		<t 128.689> <<female, reporter1>> 
         
	ii) If the individual who interrupted is subsequently
	interrupted themselves after continuing to speak, indicate the
	overlap in the same manner; they are now designated speaker1.
                     
                   
                <t 122.445> <<female, reporter1>>

                <o 123.456> <<male, speaker1>> 
                SPEAKER1: 
                SPEAKER2: 
                <e1 124.002> 

		<o 128.345> <<male, speaker2>> 
		SPEAKER1: 
		SPEAKER2: 
		<e2 129.002> 
        

Simultaneous Overlapping Speech: 

	When two speakers start to speak simultaneously, create an
initial turn to identify a speaker, and insert the overlap.
         
	      <t 123.445> <<female>> 
               [[NS]] 
              <o 123.456> <<male>> 
              SPEAKER1: 
              SPEAKER2: 
              <e1 124.002> 
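
The overlap markup in the examples above can be checked for
well-formedness: each <o> should be closed either by an <e1> or <e2>
tag or, when both speakers stop together, by the next turn or section
start. The Python sketch below is one way to do this. The function name
overlap_spans is hypothetical, and treating a second <o> inside an open
overlap as an error is an assumption, since the conventions do not
address that case.

```python
import re

# Tag name, optional speaker digit (used on <e>), then the timestamp.
TAG = re.compile(r"<(sr|sf|sn|t|b|o|e)(\d*)\s+(\d+(?:\.\d+)?)\s*>")
TURN_OR_SECTION = ("t", "sr", "sf", "sn")


def overlap_spans(lines):
    """Return (start, end, first_to_stop) for each overlapped stretch.

    first_to_stop is None when the overlap is closed by a turn or
    section start rather than by an <e1> or <e2> tag.
    """
    spans = []
    open_start = None
    for raw in lines:
        match = TAG.match(raw.strip())
        if not match:
            continue
        name, digit, time = match.group(1), match.group(2), float(match.group(3))
        if name == "o":
            if open_start is not None:
                raise ValueError(f"<o {time}> opened inside another overlap")
            open_start = time
        elif open_start is not None and name == "e":
            spans.append((open_start, time, int(digit) if digit else None))
            open_start = None
        elif open_start is not None and name in TURN_OR_SECTION:
            spans.append((open_start, time, None))
            open_start = None
    if open_start is not None:
        raise ValueError(f"overlap opened at {open_start} is never closed")
    return spans


example = [
    "<t 122.445> <<female, reporter1>>",
    "<o 123.456> <<male, speaker1>>",
    "SPEAKER1:",
    "SPEAKER2:",
    "<e1 124.002>",
    "<o 128.345> <<male, speaker2>>",
    "SPEAKER1:",
    "SPEAKER2:",
    "<e2 129.002>",
]
print(overlap_spans(example))
```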
        

Several speakers: 
               
	In situations when you have several people speaking at
once, and it is very difficult to make them out, insert an <e> tag at
the start of the confused section. Then start the new turn at the
next available clear section.

               <t 223.456> <<male>> 
         
               <b 225.678> 

               <e 230.302> 
               <t 232.563> <<female>>  
        

Speaker Identification:

	For broadcast speech the goal is to identify speakers as precisely as
possible. At the very least, each unique speaker in a recording should
have a unique identification. Further information to be added includes
speaker gender and proper name if possible.

Gender:

	The possibilities are "male", "female", "child", "altered", and
"unison". "Unison" occurs in situations where two or more individuals
say the same thing at the same time.

Proper names:

	Whenever possible, include the proper name of the speaker.
Examples of proper names include Jacques_Cousteau, William_Cohen, and
Madeleine_Albright.

    If a speaker is not identified within a recording, a unique
numerical index is to be used. For the convenience of transcribers, a
broad categorical identification can be used. The two categories
currently supported are Reporter and Speaker.
    Reporter refers to either the anchor of the news broadcast, or the
reporter on location giving the story.
    Speaker, on the other hand, refers to anyone interviewed on tape
by the Reporter, when that person is not identified by name. When
identifying nameless speakers, keep in mind that it is the number
assigned to that voice which is the crucial information more than the
category. Numbers must not overlap. Each successive anonymous speaker
should have a unique number, regardless of the category the speaker is
assigned to. For example, the following sequence is entirely possible:

   reporter_1 
   reporter_2 
   spkr_3 
   spkr_4 
   spkr_5 
   reporter_2 (assuming it is the same voice as the previous reporter_2) 
   reporter_6 (a new reporter distinct from the two above)
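
The numbering rule amounts to a single running counter shared by both
categories. The Python sketch below reproduces the sequence above under
that reading. The class name SpeakerRegistry is hypothetical, and the
"voice" argument is a stand-in for the transcriber's own judgment that
two stretches of speech come from the same person.

```python
class SpeakerRegistry:
    """Hand out one running number per anonymous voice, across categories."""

    CATEGORIES = ("reporter", "spkr")

    def __init__(self):
        self._labels = {}  # voice key -> label such as "spkr_3"

    def label_for(self, voice, category):
        if category not in self.CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if voice not in self._labels:
            # The next number is shared by both categories, so no two
            # anonymous speakers ever receive the same index.
            self._labels[voice] = f"{category}_{len(self._labels) + 1}"
        return self._labels[voice]


registry = SpeakerRegistry()
turns = [("anchor", "reporter"), ("field", "reporter"), ("a", "spkr"),
         ("b", "spkr"), ("c", "spkr"), ("field", "reporter"),
         ("studio", "reporter")]
for voice, category in turns:
    print(registry.label_for(voice, category))
```

A voice that returns later keeps its original label, which is why the
sixth turn above is labeled reporter_2 again.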


Native, non-native, and altered:

	In English broadcasts, "native" speakers are speakers of
standard North American dialects.  These are not marked. "Non-native"
speakers are foreign-accented speakers, including British-English
speakers. "Altered" is used to tag deliberately altered voice
patterns, for instance in the case of a disguised informant's speech,
or for machine-generated speech.
               
       Examples

         <sr 1.402> <<male, Leon_Harris>> 
         <sr 158.244> <<female, Joie_Chen>> 
         <t 196.813> <<male, spkr_1>> 
         <t 498.314> <<male, non-native, spkr_3>> 
         <t 567.215> <<male, altered, spkr_4>>
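
Header lines of this kind can be split into their fields with a short
script. The following Python sketch is an illustration only. The
function name parse_header and the result keys are hypothetical, and
treating "non-native" and "altered" as flags separate from gender is an
assumption based on the examples above.

```python
import re

HEADER = re.compile(r"<(sr|sf|sn|t)\s+(\d+(?:\.\d+)?)\s*>\s*<<(.*?)>>")
GENDERS = {"male", "female", "child", "unison"}


def parse_header(line):
    """Split a section or turn header into tag, time, and speaker fields."""
    match = HEADER.search(line)
    if not match:
        return None
    info = {"tag": match.group(1), "time": float(match.group(2)),
            "gender": None, "non_native": False, "altered": False,
            "speaker": None}
    for field in (part.strip() for part in match.group(3).split(",")):
        if field == "non-native":
            info["non_native"] = True
        elif field == "altered":
            info["altered"] = True
        elif field in GENDERS:
            info["gender"] = field
        elif field:
            info["speaker"] = field  # proper name or numbered label
    return info


print(parse_header("<t 498.314> <<male, non-native, spkr_3>>"))
print(parse_header("<sr 1.402> <<male, Leon_Harris>>"))
```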