File: transpec.doc, Updated 07/20/95

   Transcription Specifications for Marketplace Broadcast Corpus

Updates:

950707: added "rev" field to "broadcast" SGML tag.
950720: added "turn-based" time marks
950720: added clarification regarding story bracketing


   Marketplace transcription files will be named according to the
   following convention:

	YYMMDD.txt

   where YYMMDD is the date of the Marketplace broadcast.
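
   The naming convention can be sketched in a few lines of Python.
   This is only an illustration: the function names are hypothetical,
   and the assumption that every broadcast falls in the years 19YY is
   ours, not part of the specification.

```python
import re
from datetime import date

# Hypothetical helpers; the specification defines only the YYMMDD.txt pattern.
FILENAME_RE = re.compile(r"^(\d{2})(\d{2})(\d{2})\.txt$")


def transcript_filename(broadcast_date: date) -> str:
    """Build the transcription file name for a broadcast date."""
    return broadcast_date.strftime("%y%m%d") + ".txt"


def parse_transcript_filename(name: str) -> date:
    """Recover the broadcast date; assumes the century is 19YY."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a YYMMDD.txt name: {name!r}")
    yy, mm, dd = (int(group) for group in match.groups())
    return date(1900 + yy, mm, dd)
```

   For example, transcript_filename(date(1995, 4, 7)) gives the name
   "950407.txt".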


   1. The basic transcription file will be for an entire
      broadcast.  Markers of internal segments like
      "story" will be included in the transcription file
      to facilitate later break-outs for testing, etc.


   2. These SGML-like markers will be used to segment the
      transcribed speech and/or specify attributes of the
      segments:

     - "broadcast", delimiting a broadcast, including an i.d. and revision
        date, e.g.:
       <broadcast id="marketplace.950407" rev=950701>
        ...
       </broadcast>

     - "story", delimiting stories, including an ID, topic label,
        begin time, and end time, e.g.:

       <story id=1 topic="baseball" bt=205.23 et=310.45>
       ...
       </story>

       The id will be an integer number which indicates the order of the
       story in the broadcast.  Note that "credits" and self-identification
       by the anchor-person should be excluded from adjoining stories.
       Self-identification by correspondents or commentators should be
       included within their stories.
 
       (The transcribers should add only the ID and topic labels.
       NIST will add the time marks.)
     
     - "language", delimiting foreign language passages, e.g.

       <language Spanish>
       ...
       </language>

     - "sung", delimiting sung lyrics, e.g.:
       <sung>
       ...
       </sung>
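
     As an illustration of how the "story" marker might be read back
     by a program, here is a minimal Python sketch.  It assumes that
     attributes are written as name=value with the value either quoted
     or a bare token, as in the examples above; the function name is
     hypothetical, and the bt/et attributes are treated as optional
     because they are added later.

```python
import re

# Assumed attribute syntax: name=value, value either "quoted" or a bare token.
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|([^\s>]+))')
STORY_RE = re.compile(r"<story\s+([^>]*)>")


def story_attributes(tag_line: str) -> dict:
    """Return the attributes of a <story ...> tag as a dictionary."""
    match = STORY_RE.search(tag_line)
    if match is None:
        raise ValueError("no <story ...> tag found")
    attrs = {}
    for name, quoted, bare in ATTR_RE.findall(match.group(1)):
        attrs[name] = quoted or bare
    # The id is an integer; begin and end times are seconds.
    if "id" in attrs:
        attrs["id"] = int(attrs["id"])
    for key in ("bt", "et"):
        if key in attrs:
            attrs[key] = float(attrs[key])
    return attrs
```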


   3. Unique speakers within a broadcast file will be
     identified by letters A, B, C, ..., AA, AB, ..., ZZ,
     extending the "A/B" identification used in Switchboard.
     When a transcriber is in doubt about whether a new
     speaker is one they've heard before, they should assume
     the speaker is new, use the next letter, and flag
     it with a comment for later verification.

     A separate file will be created to hold speaker information for
     the broadcast.  The broadcast speaker information file will have
     the same basename as the transcription file, but will have an
     ".spk" extension.  Lines in this file will give as much
     information about each speaker as can be gleaned from the
     recording:

       - name, e.g.:
         speaker_a_name: Henry Gomez

       - sex (male, female, unknown), e.g.:
        speaker_a_sex: male

       - dialect (optional, default is native speaker of American English), e.g.:
         speaker_a_dialect: Hispanic

       - age (child, adult, or elderly; optional, default is adult), e.g.:
         speaker_a_age: adult

       - role (if known), e.g.:
         speaker_a_role: airline attendant

       (If keeping separate files is an inconvenience to the
       transcribers, NIST can separate them.)
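
     A speaker information file in this layout could be read with a
     short Python routine such as the one below.  This is a sketch
     only: the function name is hypothetical, the default strings are
     our paraphrase of the defaults stated above, and lines that do not
     match the "speaker_x_field: value" pattern are simply skipped.

```python
import re

SPK_LINE_RE = re.compile(
    r"^\s*speaker_([a-z]{1,2})_(name|sex|dialect|age|role):\s*(.*?)\s*$"
)
# Paraphrased defaults for the two optional fields that have one.
DEFAULTS = {"dialect": "native speaker of American English", "age": "adult"}


def parse_speaker_file(text: str) -> dict:
    """Map each speaker letter (uppercase) to its known attributes."""
    speakers = {}
    for line in text.splitlines():
        match = SPK_LINE_RE.match(line)
        if match is None:
            continue  # blank or unrecognized line
        letter, field, value = match.groups()
        record = speakers.setdefault(letter.upper(), dict(DEFAULTS))
        record[field] = value
    return speakers
```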


    4. Each speaker's turn in the broadcast will be prefixed 
       by the letter i.d. of the speaker in uppercase, a colon, and a
       space, and transcription of turns will be separated
       by a blank line, e.g.:

       A: And now, here's a report from Madrid.

       B: This is Wally Balew, reporting for K E R A in ...

       For test data, turn-based time marks can be added between the
       speaker i.d. and the colon, e.g.:
 
       A(bt=101.45 et=103.23): And now, here's a report from Madrid.

       The time marks should cover an entire speaker turn, even if it
       crosses story boundaries.
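
       The turn prefix, with or without time marks, can be recognized
       with a single regular expression.  The Python sketch below
       assumes one or two uppercase letters for the speaker i.d. and
       decimal seconds for bt and et; the names Turn and parse_turn are
       hypothetical.

```python
import re
from typing import NamedTuple, Optional

TURN_RE = re.compile(
    r"^([A-Z]{1,2})"                                      # speaker i.d.
    r"(?:\(bt=(\d+(?:\.\d+)?)\s+et=(\d+(?:\.\d+)?)\))?"   # optional time marks
    r":\s+(.*)$",                                         # colon, space, text
    re.DOTALL,
)


class Turn(NamedTuple):
    speaker: str
    begin: Optional[float]
    end: Optional[float]
    text: str


def parse_turn(block: str) -> Turn:
    """Split one blank-line-delimited block into speaker, times, and text."""
    match = TURN_RE.match(block.strip())
    if match is None:
        raise ValueError(f"not a speaker turn: {block!r}")
    speaker, bt, et, text = match.groups()
    begin = float(bt) if bt is not None else None
    end = float(et) if et is not None else None
    return Turn(speaker, begin, end, text)
```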


    5. Stretches of obviously different audio characteristics,
       such as recordings from a telephone vs. studio recording,
       will be tagged with [audio_change] at their beginning, e.g.:

       B:  An airline attendant -- we'll call her Melissa -- had
       this to say.

       [audio_change]

       C: Jeez, why don't they pay us more?


    6. Each non-speech sound in the recording should be marked
       using one of the tags listed below.  Note that [noise] can
       always be used for a sound that is not well described by
       any of the other tags:
      
       [noise]
       [music]
       [inhaling]
       [cough]
       [door]
       [phone_ringing]
       [sigh]
       [throat_clearing]

       If the event being described lasts longer than a few words, then
       indicate the beginning, followed by a slash, in brackets [ ], and
       the end, preceded by a slash, in brackets, e.g.:
     
       A: [music/] And now, here's our Madrid correspondent [/music]
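
       A consistency check for these tags might look like the Python
       sketch below.  It covers only the eight tags listed in this
       item (so [audio_change] from item 5 would be reported as
       unknown unless it is added to the set), and the function name
       and message wording are hypothetical.

```python
import re

NON_SPEECH = {"noise", "music", "inhaling", "cough", "door",
              "phone_ringing", "sigh", "throat_clearing"}
# Matches [tag], [tag/] (beginning of a span) and [/tag] (end of a span).
TAG_RE = re.compile(r"\[(/?)([a-z_]+)(/?)\]")


def check_non_speech_tags(text: str) -> list:
    """Return a list of problems: unknown tags and unbalanced spans."""
    problems = []
    open_spans = []
    for closing, name, opening in TAG_RE.findall(text):
        if name not in NON_SPEECH:
            problems.append(f"unknown tag: {name}")
        elif closing and opening:
            problems.append(f"malformed tag: {name}")
        elif opening:
            open_spans.append(name)
        elif closing:
            if name in open_spans:
                open_spans.remove(name)
            else:
                problems.append(f"end without beginning: {name}")
    problems.extend(f"beginning without end: {name}" for name in open_spans)
    return problems
```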


    7. Comments may be inserted into the text of the transcription, marked by
       double pairs of curly brackets,  e.g. "{{ weird voice quality here }}".


    8. If the transcribers can't decide between two words, both of
       them may be used as alternates, enclosed in curly brackets and
       separated by a slash, e.g.: "{which/this}", "{they are / they're}".
       This convention should only be used where there is ambiguity
       about which word or words the speaker said.  It should NOT be
       used when the transcriber is unclear on the spelling of a word
       and wishes to pose alternative spellings.
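
       The alternation notation can be told apart from the
       double-bracket comments of item 7 mechanically.  The Python
       sketch below is one way to do it, under the assumption that
       alternations never nest; the function names are hypothetical.

```python
import re

# A single-brace group that is not part of a {{ comment }}.
ALT_RE = re.compile(r"(?<!\{)\{([^{}]+)\}(?!\})")


def alternates(text: str) -> list:
    """Return one list of candidate readings per {x/y} group."""
    groups = []
    for body in ALT_RE.findall(text):
        options = [option.strip() for option in body.split("/")]
        if len(options) > 1:
            groups.append(options)
    return groups


def first_reading(text: str) -> str:
    """Replace every alternation with its first candidate."""
    return ALT_RE.sub(lambda m: m.group(1).split("/")[0].strip(), text)
```

       Choosing the first candidate is an arbitrary policy adopted
       here for illustration; the specification does not rank the
       alternates.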


    9. Transcribers should type a contraction whenever a contraction is
       clearly heard.  In doubtful cases, enter both forms, using the
       notation for alternations, e.g. "{they are / they're}".


   10. Partial words should be ended with a hyphen.  If the transcriber
       knows what the word was, the unsaid part should be enclosed in
       parentheses, e.g. "San Franc(isco)-".  If not, just end the
       word with a hyphen, e.g. "... to the f- f- f- initial part ... ".
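
       Partial words written this way can be located automatically.
       The Python sketch below assumes that a partial word is a run
       of letters, optionally followed by the unsaid part in
       parentheses, and then a hyphen that is not followed by another
       letter or hyphen (so that "forty-seven" and "--" are left
       alone); the function name is hypothetical.

```python
import re

PARTIAL_RE = re.compile(r"\b([A-Za-z']+)(?:\(([A-Za-z']+)\))?-(?![A-Za-z\-])")


def partial_words(text: str) -> list:
    """Return (spoken part, full word or None) for each partial word."""
    found = []
    for spoken, unsaid in PARTIAL_RE.findall(text):
        found.append((spoken, spoken + unsaid if unsaid else None))
    return found
```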


   11. Words that are heard but cannot be identified, because of
       background noise or bad pronunciation, should be enclosed in
       double parentheses, e.g. " ... I ((thought)) the answer was
       ...".

       This notation should NOT be used to mark questioned spellings.


   12. Accent marks and other diacritics need not be transcribed, but
       if they are, these markers should be used:

       SYMBOL                USAGE                              EXAMPLE
       \3 (acute)            add acute accent to next letter    resum\3e
       \4 (grave)            add grave accent to next letter    Amp\4ere
       \5 (circumflex)       add circumflex to next letter      r\5ole
       \6 (umlaut/dieresis)  add umlaut to next letter          \6Ohman
       \7 (tilde)            add tilde to next letter           ma\7nana
       \8 (cedilla)          add cedilla to next letter         fa\8cade
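
       Because each marker precedes the letter it modifies, the
       markers can be converted to accented characters by moving a
       combining mark after the letter and normalizing.  The Python
       sketch below does this with Unicode; the mapping to Unicode
       combining characters and the function name are our
       assumptions, not part of the specification.

```python
import re
import unicodedata

# Marker digit -> Unicode combining character (assumed mapping).
COMBINING = {
    "3": "\u0301",  # acute
    "4": "\u0300",  # grave
    "5": "\u0302",  # circumflex
    "6": "\u0308",  # umlaut / dieresis
    "7": "\u0303",  # tilde
    "8": "\u0327",  # cedilla
}
MARK_RE = re.compile(r"\\([3-8])([A-Za-z])")


def decode_diacritics(text: str) -> str:
    """Turn markers such as \\3e into precomposed accented letters."""
    composed = MARK_RE.sub(
        lambda m: m.group(2) + COMBINING[m.group(1)], text
    )
    return unicodedata.normalize("NFC", composed)
```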


   13. All spoken words should be spelled out,  e.g., Junior, not Jr.,
       Saint not St., etc.  Abbreviations should not be used.  
       The exception to this rule is that Mr., Mrs., Messrs., and Ms.
       should be represented in their abbreviated form since no
       commonly accepted spellings exist for at least some of them.
 
       Likewise, letter and number sequences should be spelled out:
       D F W, seven forty-seven, U S A, one O one, F B I, etc., unless the
       letter sequence is pronounced as a word, as in NASA, ROM, DOS.
 

   14. Words that have been run together, such as "gonna", "y'all",
       "kinda", etc., should be transcribed as the separate words:
       "going to", "you all", "kind of".


   15. Simultaneous talking, where the speech of two speakers overlaps
       in time, should be marked by tagging the beginning and ending
       of the overlapping sections with a pound sign (#).  The speech
       of both the talkers should be marked this way, e.g.:

       A: I never heard such nonsense, you know, # as I heard that #

       B: # Yeah, I know. #

       A: day when I blah blah blah
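
       Since the pound signs come in pairs within a turn, the
       overlapped stretches can be pulled out by splitting on "#".
       The Python sketch below assumes that "#" is used for nothing
       else in the transcription text; the function name is
       hypothetical.

```python
def overlapped_segments(turn_text: str) -> list:
    """Return the stretches of a turn that lie between paired # marks."""
    parts = turn_text.split("#")
    # n pound signs give n + 1 parts, so paired marks give an odd count.
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced # markers")
    return [parts[i].strip() for i in range(1, len(parts), 2)]
```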


   16. If one or more words are clearly heard and understood, but the
       proper spelling cannot be determined, an "@" should be prepended
       to each word in question.  This may occur frequently
       with proper names.  ALL occurrences of questioned words should
       contain this notation, not just the first.
       
       e.g., 

       "... Israeli prime minister @Yitzhak @Rabin today and ..."
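
       Words flagged in this way can be collected for later spelling
       verification.  The Python sketch below counts them; it assumes
       that a flagged word consists of letters, apostrophes, and
       internal hyphens only, and the function name is hypothetical.

```python
import re
from collections import Counter

FLAGGED_RE = re.compile(r"@([A-Za-z][A-Za-z'\-]*)")


def questioned_spellings(text: str) -> Counter:
    """Count how often each @-flagged word occurs in a transcription."""
    return Counter(FLAGGED_RE.findall(text))
```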