File: transpec.doc, Updated 07/20/95

Transcription Specifications for Marketplace Broadcast Corpus

Updates:
  950707: added "rev" field to "broadcast" SGML tag.
  950720: added "turn-based" time marks
  950720: added clarification regarding story bracketing

Marketplace transcription files will be named according to the following convention:

  YYMMDD.txt

where YYMMDD is the date of the Marketplace broadcast.

1. The basic transcription file will be for an entire broadcast. Markers of internal segments like "story" will be included in the transcription file to facilitate later break-outs for testing, etc.

2. These SGML-like markers will be used to segment the transcribed speech and/or specify attributes of the segments:

  - "broadcast", delimiting a broadcast, including an i.d. and revision date, e.g.: ...

  - "story", delimiting stories, including an ID, topic label, begin time, and end time, e.g.: ...
    The ID will be an integer which indicates the order of the story in the broadcast. Note that "credits" and self-identification by the anchor-person should be excluded from adjoining stories. Self-identification by correspondents or commentators should be included within their stories. (The transcribers should just add the ID and topic labels. NIST will add the time marks.)

  - "language", delimiting foreign language passages, e.g.: ...

  - "sung", delimiting sung lyrics, e.g.: ...

3. Unique speakers within a broadcast file will be identified by letters A, B, C, ..., AA, AB, ..., ZZ, extending the "A/B" identification used in Switchboard. When a transcriber is in doubt about whether a new speaker is one they've heard before, they should assume the speaker is new, use the next letter, and flag it with a comment for later verification.

A separate file will be created to hold speaker information for the broadcast. The broadcast speaker information file will have the same basename as the transcription file, but will have an ".spk" extension.
Lines in this file will give as much information about each speaker as can be gleaned from the recording:

  - name, e.g.:
      speaker_a_name: Henry Gomez
  - sex (male, female, unknown), e.g.:
      speaker_a_sex: male
  - dialect (optional; default is native speaker of American English), e.g.:
      speaker_a_dialect: Hispanic
  - age (optional; one of child, adult, elderly; default is adult), e.g.:
      speaker_a_age: adult
  - role (if known), e.g.:
      speaker_a_role: airline attendant

(If the separate files are an inconvenience to the transcribers, NIST can separate them.)

4. Each speaker's turn in the broadcast will be prefixed by the letter i.d. of the speaker in uppercase, a colon, and a space, and transcriptions of turns will be separated by a blank line, e.g.:

  A: And now, here's a report from Madrid.

  B: This is Wally Balew, reporting for K E R A in ...

For test data, turn-based time marks can be added between the speaker i.d. and the colon, e.g.:

  A(bt=101.45 et=103.23): And now, here's a report from Madrid.

The time marks should cover an entire speaker turn, even if it crosses story boundaries.

5. Stretches of obviously different audio characteristics, such as recordings from a telephone vs. studio recording, will be tagged with [audio_change] at their beginning, e.g.:

  B: An airline attendant -- we'll call her Melissa -- had this to say.

  [audio_change]
  C: Jeez, why don't they pay us more?

6. Each non-speech sound in the recording should be marked, using one of the tags listed below. Note that you can always use [noise] to transcribe something that isn't well described by any of the other tags:

  [noise]    [music]          [inhaling]    [cough]
  [door]     [phone_ringing]  [sigh]        [throat_clearing]

If the event being described lasts longer than a few words, mark its beginning with the tag name followed by a slash, in brackets [ ], and its end with the tag name preceded by a slash, in brackets, e.g.:

  A: [music/] And now, here's our Madrid correspondent [/music]

7.
Comments may be inserted into the text of the transcription, marked by double pairs of curly brackets, e.g. "{{ weird voice quality here }}".

8. If the transcribers can't decide between two words, both of them may be used as alternates, enclosed in curly brackets and separated by a slash, e.g.: "{which/this}", "{they are / they're}". This convention should only be used where there is ambiguity about which word or words the speaker said. It should NOT be used when the transcriber is unclear on the spelling of a word and wishes to pose alternative spellings.

9. Transcribers should type a contraction whenever a contraction is clearly heard. In doubtful cases, enter both forms, using the notation for alternations, e.g. "{they are / they're}".

10. Partial words should be ended with a hyphen. If the transcriber knows what the word was, the unsaid part should be enclosed in parentheses, e.g. "San Franc(isco)-". If not, just end the word with a hyphen, e.g. "... to the f- f- f- initial part ... ".

11. Words that are heard, but cannot be identified with certainty due to background noise or bad pronunciation, should be enclosed in double parentheses, e.g. " ... I ((thought)) the answer was ...". This notation should NOT be used to mark questioned spellings.

12. Accent marks and other diacritics need not be transcribed, but if they are, these markers should be used:

  SYMBOL                 USAGE                                  EXAMPLE
  \3 (acute)             add acute accent to next letter        resum\3e
  \4 (grave)             add grave accent to next letter        Amp\4ere
  \5 (circumflex)        add circumflex accent to next letter   r\5ole
  \6 (umlaut/dieresis)   add umlaut to next letter              \6Ohman
  \7 (tilde)             add tilde to next letter               ma\7nana
  \8 (cedilla)           add cedilla to next letter             fa\8cade

13. All spoken words should be spelled out, e.g., Junior, not Jr., Saint, not St., etc. Abbreviations should not be used. The exception to this rule is that Mr., Mrs., Messrs., and Ms. should be represented in their abbreviated form since no commonly accepted spellings exist for at least some of them.
Likewise, letter and number sequences should be spelled out: D F W, seven forty-seven, U S A, one O one, F B I, etc., unless the letter sequence is pronounced as a word, as in NASA, ROM, DOS.

14. Words that have been run together, such as "gonna", "y'all", "kinda", etc., should be transcribed as the separate words: "going to", "you all", "kind of".

15. Simultaneous talking, where the speech of two speakers overlaps in time, should be marked by tagging the beginning and ending of the overlapping sections with a pound sign (#). The speech of both talkers should be marked this way, e.g.:

  A: I never heard such nonsense, you know, # as I heard that #

  B: # Yeah, I know. #

  A: day when I blah blah blah

16. If a word or words are clearly heard and understood, but the proper spelling cannot be determined, an "@" should be prepended to each word in question. This may occur frequently with proper names. ALL occurrences of questioned words should carry this notation, not just the first, e.g.:

  "... Israeli prime minister @Yitzhak @Rabin today and ..."
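
Illustration (not part of the specification): the turn format in item 4 and the speaker lettering in item 3 are regular enough to check mechanically. The Python sketch below is one possible reading of those rules, offered only as an aid. The names TURN_RE, speaker_ids, and parse_turns are hypothetical, and the sketch assumes that each turn begins its own blank-line-separated chunk and that time marks are plain decimal numbers; a chunk that starts with anything else, such as a tag on its own line, is simply skipped.

```python
import re
from itertools import product
from string import ascii_uppercase

# Speaker i.d. (one or two uppercase letters), optional turn-based
# time marks, then a colon, a space, and the transcribed text.
TURN_RE = re.compile(
    r"^(?P<speaker>[A-Z]{1,2})"
    r"(?:\(bt=(?P<bt>\d+(?:\.\d+)?) et=(?P<et>\d+(?:\.\d+)?)\))?"
    r": (?P<text>.*)$",
    re.DOTALL,
)


def speaker_ids():
    """Yield speaker letters in order: A, B, ..., Z, AA, AB, ..., ZZ."""
    yield from ascii_uppercase
    for first, second in product(ascii_uppercase, repeat=2):
        yield first + second


def parse_turns(transcript):
    """Split a transcript on blank lines and parse chunks that look like turns."""
    turns = []
    for chunk in re.split(r"\n\s*\n", transcript.strip()):
        match = TURN_RE.match(chunk.strip())
        if match is None:
            continue  # assumption: anything not starting with a turn prefix is skipped
        bt, et = match.group("bt"), match.group("et")
        turns.append({
            "speaker": match.group("speaker"),
            "bt": float(bt) if bt is not None else None,
            "et": float(et) if et is not None else None,
            "text": match.group("text"),
        })
    return turns


if __name__ == "__main__":
    sample = (
        "A(bt=101.45 et=103.23): And now, here's a report from Madrid.\n"
        "\n"
        "B: This is Wally Balew, reporting for K E R A in ..."
    )
    for turn in parse_turns(sample):
        print(turn)
```

A checker built along these lines could, for instance, flag files in which a speaker letter appears before all earlier letters in the speaker_ids() sequence have been used.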