File: transpec.doc, Updated 07/20/95
Transcription Specifications for Marketplace Broadcast Corpus
Updates:
950707: added "rev" field to "broadcast" SGML tag.
950720: added "turn-based" time marks
950720: added clarification regarding story bracketing
Marketplace transcription files will be named according to the
following convention:
YYMMDD.txt
where YYMMDD is the date of the Marketplace broadcast.
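As an illustration of the naming convention, the following Python sketch builds a file name from a broadcast date. The function name transcript_filename is hypothetical and not part of this specification.

```python
from datetime import date


def transcript_filename(broadcast_date: date) -> str:
    """Return the YYMMDD.txt file name for a broadcast date (hypothetical helper)."""
    # %y, %m, %d give the two-digit year, month, and day.
    return broadcast_date.strftime("%y%m%d") + ".txt"
```

For example, a broadcast aired on July 20, 1995 would be stored as 950720.txt.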
1. The basic transcription file will be for an entire
broadcast. Markers of internal segments like
"story" will be included in the transcription file
to facilitate later break-outs for testing, etc.
2. These SGML-like markers will be used to segment the
transcribed speech and/or specify attributes of the
segments:
- "broadcast", delimiting a broadcast, including an i.d. and revision
date, e.g.:
...
- "story", delimiting stories, including an ID, topic label,
begin time, and end time, e.g.:
...
The ID will be an integer indicating the order of the story
in the broadcast. Note that "credits" and self-identification
by the anchorperson should be excluded from adjoining stories.
Self-identification by correspondents or commentators should be
included within their stories.
(The transcribers should just add the ID and topic labels.
NIST will add the time marks.)
- "language", delimiting foreign language passages, e.g.
...
- "sung", delimiting sung lyrics, e.g.:
...
3. Unique speakers within a broadcast file will be
identified by letters A, B, C, ..., AA, AB, ..., ZZ,
extending the "A/B" identification used in Switchboard.
When a transcriber is in doubt about whether a new
speaker is one they've heard before, they should assume
the speaker is new, use the next letter, and flag
it with a comment for later verification.
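The letter sequence described above (A through Z, then AA through ZZ) can be sketched in Python as follows. The generator name speaker_ids is an assumption made for illustration only.

```python
from itertools import product
from string import ascii_uppercase


def speaker_ids():
    """Yield speaker letters in order: A..Z, then AA, AB, ..., ZZ."""
    # Single letters first.
    yield from ascii_uppercase
    # Then every two-letter combination, in alphabetical order.
    for first, second in product(ascii_uppercase, repeat=2):
        yield first + second
```

A transcriber's tool could call next() on this generator each time a new speaker is heard, which gives up to 702 distinct speakers per broadcast.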
A separate file will be created to hold speaker information for
the broadcast. The broadcast speaker information file will have
the same basename as the transcription file, but will have an
".spk" extension. Lines in this file will give as much
information about each speaker as can be gleaned from the
recording:
- name, e.g.:
speaker_a_name: Henry Gomez
- sex (male, female, unknown), e.g.:
speaker_a_sex: male
- dialect (optional, default is native speaker of American English), e.g.:
speaker_a_dialect: Hispanic
- age (child, adult, or elderly; optional, default is adult), e.g.:
speaker_a_age: adult
- role (if known), e.g.:
speaker_a_role: airline attendant
(If the separate files are an inconvenience to the transcribers,
NIST can separate them.)
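A speaker information file in this layout could be read with a sketch like the one below. It assumes that each line has the form speaker_<letters>_<field>: <value>, that the speaker letters appear in lowercase as in the examples above, and that the two documented defaults can be stored as plain strings. The names parse_spk, SPK_LINE, and DEFAULTS are hypothetical.

```python
import re

# Assumed line layout: speaker_<letters>_<field>: <value>
SPK_LINE = re.compile(r"^speaker_([a-z]+)_(name|sex|dialect|age|role):\s*(.*)$")

# Defaults stated in the specification for the optional fields.
DEFAULTS = {"dialect": "native speaker of American English", "age": "adult"}


def parse_spk(text):
    """Return {speaker letter: {field: value}} for the lines that match."""
    speakers = {}
    for line in text.splitlines():
        m = SPK_LINE.match(line.strip())
        if m is None:
            continue  # skip blank or unrecognized lines
        letter, field, value = m.groups()
        # Each new speaker starts from a fresh copy of the defaults.
        record = speakers.setdefault(letter.upper(), dict(DEFAULTS))
        record[field] = value.strip()
    return speakers
```

Speakers for whom no dialect or age line is present keep the default values.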
4. Each speaker's turn in the broadcast will be prefixed
by the letter i.d. of the speaker in uppercase, a colon, and a
space, and transcription of turns will be separated
by a blank line, e.g.:
A: And now, here's a report from Madrid.
B: This is Wally Balew, reporting for K E R A in ...
For test data, turn-based time marks can be added between the
speaker i.d. and the colon, e.g.:
A(bt=101.45 et=103.23): And now, here's a report from Madrid.
The time marks should cover an entire speaker turn, even if it
crosses story boundaries.
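The turn prefix, with or without time marks, can be recognized with a regular expression. The sketch below assumes one or two uppercase letters for the speaker and decimal numbers for bt and et; the names TURN and parse_turn are hypothetical.

```python
import re

# Speaker letters, optional "(bt=... et=...)", then colon, space, and the text.
TURN = re.compile(
    r"^(?P<speaker>[A-Z]{1,2})"
    r"(?:\(bt=(?P<bt>\d+(?:\.\d+)?) et=(?P<et>\d+(?:\.\d+)?)\))?"
    r": (?P<text>.*)$",
    re.DOTALL,  # a turn may span several lines
)


def parse_turn(block):
    """Return (speaker, begin_time, end_time, text), or None if not a turn."""
    m = TURN.match(block)
    if m is None:
        return None
    bt = float(m["bt"]) if m["bt"] is not None else None
    et = float(m["et"]) if m["et"] is not None else None
    return m["speaker"], bt, et, m["text"]
```

Splitting the transcription file on blank lines and passing each block to parse_turn would give one tuple per turn; blocks that are not turns return None.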
5. Stretches of obviously different audio characteristics,
such as recordings from a telephone vs. studio recording,
will be tagged with [audio_change] at their beginning, e.g.:
B: An airline attendant -- we'll call her Melissa -- had
this to say.
[audio_change]
C: Jeez, why don't they pay us more?
6. Each non-speech sound in the recording should be marked
using one of the tags listed below. Note that you can
always use [noise] for a sound that is not well described
by any of the other tags:
[noise]
[music]
[inhaling]
[cough]
[door]
[phone_ringing]
[sigh]
[throat_clearing]
If the event being described lasts longer than a few words, then
indicate the beginning, followed by a slash, in brackets [ ], and
the end, preceded by a slash, in brackets, e.g.:
A: [music/] And now, here's our Madrid correspondent [/music]
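A simple consistency check for these tags might look like the sketch below. It flags bracketed tags that are not in the list above and begin/end markers that are not paired. It assumes that the only other bracketed tag is [audio_change] and that a span opens and closes within the text being checked; the function and constant names are hypothetical.

```python
import re

NONSPEECH = {"noise", "music", "inhaling", "cough", "door",
             "phone_ringing", "sigh", "throat_clearing"}
OTHER_TAGS = {"audio_change"}  # bracketed, but not a non-speech sound

# Matches [tag], [tag/] (begin of span), and [/tag] (end of span).
TAG = re.compile(r"\[(/?)([a-z_]+)(/?)\]")


def check_nonspeech(text):
    """Return a list of problem descriptions; an empty list means no problems."""
    problems = []
    open_spans = []
    for m in TAG.finditer(text):
        end_slash, name, begin_slash = m.groups()
        if name in OTHER_TAGS:
            continue
        if name not in NONSPEECH:
            problems.append(f"unknown tag: {m.group(0)}")
        elif begin_slash and end_slash:
            problems.append(f"malformed tag: {m.group(0)}")
        elif begin_slash:
            open_spans.append(name)
        elif end_slash:
            if name in open_spans:
                open_spans.remove(name)
            else:
                problems.append(f"end without begin: {m.group(0)}")
    for name in open_spans:
        problems.append(f"begin without end: [{name}/]")
    return problems
```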
7. Comments may be inserted into the text of the transcription, marked by
double pairs of curly brackets, e.g. "{{ weird voice quality here }}".
8. If the transcribers can't decide between two words, both of
them may be used as alternates, enclosed in curly brackets and
separated by a slash, e.g.: "{which/this}", "{they are / they're}".
This convention should only be used where there is ambiguity
about which word or words the speaker said. It should NOT be
used when the transcriber is unclear on the spelling of a word
and wishes to pose alternative spellings.
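Because comments use double curly brackets and alternates use single ones, a tool has to handle comments first. The sketch below pulls out both kinds of markup; it assumes that comments do not nest and that alternates contain no curly brackets. The names extract_markup, COMMENT, and ALTERNATES are hypothetical.

```python
import re

COMMENT = re.compile(r"\{\{(.*?)\}\}")
# One pair of curly brackets holding two or more slash-separated choices.
ALTERNATES = re.compile(r"\{([^{}/]+(?:/[^{}/]+)+)\}")


def extract_markup(text):
    """Return (comments, alternates) found in a stretch of transcription."""
    comments = [c.strip() for c in COMMENT.findall(text)]
    # Remove comments before looking for alternates so that the
    # double brackets are not mistaken for alternate markup.
    remainder = COMMENT.sub(" ", text)
    alternates = [
        [choice.strip() for choice in body.split("/")]
        for body in ALTERNATES.findall(remainder)
    ]
    return comments, alternates
```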
9. Transcribers should type a contraction whenever a contraction is
clearly heard. In doubtful cases, enter both forms, using the
notation for alternations, e.g. "{they are / they're}".
10. Partial words should be ended with a hyphen. If the transcriber
knows what the word was, the unsaid part should be enclosed in
parentheses, e.g. "San Franc(isco)-". If not, just end the
word with a hyphen, e.g. "... to the f- f- f- initial part ... ".
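The partial-word notation can be picked out with a regular expression, as in the following sketch. It assumes that a partial word ends in a hyphen that is not followed by another letter, so that ordinary hyphenated words such as "forty-seven" are not matched. The names PARTIAL and partial_words are hypothetical.

```python
import re

# Spoken fragment, optional unsaid part in parentheses, then a hyphen
# that is not followed by a letter.
PARTIAL = re.compile(r"\b([A-Za-z']+)(?:\(([A-Za-z']+)\))?-(?![A-Za-z])")


def partial_words(text):
    """Return (spoken fragment, full word or None) for each partial word."""
    results = []
    for m in PARTIAL.finditer(text):
        spoken, unsaid = m.group(1), m.group(2)
        full = spoken + unsaid if unsaid else None
        results.append((spoken, full))
    return results
```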
11. Words that are heard but cannot be identified with certainty,
due to background noise or unclear pronunciation, should be
enclosed in double parentheses, e.g. " ... I ((thought)) the
answer was ...".
This notation should NOT be used to mark questioned spellings.
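Double parentheses are easy to tell apart from the single parentheses used in partial words. A minimal sketch, with hypothetical names UNCLEAR and unclear_words:

```python
import re

# Text enclosed in double parentheses, e.g. ((thought)).
UNCLEAR = re.compile(r"\(\((.+?)\)\)")


def unclear_words(text):
    """Return the words or phrases marked as uncertain."""
    return [w.strip() for w in UNCLEAR.findall(text)]
```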
12. Accent marks and other diacritics need not be transcribed, but
if they are, these markers should be used:
SYMBOL                 USAGE                                   EXAMPLE
\3 (acute)             add acute accent to next letter         resum\3e
\4 (grave)             add grave accent to next letter         Amp\4ere
\5 (circumflex)        add circumflex accent to next letter    r\5ole
\6 (umlaut/dieresis)   add umlaut to next letter               \6Ohman
\7 (tilde)             add tilde to next letter                ma\7nana
\8 (cedilla)           add cedilla to next letter              fa\8cade
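For later processing, these markers can be converted to accented characters. The sketch below assumes Unicode output, which this specification does not itself require: it places the matching Unicode combining mark after the letter that follows the marker and then composes the pair. The names COMBINING, ESCAPE, and decode_accents are hypothetical.

```python
import re
import unicodedata

# Marker digit -> Unicode combining character.
COMBINING = {
    "3": "\u0301",  # acute
    "4": "\u0300",  # grave
    "5": "\u0302",  # circumflex
    "6": "\u0308",  # umlaut/dieresis
    "7": "\u0303",  # tilde
    "8": "\u0327",  # cedilla
}

# A backslash, a digit 3-8, and the letter the accent applies to.
ESCAPE = re.compile(r"\\([3-8])(.)")


def decode_accents(text):
    """Replace \\3..\\8 markers with precomposed accented letters."""
    combined = ESCAPE.sub(lambda m: m.group(2) + COMBINING[m.group(1)], text)
    # NFC normalization merges each letter with its combining mark.
    return unicodedata.normalize("NFC", combined)
```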
13. All spoken words should be spelled out, e.g., Junior, not Jr.,
Saint not St., etc. Abbreviations should not be used.
The exception to this rule is that Mr., Mrs., Messrs., and Ms.
should be represented in their abbreviated form since no
commonly accepted spellings exist for at least some of them.
Likewise, letter and number sequences should be spelled out:
D F W, seven forty-seven, U S A, one O one, F B I, etc., unless the
letter sequence is pronounced as a word, as in NASA, ROM, DOS.
14. Words that have been run together, such as "gonna", "y'all",
"kinda", etc., should be transcribed as the separate words:
"going to", "you all", "kind of".
15. Simultaneous talking, where the speech of two speakers overlaps
in time, should be marked by tagging the beginning and ending
of the overlapping sections with a pound sign (#). The speech
of both talkers should be marked this way, e.g.:
A: I never heard such nonsense, you know, # as I heard that #
B: # Yeah, I know. #
A: day when I blah blah blah
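Since every overlapping section is opened and closed by a pound sign, each turn should contain an even number of them. The sketch below extracts the overlapping sections from the text of one turn and rejects unbalanced markers; it assumes that "#" is not used for any other purpose in the transcription. The function name overlap_segments is hypothetical.

```python
def overlap_segments(turn_text):
    """Return the overlapped stretches of one turn's text."""
    parts = turn_text.split("#")
    # n markers give n + 1 parts, so balanced markers give an odd count.
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced # markers")
    # Parts at odd positions lie between an opening and a closing marker.
    return [p.strip() for p in parts[1::2]]
```

Applied to the example above, the first turn of speaker A yields "as I heard that" and the turn of speaker B yields "Yeah, I know."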
16. If a word or words are clearly heard and understood, but the proper
spelling cannot be determined, an "@" should be prepended to each
word in question. This may occur frequently with proper names.
ALL occurrences of questioned words should
contain this notation, not just the first.
e.g.,
"... Israeli prime minister @Yitzhak @Rabin today and ..."