Voice of America Czech News Speech and Transcripts
1.0 INTRODUCTION
Between February 9 and May 28, 1999, the Linguistic Data Consortium
collected approximately 30 hours of broadcast audio from the Voice of
America news service in Czech. The 62 data files presented in this
corpus represent daily broadcasts of 30-minute news programs, recorded
as single-channel, SPHERE-formatted digital audio files, with a 16 kHz
sample rate and 16-bit linear PCM samples.
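
As a rough illustration of the audio format described above, the following Python sketch reads the ASCII header of a SPHERE file and derives the duration. It assumes the usual NIST SPHERE layout (a first line "NIST_1A", a second line giving the header size in bytes, then "name -type value" lines ending with "end_head"). The header values in DEMO_HEADER are invented for the example and are not copied from a corpus file.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file."""
    magic, size_line = data[:16].split(b"\n")[:2]
    if magic != b"NIST_1A":
        raise ValueError("missing NIST_1A magic line")
    header_size = int(size_line.decode("ascii").strip())
    fields = {"header_size": header_size}
    # Skip the two fixed lines, then read "name -type value" entries.
    for raw in data[:header_size].split(b"\n")[2:]:
        line = raw.decode("ascii", errors="replace").strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # "-sN": string field of N characters
            fields[name] = value
    return fields


def duration_seconds(fields: dict) -> float:
    """Duration implied by the sample count and sample rate."""
    return fields["sample_count"] / fields["sample_rate"]


# Invented header describing a 30-minute, 16 kHz, 16-bit mono recording.
DEMO_HEADER = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"sample_count -i 28800000\n"
    b"end_head\n"
).ljust(1024, b" ")

header = parse_sphere_header(DEMO_HEADER)
print(header["sample_rate"], duration_seconds(header))
```

For a real file, the first 1024 bytes read in binary mode would normally be enough to pass to parse_sphere_header, but the header size line should be checked rather than assumed.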
The transcriptions were created by native Czech speakers, working at
the Department of Cybernetics, University of West Bohemia (UWB) in
Pilsen, under the direction of Josef Psutka and Pavel Ircing. They
used transcription software provided by the LDC (the "transcriber"
package, developed by Edouard Geoffrois and Claude Barras at DGA,
France, with assistance from Zhibiao Wu at the LDC; the package is
currently available from the LDC web site: http://www.ldc.upenn.edu).
The version of transcriber used for this project produced a text file
format which is no longer supported by the current version of the
software; also, the format does not resemble any previous
transcription format published by the LDC. It was therefore decided
to transform the files created at UWB into an SGML format that has
been used previously for other broadcast news transcription corpora.
The transcript files are presented here in a format that was defined
by the speech group at NIST, who refer to it as the "Universal
Transcription Format" (UTF -- not to be confused with the "Unicode
Transformation Formats"). A separate description of the UTF SGML
format is provided in the files "utf.ps" (Postscript) and "utf.pdf"
(Adobe Acrobat), and the formal SGML definition is provided in
"utf.dtd", all in the "doc" directory. A useful summary of the
format, along with additional information about its application to the
VOA Czech transcripts, is provided below.
The transcription text is rendered using the ISO 8859-2 character set.
Information relating this character set to the Unicode standard is
available at http://czyborra.com/charsets/iso8859.html, and from the
Unicode Consortium (http://www.unicode.org).
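
A minimal sketch of how ISO 8859-2 bytes can be converted to Unicode text in Python, using the standard "iso8859_2" codec. The sample bytes spell the Czech word for "Czech language" and were chosen for this example; they are not taken from a transcript file.

```python
def decode_latin2(raw: bytes) -> str:
    """Decode ISO 8859-2 (Latin-2) bytes into a Unicode string."""
    return raw.decode("iso8859_2")


def latin2_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO 8859-2 bytes as UTF-8."""
    return decode_latin2(raw).encode("utf-8")


# 0xE8 is c-with-caron and 0xB9 is s-with-caron in ISO 8859-2.
sample = bytes([0xE8, 0x65, 0xB9, 0x74, 0x69, 0x6E, 0x61])
print(decode_latin2(sample))
```

When reading a whole transcript, passing encoding="iso8859_2" to open() has the same effect.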
Due to technical limitations in the hardware at LDC that was used to
receive the VOA broadcasts via a satellite downlink, a number of files
contain brief portions where the audio signal was interrupted. These
interruptions typically yielded regions of complete silence that
lasted less than two seconds and were scattered sparsely throughout
an affected audio file. Additional markup was provided in the
transcription texts to isolate the regions where these interruptions
occurred.
2.0 DIRECTORIES AND FILES
Each cd-rom in the corpus contains the following items in the
top-level directory:
    readme.txt     -- this text file
    inventory.tbl  -- list of all speech data files
    doc            -- directory containing lexicon and other info
    trans          -- directory containing all transcript files
    speech         -- directory containing a subset of speech files
The contents of the first four items are identical across all cd-roms
-- that is, all the documentation and all transcript files can be
found on any cd-rom in the set. The contents of the speech directory
will vary from one disc to the next, and the "inventory.tbl" file
shows which disc contains each of the speech data files.
The doc directory contains PostScript and Adobe PDF documentation
files describing the NIST UTF transcription format, and the SGML
Document Type Definition file "utf.dtd", which can be used with a
standard SGML parser to process the transcript files. (The
transcripts can of course also be processed as plain text data using
other tools -- an SGML parser is not required.) In addition, the
following two files have been provided by UWB as part of their
transcription project:
vocabulary.tbl
assim_vocab.tbl
The contents of these files are explained below (section 4.0, Lexicon
Files). The doc directory also includes PostScript and PDF documents
that list the phonological segment inventory of Czech, with the
alphabetic symbols used to represent each segment in the lexicon
files.
The trans and speech directories contain the respective data files,
named according to the following pattern:
    voa_cze_YYYYMMDD.utf -- transcript file name
    voa_cze_YYYYMMDD.sph -- speech file name
where YYYYMMDD represents the year, month and day of the recorded
broadcast.
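
The naming pattern can be unpacked with a short Python sketch such as the one below. The function name parse_corpus_name and the file name used in the demo are illustrative choices, not part of the corpus tools.

```python
import re
from datetime import date

# voa_cze_YYYYMMDD.utf (transcript) or voa_cze_YYYYMMDD.sph (speech)
NAME_RE = re.compile(r"^voa_cze_(\d{4})(\d{2})(\d{2})\.(utf|sph)$")


def parse_corpus_name(filename: str):
    """Return (broadcast date, extension) for a corpus data file name."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    year, month, day, ext = match.groups()
    # date() raises ValueError if the digits do not form a real date.
    return date(int(year), int(month), int(day)), ext


def partner_name(filename: str) -> str:
    """Map a transcript name to its speech file name, and vice versa."""
    _, ext = parse_corpus_name(filename)
    other = "sph" if ext == "utf" else "utf"
    return filename[: -len(ext)] + other


print(parse_corpus_name("voa_cze_19990209.utf"))
print(partner_name("voa_cze_19990209.utf"))
```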
3.0 TRANSCRIPTION CONVENTIONS
Briefly stated, the SGML structure of each transcript is as follows:
...
That is, each file is a single "utf" element, containing a single
"bn_episode_trans" element; the latter contains a series of "section"
elements, and some of these contain one or more "turn" elements;
additional time stamps are provided at sentence boundaries within the
longer "turns". All time values are in seconds, measured from the
beginning of the speech data file. The "section" and "turn" contents
are explained in more detail below.
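
Because time values are in seconds from the start of the speech file, they map directly onto sample positions. The sketch below shows the arithmetic for the 16 kHz, 16-bit, single-channel format of this corpus. The 1024-byte header size is an assumption (it is the common SPHERE default) and should be read from each file's header in practice.

```python
SAMPLE_RATE = 16000       # samples per second, per the corpus description
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM, single channel


def seconds_to_sample(t: float) -> int:
    """Index of the sample nearest to time t (in seconds)."""
    return int(round(t * SAMPLE_RATE))


def seconds_to_byte_offset(t: float, header_size: int = 1024) -> int:
    """Byte offset of that sample; header_size is an assumed default."""
    return header_size + seconds_to_sample(t) * BYTES_PER_SAMPLE


def span_in_samples(start: float, end: float):
    """(first sample, number of samples) for a time-stamped region."""
    first = seconds_to_sample(start)
    return first, seconds_to_sample(end) - first


print(seconds_to_sample(1.5), seconds_to_byte_offset(1.5))
```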
3.1 Sections
For each speech data file, the corresponding transcript covers the
full extent of the audio signal, and divides it into "sections",
marked with SGML "<section>" tags. Each section tag contains
attributes to identify its starting time, ending time, and what type
of section it is. There are two types of sections:
A "nontrans" section represents a region of the broadcast that does
not have transcribable content (e.g. an extended region of music,
silence or noise); a "report" section contains utterances by one or
more speakers presenting information.
In the original NIST UTF specification, a "report" section is meant to
span the full extent of a single news story on a given topic, and
"nontrans" sections (as well as a third type, "filler" sections) are
intended to cover regions between news stories that have no
transcribed content (or, with "filler", regions of transcribable
speech that are not part of any news story).
But in the UWB transcription practice, the "nontrans" section tags
were also used to mark regions where the audio signal was interrupted
during the recording. As a result, a "report" section might not
contain a complete news story, and two or more "report" sections that
are separated by brief "nontrans" sections might represent a single
story that happens to contain one or more signal interruptions. The
transcribers provided additional markup to indicate when a "nontrans"
section was being inserted because of a signal interruption, and this
information has been preserved in the UTF format by means of SGML
"comment" lines within the "nontrans" section, as follows:
Either or both of the "prior" and "following" comments may be present
in the "nontrans" section, indicating that the previous and/or
following "report" section is not a complete story; if neither comment
is present, the "nontrans" section should represent an actual program
break between distinct news stories. Also, the transcript may contain
two consecutive "report" sections with no "nontrans" section between
them, indicating a transition from one news story to another without
an interruption or program break.
3.2 Turns
Every "report" section contains one or more "turn" elements,
representing the regions attributed to individual speakers. The
attributes of the SGML "turn" tag indicate the start and end time of
the speaker turn, the speaker's gender, and name (or a unique
identification string where the name is not given during the
broadcast); for example:
When the speaker name is known, it is presented in standard Czech
orthography, including accented characters where appropriate; the same
speaker name can be found in multiple files, and should always
identify the same voice.
When a speaker's name cannot be determined from the broadcast content,
an arbitrary string is provided to identify the speaker; these
"anonymous" speaker names have the form: "spN_MMDD", where "N"
uniquely identifies the speaker within the file, and "MMDD" identifies
the month and date of the broadcast. The same string is used
throughout a given file whenever the same voice occurs in different
turns within the one broadcast, but no attempt has been made to relate
the identity of anonymous speakers across files (for example, speaker
"sp1_0209" may or may not be the same voice as "sp2_0212", because
these two speaker names occur in different files).
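
The "spN_MMDD" convention can be recognized with a simple pattern, as in the following sketch. The function name is an illustrative choice, and the named speaker in the demo is invented.

```python
import re

# "spN_MMDD": N identifies the speaker within one file, MMDD the broadcast.
ANON_RE = re.compile(r"^sp(\d+)_(\d{2})(\d{2})$")


def parse_anonymous_speaker(name: str):
    """Return (N, month, day) for an anonymous speaker id, else None."""
    match = ANON_RE.match(name)
    if match is None:
        return None  # a real speaker name, or some other string
    n, month, day = (int(g) for g in match.groups())
    return n, month, day


def is_anonymous(name: str) -> bool:
    return parse_anonymous_speaker(name) is not None


print(parse_anonymous_speaker("sp1_0209"))
print(is_anonymous("Jan Novák"))
```

Since anonymous ids are only meaningful within one broadcast, any cross-file speaker statistics would presumably need to treat them separately from named speakers.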
Within the span of each "turn" element, the spoken content is
presented in standard Czech orthography, together with conventional
punctuation (periods, commas, question marks). The UWB transcribers
added markup to indicate audible non-speech noises made by the
speaker -- note that these differ from spoken words by virtue of
an initial open-curly-bracket character:
{breath
{lipsmack
{laugh
There is also a special markup for "hesitation sounds" (i.e. "filled
pauses"), commonly found in spontaneous speech -- such sounds are
marked with an initial "%" character and one of three standard
"spellings":
%er
%mm
%um
Because signal interruptions sometimes obliterated portions of words,
the transcribers added markup to indicate word fragments, where either
the initial or the final part of the word is missing from the
recording; for example:
orovi
bo
In some brief regions of signal, a speaker's voice is audible, but the
words are too indistinct to be transcribed reliably; these portions
are marked with double parentheses "(( ))", and in some cases, they
contain the transcriber's best guess as to what was said.
The transcribers also added markup to indicate various kinds of
audible background noise that occurred during speech -- note that
these differ from spoken words by virtue of being all UPPER-CASE, with
an initial open-square-bracket character:
[PAPER_RUSTLE
[NOISE
[DOOR_SLAM
[REMOTE_ENGINE
[CHAIR_SQUEAK
Often, these background noises would span two or more consecutive
spoken words, and markup was provided to identify the extent of speech
affected by noise. In order to simplify the use of these annotations,
the LDC's conversion of the text to UTF format used the following
approach:
- Each spoken word token (and each "curly-brace" non-speech token) is
presented alone on one line in the transcript file; punctuation
marks are also separated from word tokens and are presented alone
on one line.
- If a spoken word token is affected by a background noise, the noise
markup is included on the same line with the word token.
- If a background noise occurs during a pause between words, it is
presented alone on one line.
Except for the "" notation, SGML tags and spoken words are
never placed together on one line. Therefore, each line in a
transcript file can easily be identified as containing exactly one of
the following items:
- an SGML tag
- a noise annotation (with initial curly or square bracket)
- a punctuation mark
- a spoken word (possibly with an adjoining square-bracketed noise)
- double parentheses to indicate an indistinct region of speech
As an additional aid, every line that contains a spoken word,
fragment, punctuation mark or double parentheses begins with a single
initial space character; SGML tag lines always begin with initial "<"
(open angle bracket), and noise annotation lines begin with initial
"[" or "{".
3.3 Additional conventions
- All spoken numeric words have been transcribed orthographically,
not with digit characters.
- If an acronym was spoken as a word (e.g. "NATO") it is written as a
word ("NATO"); if it was spoken as a sequence of letters (e.g.
"NSF") it is written according to the actual Czech pronunciation,
either "N S F" or "EN ES EF".
- Mispronounced but intelligible words are marked with an initial "+"
character; e.g. the word "smluvili" mispronounced as "smlouvili" is
written as +smlouvili in the transcript (note: for these
mispronounced items, the spelling reflects the incorrect
pronunciation that was observed, NOT the standard orthography of
the word that was intended).
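
The line-level conventions of sections 3.2 and 3.3 lend themselves to a simple classifier keyed on the leading characters of each line. The sketch below is one possible reading of those conventions, not an official parser: the category names are invented here, the demo lines are made up rather than copied from a transcript, and the treatment of "%" hesitation lines as space-prefixed word lines is an assumption.

```python
PUNCTUATION = {".", ",", "?"}


def classify_line(line: str) -> str:
    """Assign one transcript line to a coarse category."""
    if line.startswith("<"):
        return "tag"                  # SGML tag line
    if line.startswith(("[", "{")):
        return "noise"                # background or speaker noise alone
    token = line.strip()
    if not token:
        return "empty"
    if token.startswith("(("):
        return "indistinct"           # unintelligible region of speech
    if token in PUNCTUATION:
        return "punctuation"
    if token.startswith("%"):
        return "hesitation"           # filled pause such as %er, %mm, %um
    if token.startswith("+"):
        return "mispronounced"        # spelled as actually pronounced
    return "word"                     # may carry an adjoining [NOISE mark


demo_lines = ["</turn>", "[NOISE", "{breath", " agentura", " .",
              " %um", " +smlouvili", " (("]
for demo in demo_lines:
    print(repr(demo), classify_line(demo))
```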
4.0 LEXICON FILES
Two vocabularies were built from words present in the annotated texts.
Each vocabulary file has one word entry per line, consisting of two
columns separated by a single tab character:
- Column one is the orthographic form of the word as it appears in
the transcript (including an attached "" tag or "+"
marker, where these occur).
- Column two is the corresponding phonemic form of the word, in which
the phoneme segments are each represented by one or two ASCII
characters, and are separated by single space characters.
Here are a few sample entries:
    agenti      a g e n tj i
    agenti      a g e nj tj i
    agentura    a g e n t u r a
    agentura    a g e n t uu r a
    agenty      a g e n t i
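
A lexicon in this two-column layout can be loaded into a dictionary that groups all pronunciations under each headword. The sketch below assumes the single-tab separator described above; the function name load_lexicon is an illustrative choice, and the demo lines reuse the sample entries.

```python
from collections import defaultdict


def load_lexicon(lines):
    """Map each headword to a list of pronunciations (lists of phonemes)."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Column one: orthographic form; column two: phonemes split by spaces.
        word, _, pron = line.partition("\t")
        lexicon[word].append(pron.split(" "))
    return dict(lexicon)


sample_lines = [
    "agenti\ta g e n tj i\n",
    "agenti\ta g e nj tj i\n",
    "agentura\ta g e n t u r a\n",
    "agentura\ta g e n t uu r a\n",
    "agenty\ta g e n t i\n",
]
lexicon = load_lexicon(sample_lines)
print(len(lexicon), lexicon["agenti"])
```

For the real files, the lines would come from something like open(path, encoding="iso8859_2"), assuming the lexicon files use the same character set as the transcripts.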
Two distinct lexicon files are provided:
a) The file "vocabulary.tbl" contains all the "intrinsic" phonetic
forms for the 25,269 headwords -- that is, the (single or multiple)
pronunciations that are possible for each word, regardless of the
phrasal context (e.g. vowel length variation, or other alternations:
"a k t i v a" vs. "a k t ii v a", "s e d m" vs. "s e d u m", etc.).
b) The file "vocab_assim.tbl" contains all the same headwords and
pronunciations as "vocabulary.tbl", plus additional pronunciations for
many words that would result from particular phrasal contexts
(e.g. voicing of final voiceless consonants when the following word
begins with a voiced phoneme: "n a d j e s t" vs. "n a d j e z d").
Foreign words (names etc.) were transcribed as they were heard by the
human transcribers (so the same foreign word may have two or more
distinct phonetic transcriptions if it was present in two or more
recordings that were transcribed by different people).
The following tables summarize the distribution of multiple
pronunciations in each of the lexicon files:
Number of headwords with ... this many pronunciations
in vocabulary.tbl:
    23322   one pronunciation
     1831   two pronunciations
       66   three     "
       36   four      "
        4   five      "
        3   six       "
        1   seven     "
        3   eight     "
        1   nine      "
        1   twelve    "
        1   thirteen  "
    25269   headwords, 27429 total entries
in vocab_assim.tbl:
    19660   one pronunciation
     5202   two pronunciations
       55   three       "
      310   four        "
        6   five        "
       21   six         "
        2   seven       "
        9   eight       "
        1   nine        "
        1   twelve      "
        1   fourteen    "
        1   twenty-six  "
    25269   headwords, 31772 total entries
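
The totals in these tables follow from the rows: the headword count is the sum of the first column, and the entry count is the sum of each row's headword count multiplied by its number of pronunciations. The sketch below checks this for the vocabulary.tbl figures and shows how the same distribution could be computed from a loaded lexicon. The dictionary shape (headword mapped to a list of pronunciations) is an assumption of this example.

```python
from collections import Counter

# Pronunciations per headword -> number of headwords (vocabulary.tbl table).
VOCABULARY_DISTRIBUTION = {
    1: 23322, 2: 1831, 3: 66, 4: 36, 5: 4, 6: 3,
    7: 1, 8: 3, 9: 1, 12: 1, 13: 1,
}


def totals(distribution: dict):
    """Return (number of headwords, number of lexicon entries)."""
    headwords = sum(distribution.values())
    entries = sum(n * count for n, count in distribution.items())
    return headwords, entries


def distribution_of(lexicon: dict) -> dict:
    """Tabulate how many headwords have each number of pronunciations."""
    return dict(Counter(len(prons) for prons in lexicon.values()))


print(totals(VOCABULARY_DISTRIBUTION))
```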
Note that the headwords in both lexicon files are rendered in
mono-case (lower-case only), whereas the transcription files use
conventional rules for capitalizing proper names and the initial word
of each sentence.
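
Given this case difference, a transcript token generally needs to be lower-cased before it is looked up in either lexicon. A minimal sketch, assuming the lexicon has been loaded as a dictionary from headword to pronunciations and that the text has already been decoded to Unicode (the toy entries below are for illustration only):

```python
def lookup(lexicon: dict, token: str) -> list:
    """Find pronunciations for a transcript token, ignoring capitalization."""
    # str.lower() handles accented Czech letters once text is Unicode.
    return lexicon.get(token.lower(), [])


toy_lexicon = {
    "agentura": [["a", "g", "e", "n", "t", "u", "r", "a"]],
}
print(lookup(toy_lexicon, "Agentura"))
print(lookup(toy_lexicon, "Praha"))
```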