Documentation for Voice of America (VOA) Broadcast News Czech Audio

1.0 Introduction

This file contains documentation on the VOA Broadcast News Czech Audio Corpus, Linguistic Data Consortium (LDC) catalog number LDC2000S89 and ISBN 1-58563-179-5. We have included below, as reference material, the documentation for the separate VOA Broadcast News Czech Transcripts Corpus, also produced by the LDC as catalog number LDC2000T53 and ISBN 1-58563-180-9.

Between February 9 and May 28, 1999, the Linguistic Data Consortium collected approximately 30 hours of broadcast audio from the Voice of America news service in Czech. The 62 data files presented in this corpus represent the audio of the daily broadcasts of 30-minute news programs.

The transcriptions were created by native Czech speakers, working at the Department of Cybernetics, University of West Bohemia (UWB) in Pilsen, under the direction of Josef Psutka and Pavel Ircing. They used transcription software provided by the LDC (the "transcriber" package, developed by Edouard Geoffrois and Claude Barras at DGA, France, with assistance from Zhibiao Wu at the LDC); the package is currently available from the LDC web site: www.ldc.upenn.edu.

The version of transcriber used for this project produced a text file format which is no longer supported by the current version of the software; also, the format does not resemble any previous transcription format published by the LDC. It was therefore decided to transform the files created at UWB into an SGML format that has been used previously for other broadcast news transcription corpora.

The transcript files are presented here in a format that was defined by the speech group at NIST, who refer to it as the "Universal Transcription Format" (UTF -- not to be confused with the "Unicode Transformation Formats"). A separate description of the UTF SGML format is provided in the files "utf.ps" (PostScript) and "utf.pdf" (Adobe Acrobat), and the formal SGML definition is provided in "utf.dtd", all in the "doc" directory. A useful summary of the format, along with additional information about its application to the VOA Czech transcripts, is provided below.

The transcription text is rendered using the ISO 8859-2 character set. Information relating this character set to the Unicode standard is available at czyborra.com/charsets/iso8859.html, and from the Unicode Consortium at www.unicode.org.
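As a minimal sketch of working with this encoding, the Python below decodes ISO 8859-2 bytes into Unicode strings. The function names and any path passed to `read_latin2` are illustrative assumptions, not part of the corpus.

```python
from pathlib import Path


def decode_latin2(raw: bytes) -> str:
    """Decode ISO 8859-2 (Latin-2) bytes into a Unicode string."""
    return raw.decode("iso-8859-2")


def read_latin2(path: str) -> str:
    """Read a whole transcript or lexicon file as ISO 8859-2 text.

    The path is supplied by the caller; no file name is assumed here.
    """
    return Path(path).read_text(encoding="iso-8859-2")


# Bytes 0xE8 and 0xB9 are the ISO 8859-2 codes for "č" and "š".
sample = decode_latin2(b"\xe8e\xb9tina")
```

Decoding with the wrong character set (for example ISO 8859-1) would not raise an error but would silently produce the wrong accented letters, so it is worth naming the encoding explicitly.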

Due to technical limitations in the hardware at LDC that was used to receive the VOA broadcasts via a satellite downlink, a number of files contain brief portions where the audio signal was interrupted. These interruptions typically yielded regions of complete silence that lasted less than two seconds and were scattered sparsely throughout an affected audio file. Additional markup was provided in the transcription texts to isolate the regions where these interruptions occurred.

2.0 Data Structure

The sixty-two audio files in this corpus are single-channel, 16 kHz, 16-bit linear SPHERE files.
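The following is a hedged sketch of reading the ASCII header of such a file. It assumes the conventional NIST SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" lines ending with "end_head"); the field names used here (`sample_rate`, `sample_count`, and so on) are the customary ones and are not confirmed by this document. The example header is synthetic.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header at the start of a SPHERE file.

    `raw` should hold at least the full header (commonly the first 1024 bytes).
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if len(lines) < 2 or lines[0].strip() != "NIST_1A":
        raise ValueError("missing NIST_1A magic line")
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN" string fields
            fields[name] = value
    return fields


def duration_seconds(fields: dict) -> float:
    """Duration implied by the sample count and sample rate."""
    return fields["sample_count"] / fields["sample_rate"]


# Synthetic header for a 30-minute, single-channel, 16 kHz, 16-bit file.
example = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_count -i 28800000\n"
    b"end_head\n"
)
info = parse_sphere_header(example)
```

With a real file, one would read the first bytes in binary mode (for example `open(path, "rb").read(1024)`) and pass them to `parse_sphere_header`; the audio samples begin at the offset given by `header_size`.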

3.0 File Structure

This is an FTP publication with the following file structure:

index.html            this HTML file
doc                   directory containing lexicon and other information
    vocab_assim.tbl   lexicon file, explained in section 5.0, Lexicon Files
    vocabulary.tbl    lexicon file, explained in section 5.0, Lexicon Files
    phon_seg_tbl.pdf  PDF documentation that lists the phonological segment inventory of Czech, with the alphabetic symbols used to represent each segment in the lexicon files
    phon_seg_tbl.ps   PostScript version of the same phonological segment inventory documentation
    utf.pdf           PDF documentation describing the NIST UTF transcription format
    utf.ps            PostScript documentation describing the NIST UTF transcription format
    utf.dtd           SGML Document Type Definition file
audio                 directory containing all audio files

The audio directory contains the data files, named according to the following pattern:

voa_cze_YYYYMMDD.sph

where YYYYMMDD represents the year, month and day of the recorded broadcast.
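The naming pattern can be turned into a broadcast date with a short helper. This is a sketch only; the function name is hypothetical and the file name in the example is illustrative.

```python
import re
from datetime import date

# Pattern taken from the naming convention: voa_cze_YYYYMMDD.sph
_NAME_RE = re.compile(r"^voa_cze_(\d{4})(\d{2})(\d{2})\.sph$")


def broadcast_date(filename: str) -> date:
    """Return the broadcast date encoded in an audio file name."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


example = broadcast_date("voa_cze_19990209.sph")
```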

4.0 Transcription Conventions

Briefly stated, the SGML structure of each transcript is as follows:

<utf ...>
 <bn_episode_trans ...>
  <section type="nontrans"...>
  </section>
  <section type="report"...>
   <turn ...>
    ...
    <time sec="67.468">
    ...
   </turn>
   ...
  </section>
  ...
 </bn_episode_trans>
</utf>

That is, each file is a single "utf" element, containing a single "bn_episode_trans" element; the latter contains a series of "section" elements, and some of these contain one or more "turn" elements; additional time stamps are provided at sentence boundaries within the longer "turns". All time values are in seconds, measured from the beginning of the speech data file. The "section" and "turn" contents are explained in more detail below.
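As a small illustration of the time-stamp convention, the sketch below collects the values of the "time" tags from the lines of a transcript. It assumes the attribute is written exactly as in the outline above (`<time sec="...">`); the sample lines are invented for the example and are not taken from the corpus.

```python
import re

# Matches tags of the form <time sec="67.468">
_TIME_RE = re.compile(r'<time\s+sec="([0-9.]+)"\s*>')


def time_stamps(lines):
    """Return all sentence-boundary time stamps, in seconds, in file order."""
    stamps = []
    for line in lines:
        match = _TIME_RE.search(line)
        if match:
            stamps.append(float(match.group(1)))
    return stamps


# Invented sample lines that follow the layout described in this document.
sample_lines = [
    '<turn startTime="60.000" endTime="75.250" speaker="sp1_0209" spkrtype="male">',
    " agentura",
    " .",
    '<time sec="67.468">',
    " agenti",
    "</turn>",
]
stamps = time_stamps(sample_lines)
```

Because all values are offsets from the start of the speech data file, they can be used directly to index into the corresponding audio.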

4.1 Sections

For each speech data file, the corresponding transcript covers the full extent of the audio signal, and divides it into "sections", marked with SGML "<section ...>" tags. Each section tag contains attributes to identify its starting time, ending time, and what type of section it is. There are two types of sections:

<section type="nontrans" startTime=... endTime=...>
<section type="report" startTime=... endTime=...>

A "nontrans" section represents a region of the broadcast that does not have transcribable content (e.g. an extended region of music, silence or noise); a "report" section contains utterances by one or more speakers presenting information.

In the original NIST UTF specification, a "report" section is meant to span the full extent of a single news story on a given topic, and "nontrans" sections (as well as a third type, "filler" sections) are intended to cover regions between news stories that have no transcribed content (or, with "filler", regions of transcribable speech that are not part of any news story).

But in the UWB transcription practice, the "nontrans" section tags were also used to mark regions where the audio signal was interrupted during the recording. As a result, a "report" section might not contain a complete news story, and two or more "report" sections that are separated by brief "nontrans" sections might represent a single story that happens to contain one or more signal interruptions. The transcribers provided additional markup to indicate when a "nontrans" section was being inserted because of a signal interruption, and this information has been preserved in the UTF format by means of SGML "comment" lines within the "nontrans" section, as follows:

<section type="nontrans" startTime="..." endTime="...">
<!--prior signal interrupted -->
<!--following signal interrupted -->
</section>

Either or both of the "prior" and "following" comments may be present in the "nontrans" section, indicating that the previous and/or following "report" section is not a complete story; if neither comment is present, the "nontrans" section should represent an actual program break between distinct news stories. Also, the transcript may contain two consecutive "report" sections with no "nontrans" section between them, indicating a transition from one news story to another without an interruption or program break.
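The distinction between signal interruptions and genuine program breaks can be recovered mechanically. The sketch below assumes that each tag and each comment sits on its own line, as in the example above, and that the time attributes are decimal seconds; the function name and the sample values are hypothetical.

```python
import re

_ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def nontrans_sections(lines):
    """Describe each "nontrans" section and whether it marks an interruption."""
    results = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("<section"):
            attrs = dict(_ATTR_RE.findall(line))
            current = None
            if attrs.get("type") == "nontrans":
                current = {
                    "start": float(attrs["startTime"]),
                    "end": float(attrs["endTime"]),
                    "prior_interrupted": False,
                    "following_interrupted": False,
                }
        elif line.startswith("</section"):
            if current is not None:
                # Neither comment present: an actual program break.
                current["program_break"] = not (
                    current["prior_interrupted"] or current["following_interrupted"]
                )
                results.append(current)
            current = None
        elif current is not None and line.startswith("<!--"):
            if "prior signal interrupted" in line:
                current["prior_interrupted"] = True
            if "following signal interrupted" in line:
                current["following_interrupted"] = True
    return results


# Invented sample: one interruption, then one genuine program break.
sample = [
    '<section type="nontrans" startTime="120.500" endTime="121.750">',
    "<!--prior signal interrupted -->",
    "<!--following signal interrupted -->",
    "</section>",
    '<section type="nontrans" startTime="300.000" endTime="310.000">',
    "</section>",
]
sections = nontrans_sections(sample)
```

A consumer that needs complete news stories could merge neighbouring "report" sections whenever the "nontrans" section between them has `program_break` set to false.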

4.2 Turns

Every "report" section contains one or more "turn" elements, representing the regions attributed to individual speakers. The attributes of the SGML "turn" tag indicate the start and end time of the speaker turn, the speaker's gender, and name (or a unique identification string where the name is not given during the broadcast); for example:

<turn startTime="..." endTime="..." speaker="sp2..." spkrtype="male">

When the speaker name is known, it is presented in standard Czech orthography, including accented characters where appropriate; the same speaker name can be found in multiple files, and should always identify the same voice.

When a speaker's name cannot be determined from the broadcast content, an arbitrary string is provided to identify the speaker; these "anonymous" speaker names have the form: "spN_MMDD", where "N" uniquely identifies the speaker within the file, and "MMDD" identifies the month and date of the broadcast. The same string is used throughout a given file whenever the same voice occurs in different turns within the one broadcast, but no attempt has been made to relate the identity of anonymous speakers across files (for example, speaker "sp1_0209" may or may not be the same voice as "sp2_0212", because these two speaker names occur in different files).
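A hedged sketch of reading the "turn" attributes follows. It assumes the attribute names shown in the example tag above and quoted attribute values; the speaker name "Jan Novák" is a made-up placeholder, and the function name is hypothetical.

```python
import re

_ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
# Anonymous speakers have the form spN_MMDD.
_ANON_RE = re.compile(r"^sp(\d+)_(\d{2})(\d{2})$")


def parse_turn(tag: str) -> dict:
    """Extract timing, speaker and gender from a <turn ...> tag."""
    attrs = dict(_ATTR_RE.findall(tag))
    speaker = attrs.get("speaker", "")
    return {
        "start": float(attrs["startTime"]),
        "end": float(attrs["endTime"]),
        "speaker": speaker,
        "gender": attrs.get("spkrtype"),
        # Anonymous labels identify a voice only within a single file.
        "anonymous": _ANON_RE.match(speaker) is not None,
    }


anon = parse_turn(
    '<turn startTime="12.5" endTime="30.25" speaker="sp2_0209" spkrtype="male">'
)
named = parse_turn(
    '<turn startTime="30.25" endTime="41.0" speaker="Jan Novák" spkrtype="male">'
)
```

When pooling speakers across files, only turns with `anonymous` set to false can safely be grouped by name.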

Within the span of each "turn" element, the spoken content is presented in standard Czech orthography, together with conventional punctuation (periods, commas, question marks). The UWB transcribers added markup to indicate audible non-speech noises made by the speaker -- note that these differ from spoken words by virtue of an initial open-curly-bracket character:

{breath
{lipsmack
{laugh

There is also a special markup for "hesitation sounds" (i.e. "filled pauses"), commonly found in spontaneous speech -- such sounds are marked with an initial "%" character and one of three standard "spellings":

%er
%mm
%um

Because signal interruptions sometimes obliterated portions of words, the transcribers added markup to indicate word fragments, where either the initial or the final part of the word is missing from the recording; for example:

<fragment>orovi
bo<fragment>

In some brief regions of signal, a speaker's voice is audible, but the words are too indistinct to be transcribed reliably; these portions are marked with double parentheses "(( ))", and in some cases, they contain the transcriber's best guess as to what was said.

The transcribers also added markup to indicate various kinds of audible background noise that occurred during speech -- note that these differ from spoken words by virtue of being all UPPER-CASE, with an initial open-square-bracket character:

[PAPER_RUSTLE
[NOISE
[DOOR_SLAM
[REMOTE_ENGINE
[CHAIR_SQUEAK

Often, these background noises would span two or more consecutive spoken words, and markup was provided to identify the extent of speech affected by noise. In order to simplify the use of these annotations, the LDC's conversion of the text to UTF format used the following approach:

- Each spoken word token (and each "curly-brace" non-speech token) is presented alone on one line in the transcript file; punctuation marks are also separated from word tokens and are presented alone on one line.

- If a spoken word token is affected by a background noise, the noise markup is included on the same line with the word token.

- If a background noise occurs during a pause between words, it is presented alone on one line.

Except for the "<fragment>" notation, SGML tags and spoken words are never placed together on one line. Therefore, each line in a transcript file can easily be identified as containing exactly one of the following items:

- an SGML tag

- a noise annotation (with initial curly or square bracket)

- a punctuation mark

- a spoken word (possibly with an adjoining square-bracketed noise)

- double parentheses to indicate an indistinct region of speech

As an additional aid, the lines that contain a spoken word, fragment, punctuation mark or double parentheses, all begin with a single initial space character; SGML tag lines always begin with initial "<" (open angle bracket), and noise annotation lines begin with initial "[" or "{".

4.3 Additional conventions

- All spoken numeric words have been transcribed orthographically, not with digit characters.

- If an acronym was spoken as a word (e.g. "NATO") it is written as a word ("NATO"); if it was spoken as a sequence of letters (e.g. "NSF") it is written according to the actual Czech pronunciation, either "N S F" or "EN ES EF".

- Mispronounced but intelligible words are marked with an initial "+" character; e.g. the word "smluvili" mispronounced as "smlouvili" is written as +smlouvili in the transcript (note: for these mispronounced items, the spelling reflects the incorrect pronunciation that was observed, NOT the standard orthography of the word that was intended).
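The line-level conventions of sections 4.2 and 4.3 can be sketched as a small classifier. The category labels are hypothetical names chosen for this example, and the set of punctuation marks is limited to those listed in section 4.2; how hesitation sounds are indented in the files is not stated, so the sketch strips leading space before testing for them.

```python
PUNCTUATION = {".", ",", "?"}


def classify_line(line: str) -> str:
    """Assign one transcript line to a category based on its first characters."""
    line = line.rstrip("\r\n")
    if not line.strip():
        return "blank"
    if line.startswith("<"):
        return "sgml_tag"  # tag lines start with "<" in the first column
    if line.startswith(("[", "{")):
        return "noise"  # stand-alone background or speaker noise
    token = line.strip()  # word-like lines begin with a single space
    if token in PUNCTUATION:
        return "punctuation"
    if token.startswith("(("):
        return "indistinct"
    if "<fragment>" in token:
        return "fragment"
    if token.startswith("%"):
        return "hesitation"
    if token.startswith("+"):
        return "mispronounced"
    return "word"  # possibly with an adjoining square-bracketed noise


labels = [
    classify_line(text)
    for text in ["{breath", "[NOISE", " .", " <fragment>orovi", " bo<fragment>",
                 " %er", " +smlouvili", " agentura", " (( ))"]
]
```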

5.0 Lexicon Files

Two vocabularies were built from words present in the annotated texts. Each vocabulary file has one word entry per line, consisting of two columns separated by a single tab character:

- Column one is the orthographic form of the word as it appears in the transcript (including an attached "<fragment>" tag or "+" marker, where these occur).

- Column two is the corresponding phonemic form of the word, in which the phoneme segments are each represented by one or two ASCII characters, and are separated by single space characters.

Here are a few sample entries:

agenti	a g e n tj i
agenti	a g e nj tj i
agentura	a g e n t u r a
agentura	a g e n t uu r a
agenty	a g e n t i

Two distinct lexicon files are provided:

a) The file "vocabulary.tbl" contains all the "intrinsic" phonetic forms for the 25,269 headwords -- that is, the (single or multiple) pronunciations that are possible for each word, regardless of the phrasal context (e.g. vowel length variation, or other alternations: "a k t i v a" vs. "a k t ii v a", "s e d m" vs. "s e d u m", etc.).

b) The file "vocab_assim.tbl" contains all the same headwords and pronunciations as "vocabulary.tbl", plus additional pronunciations for many words that would result from particular phrasal contexts (e.g. voicing of final voiceless consonants when the following word begins with a voiced phoneme: "n a d j e s t" vs. "n a d j e z d").
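Either lexicon file can be loaded into a mapping from headword to a list of pronunciations. The following is a minimal sketch based on the two-column, tab-separated layout described above; the function name is hypothetical, and the sample lines are the entries quoted earlier in this section.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each headword to its list of pronunciations (lists of phoneme symbols)."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        headword, _, pronunciation = line.partition("\t")
        # split() tolerates any extra padding around the phoneme symbols
        lexicon[headword.strip()].append(pronunciation.split())
    return dict(lexicon)


sample_lines = [
    "agenti\ta g e n tj i",
    "agenti\ta g e nj tj i",
    "agentura\ta g e n t u r a",
    "agentura\ta g e n t uu r a",
    "agenty\ta g e n t i",
]
lexicon = parse_lexicon(sample_lines)
```

For a real file, the lines would come from a handle opened with the ISO 8859-2 encoding, for example `open(path, encoding="iso-8859-2")`.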

Foreign words (names etc.) were transcribed as they were heard by the human transcribers (so the same foreign word may have two or more distinct phonetic transcriptions if it was present in two or more recordings that were transcribed by different people).

The following tables summarize the distribution of multiple pronunciations in each of the lexicon files:

 
             Number of headwords with each number of pronunciations

	in vocabulary.tbl:
				23322	one pronunciation
				 1831	two pronunciations
				   66	three      "
				   36	four	   "
				    4	five	   "
				    3	six	   "
				    1	seven	   "
				    3	eight	   "
				    1	nine	   "
				    1	twelve	   "
				    1	thirteen   "
				25269 headwords, 27429 total entries

	in vocab_assim.tbl:
				19660	one pronunciation
				 5202	two pronunciations
				   55	three      "
				  310	four	   "
				    6	five	   "
				   21	six	   "
				    2	seven	   "
				    9	eight	   "
				    1	nine	   "
				    1	twelve	   "
				    1	fourteen   "
				    1	twenty-six "
				25269 headwords, 31772 total entries
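A distribution of this kind can be recomputed from a lexicon file, and the stated totals can be checked against the per-row counts. The sketch below is illustrative only: the function names are hypothetical, and the dictionary simply copies the vocabulary.tbl rows from the table above.

```python
from collections import Counter


def pronunciation_distribution(entries):
    """Count how many headwords have 1, 2, 3, ... pronunciations.

    `entries` is an iterable of (headword, pronunciation) pairs, one per line
    of a lexicon file.
    """
    per_headword = Counter(headword for headword, _ in entries)
    return Counter(per_headword.values())


def totals(distribution):
    """Return (number of headwords, number of entries) for a distribution."""
    headwords = sum(distribution.values())
    entries = sum(n * count for n, count in distribution.items())
    return headwords, entries


# Rows of the vocabulary.tbl table: pronunciations per headword -> headword count.
vocabulary_rows = {1: 23322, 2: 1831, 3: 66, 4: 36, 5: 4, 6: 3,
                   7: 1, 8: 3, 9: 1, 12: 1, 13: 1}
vocabulary_totals = totals(vocabulary_rows)

# The five sample entries from this section, as (headword, pronunciation) pairs.
sample = [("agenti", "a g e n tj i"), ("agenti", "a g e nj tj i"),
          ("agentura", "a g e n t u r a"), ("agentura", "a g e n t uu r a"),
          ("agenty", "a g e n t i")]
sample_distribution = pronunciation_distribution(sample)
```

The vocabulary.tbl rows reproduce the stated totals of 25269 headwords and 27429 entries.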

Note that the headwords in both lexicon files are rendered in mono-case (lower-case only), whereas the transcription files use conventional rules for capitalizing proper names and the initial word of each sentence.

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus, LDC2000S89.

Content Copyright

Portions © 1999 Voice of America


Contact: ldc@ldc.upenn.edu
© 1996-2000 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.