Voice of America Czech News Speech and Transcripts

1.0 INTRODUCTION

Between February 9 and May 28, 1999, the Linguistic Data Consortium collected approximately 30 hours of broadcast audio from the Voice of America news service in Czech. The 62 data files presented in this corpus represent daily broadcasts of 30-minute news programs, recorded as single-channel, SPHERE-formatted digital audio files, with 16 kHz sample rate and 16-bit linear PCM samples.

The transcriptions were created by native Czech speakers, working at the Department of Cybernetics, University of West Bohemia (UWB) in Pilsen, under the direction of Josef Psutka and Pavel Ircing. They used transcription software provided by the LDC (the "transcriber" package, developed by Edouard Geoffrois and Claude Barras at DGA, France, with assistance from Zhibiao Wu at the LDC; the package is currently available from the LDC web site: http://www.ldc.upenn.edu).

The version of transcriber used for this project produced a text file format which is no longer supported by the current version of the software; also, the format does not resemble any previous transcription format published by the LDC. It was therefore decided to transform the files created at UWB into an SGML format that has been used previously for other broadcast news transcription corpora.

The transcript files are presented here in a format that was defined by the speech group at NIST, who refer to it as the "Universal Transcription Format" (UTF -- not to be confused with the "Unicode Transformation Formats"). A separate description of the UTF SGML format is provided in the files "utf.ps" (Postscript) and "utf.pdf" (Adobe Acrobat), and the formal SGML definition is provided in "utf.dtd", all in the "doc" directory. A useful summary of the format, along with additional information about its application to the VOA Czech transcripts, is provided below.

The transcription text is rendered using the ISO 8859-2 character set.
Information relating this character set to the Unicode standard is available at http://czyborra.com/charsets/iso8859.html, and from the Unicode Consortium (http://www.unicode.org).

Due to technical limitations in the hardware at LDC that was used to receive the VOA broadcasts via a satellite downlink, a number of files contain brief portions where the audio signal was interrupted. These interruptions typically yielded regions of complete silence that lasted less than two seconds and were scattered sparsely throughout an affected audio file. Additional markup was provided in the transcription texts to isolate the regions where these interruptions occurred.

2.0 DIRECTORIES AND FILES

Each cd-rom in the corpus contains the following items in the top-level directory:

    readme.txt     -- this text file
    inventory.tbl  -- list of all speech data files
    doc            -- directory containing lexicon and other info
    trans          -- directory containing all transcript files
    speech         -- directory containing a subset of speech files

The contents of the first four items are identical across all cd-roms -- that is, all the documentation and all transcript files can be found on any cd-rom in the set. The contents of the speech directory will vary from one disc to the next, and the "inventory.tbl" file shows which disc contains each of the speech data files.

The doc directory contains PostScript and Adobe PDF documentation files describing the NIST UTF transcription format, and the SGML Document Type Definition file "utf.dtd", which can be used with a standard SGML parser to process the transcript files. (The transcripts can of course also be processed as plain text data using other tools -- an SGML parser is not required.) In addition, the following two files have been provided by UWB as part of their transcription project:

    vocabulary.tbl
    assim_vocab.tbl

The contents of these files are explained below (section 4.0, Lexicon Files).
The doc directory also includes PostScript and PDF documents that list the phonological segment inventory of Czech, with the alphabetic symbols used to represent each segment in the lexicon files.

The trans and speech directories contain the respective data files, named according to the following pattern:

    voa_cze_YYYYMMDD.utf  -- transcript file name
    voa_cze_YYYYMMDD.sph  -- speech file name

where YYYYMMDD represents the year, month and day of the recorded broadcast.

3.0 TRANSCRIPTION CONVENTIONS

Briefly stated, the SGML structure of each transcript is as follows:
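
    <utf ...>
    <bn_episode_trans ...>
    <section ...>
    <turn ...>
    ...
    </turn>
    ...
    </section>
    ...
    </bn_episode_trans>
    </utf>
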
That is, each file is a single "utf" element, containing a single "bn_episode_trans" element; the latter contains a series of "section" elements, and some of these contain one or more "turn" elements; additional time stamps are provided at sentence boundaries within the longer "turns". All time values are in seconds, measured from the beginning of the speech data file. The "section" and "turn" contents are explained in more detail below.

3.1 Sections

For each speech data file, the corresponding transcript covers the full extent of the audio signal, and divides it into "sections", marked with SGML "<section>" tags. Each section tag contains attributes to identify its starting time, ending time, and what type of section it is. There are two types of sections:
A "nontrans" section represents a region of the broadcast that does not have transcribable content (e.g. an extended region of music, silence or noise); a "report" section contains utterances by one or more speakers presenting information. In the original NIST UTF specification, a "report" section is meant to span the full extent of a single news story on a given topic, and "nontrans" sections (as well as a third type, "filler" sections) are intended to cover regions between news stories that have no transcribed content (or, with "filler", regions of transcribable speech that are not part of any news story). But in the UWB transcription practice, the "nontrans" section tags were also used to mark regions where the audio signal was interrupted during the recording. As a result, a "report" section might not contain a complete news story, and two or more "report" sections that are separated by brief "nontrans" sections might represent a single story that happens to contain one or more signal interruptions. The transcribers provided additional markup to indicate when a "nontrans" section was being inserted because of a signal interruption, and this information has been preserved in the UTF format by means of SGML "comment" lines within the "nontrans" section, as follows:
Either or both of the "prior" and "following" comments may be present in the "nontrans" section, indicating that the previous and/or following "report" section is not a complete story; if neither comment is present, the "nontrans" section should represent an actual program break between distinct news stories. Also, the transcript may contain two consecutive "report" sections with no "nontrans" section between them, indicating a transition from one news story to another without an interruption or program break.

3.2 Turns

Every "report" section contains one or more "turn" elements, representing the regions attributed to individual speakers. The attributes of the SGML "turn" tag indicate the start and end time of the speaker turn, the speaker's gender, and name (or a unique identification string where the name is not given during the broadcast); for example:

When the speaker name is known, it is presented in standard Czech orthography, including accented characters where appropriate; the same speaker name can be found in multiple files, and should always identify the same voice.

When a speaker's name cannot be determined from the broadcast content, an arbitrary string is provided to identify the speaker; these "anonymous" speaker names have the form "spN_MMDD", where "N" uniquely identifies the speaker within the file, and "MMDD" identifies the month and day of the broadcast. The same string is used throughout a given file whenever the same voice occurs in different turns within the one broadcast, but no attempt has been made to relate the identity of anonymous speakers across files (for example, speaker "sp1_0209" may or may not be the same voice as "sp2_0212", because these two speaker names occur in different files).

Within the span of each "turn" element, the spoken content is presented in standard Czech orthography, together with conventional punctuation (periods, commas, question marks).
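
As a minimal sketch of the speaker-naming convention above (not part of the corpus tools), the following Python fragment tests whether a speaker name has the anonymous "spN_MMDD" form and splits it into its parts. The function name is invented for illustration, and the pattern assumes that "N" is written as one or more decimal digits, which this document does not state explicitly.

```python
import re

# Assumed shape of an anonymous speaker name: "sp", a decimal speaker
# number, "_", then a two-digit month and a two-digit day.
ANON_SPEAKER = re.compile(r"sp(\d+)_(\d{2})(\d{2})")


def parse_anonymous_speaker(name):
    """Return (speaker_number, month, day) for an anonymous speaker name,
    or None when the name does not have the "spN_MMDD" form (for example,
    a real speaker name written in Czech orthography)."""
    match = ANON_SPEAKER.fullmatch(name)
    if match is None:
        return None
    number, month, day = (int(part) for part in match.groups())
    return number, month, day


print(parse_anonymous_speaker("sp1_0209"))
print(parse_anonymous_speaker("sp2_0212"))
```

Because anonymous names are not linked across files, a returned tuple should only be compared with other tuples from the same broadcast.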
The UWB transcribers added markup to indicate audible non-speech noises made by the speaker -- note that these differ from spoken words by virtue of an initial open-curly-bracket character:

    {breath  {lipsmack  {laugh

There is also a special markup for "hesitation sounds" (i.e. "filled pauses"), commonly found in spontaneous speech -- such sounds are marked with an initial "%" character and one of three standard "spellings":

    %er  %mm  %um

Because signal interruptions sometimes obliterated portions of words, the transcribers added markup to indicate word fragments, where either the initial or the final part of the word is missing from the recording; for example:

    orovi  bo

In some brief regions of signal, a speaker's voice is audible, but the words are too indistinct to be transcribed reliably; these portions are marked with double parentheses "(( ))", and in some cases, they contain the transcriber's best guess as to what was said.

The transcribers also added markup to indicate various kinds of audible background noise that occurred during speech -- note that these differ from spoken words by virtue of being all UPPER-CASE, with an initial open-square-bracket character:

    [PAPER_RUSTLE  [NOISE  [DOOR_SLAM  [REMOTE_ENGINE  [CHAIR_SQUEAK

Often, these background noises would span two or more consecutive spoken words, and markup was provided to identify the extent of speech affected by noise. In order to simplify the use of these annotations, the LDC's conversion of the text to UTF format used the following approach:

 - Each spoken word token (and each "curly-brace" non-speech token) is presented alone on one line in the transcript file; punctuation marks are also separated from word tokens and are presented alone on one line.

 - If a spoken word token is affected by a background noise, the noise markup is included on the same line with the word token.

 - If a background noise occurs during a pause between words, it is presented alone on one line.
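
The token conventions above lend themselves to simple prefix tests. The Python sketch below labels each whitespace-separated token on a transcript line by its first character. The function names and label strings are invented for illustration, the punctuation set is limited to the three marks mentioned for this corpus, and double parentheses and word-fragment markup are deliberately not handled. The sample lines are made up, not taken from the corpus.

```python
# Punctuation marks mentioned in this document; other marks are assumed absent.
PUNCTUATION = {".", ",", "?"}


def classify_token(token):
    """Label one token by its leading character."""
    if token.startswith("{"):
        return "speaker_noise"      # e.g. {breath, {lipsmack, {laugh
    if token.startswith("["):
        return "background_noise"   # e.g. [NOISE, [PAPER_RUSTLE
    if token.startswith("%"):
        return "hesitation"         # %er, %mm, %um
    if token in PUNCTUATION:
        return "punctuation"
    return "word"


def classify_line(line):
    """Return (token, label) pairs for one line of a transcript file."""
    stripped = line.strip()
    if not stripped:
        return []
    if stripped.startswith("<"):
        # A whole SGML tag line is kept as a single item.
        return [(stripped, "sgml_tag")]
    return [(token, classify_token(token)) for token in stripped.split()]


for sample in [" {breath", " slovo [NOISE", " %um", " ?"]:
    print(classify_line(sample))
```

When reading real transcript files, the text would need to be decoded as ISO 8859-2, for instance with open(path, encoding="iso-8859-2").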
Except for the "" notation, SGML tags and spoken words are never placed together on one line. Therefore, each line in a transcript file can easily be identified as containing exactly one of the following items:

 - an SGML tag
 - a noise annotation (with initial curly or square bracket)
 - a punctuation mark
 - a spoken word (possibly with an adjoining square-bracketed noise)
 - double parentheses to indicate an indistinct region of speech

As an additional aid, the lines that contain a spoken word, fragment, punctuation mark or double parentheses, all begin with a single initial space character; SGML tag lines always begin with initial "<" (open angle bracket), and noise annotation lines begin with initial "[" or "{".

3.3 Additional conventions

 - All spoken numeric words have been transcribed orthographically, not with digit characters.

 - If an acronym was spoken as a word (e.g. "NATO") it is written as a word ("NATO"); if it was spoken as a sequence of letters (e.g. "NSF") it is written according to the actual Czech pronunciation, either "N S F" or "EN ES EF".

 - Mispronounced but intelligible words are marked with an initial "+" character; e.g. the word "smluvili" mispronounced as "smlouvili" is written as +smlouvili in the transcript (note: for these mispronounced items, the spelling reflects the incorrect pronunciation that was observed, NOT the standard orthography of the word that was intended).

4.0 LEXICON FILES

Two vocabularies were built from words present in the annotated texts. Each vocabulary file has one word entry per line, consisting of two columns separated by a single tab character:

 - Column one is the orthographic form of the word as it appears in the transcript (including an attached "" tag or "+" marker, where these occur).
 - Column two is the corresponding phonemic form of the word, in which the phoneme segments are each represented by one or two ascii characters, and are separated by single space characters.

Here are a few sample entries:

    agenti      a g e n tj i
    agenti      a g e nj tj i
    agentura    a g e n t u r a
    agentura    a g e n t uu r a
    agenty      a g e n t i

Two distinct lexicon files are provided:

a) The file "vocabulary.tbl" contains all the "intrinsic" phonetic forms for the 25,269 headwords -- that is, the (single or multiple) pronunciations that are possible for each word, regardless of the phrasal context (e.g. vowel length variation, or other alternations: "a k t i v a" vs. "a k t ii v a", "s e d m" vs. "s e d u m", etc).

b) The file "vocab_assim.tbl" contains all the same headwords and pronunciations as "vocabulary.tbl", plus additional pronunciations for many words that would result from particular phrasal contexts (e.g. voicing of final voiceless consonants when the following word begins with a voiced phoneme: "n a d j e s t" vs. "n a d j e z d").

Foreign words (names etc.) were transcribed as they were heard by the human transcribers (so the same foreign word may have two or more distinct phonetic transcriptions if it was present in two or more recordings that were transcribed by different people).

The following tables summarize the distribution of multiple pronunciations in each of the lexicon files:

Number of headwords with ...
this many pronunciations

in vocabulary.tbl:

    23322   one pronunciation
     1831   two pronunciations
       66   three      "
       36   four       "
        4   five       "
        3   six        "
        1   seven      "
        3   eight      "
        1   nine       "
        1   twelve     "
        1   thirteen   "

    25269 headwords, 27429 total entries

in vocab_assim.tbl:

    19660   one pronunciation
     5202   two pronunciations
       55   three      "
      310   four       "
        6   five       "
       21   six        "
        2   seven      "
        9   eight      "
        1   nine       "
        1   twelve     "
        1   fourteen   "
        1   twenty-six "

    25269 headwords, 31772 total entries

Note that the headwords in both lexicon files are rendered in mono-case (lower-case only), whereas the transcription files use conventional rules for capitalizing proper names and the initial word of each sentence.
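
The lexicon layout and the summary figures above can be cross-checked with a few lines of Python. This is a sketch under stated assumptions: the function names are invented, the input is any iterable of text lines (the real files would presumably be opened with the ISO 8859-2 encoding used for the transcripts, which this document does not state explicitly for the lexicon files), and the sample lines are the sample entries quoted in section 4.0.

```python
from collections import Counter, defaultdict


def load_lexicon(lines):
    """Map each headword to its list of pronunciations.

    Each line holds a headword, one tab, and space-separated phoneme symbols.
    """
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        headword, phonemes = line.split("\t", 1)
        lexicon[headword].append(phonemes.split())
    return dict(lexicon)


def pronunciation_histogram(lexicon):
    """Count how many headwords have 1, 2, 3, ... pronunciations."""
    return Counter(len(prons) for prons in lexicon.values())


def totals(histogram):
    """Return (number of headwords, number of entries) for a histogram."""
    headwords = sum(histogram.values())
    entries = sum(n * count for n, count in histogram.items())
    return headwords, entries


SAMPLE = [
    "agenti\ta g e n tj i",
    "agenti\ta g e nj tj i",
    "agentura\ta g e n t u r a",
    "agentura\ta g e n t uu r a",
    "agenty\ta g e n t i",
]

# Figures copied from the vocabulary.tbl summary in this document.
VOCABULARY_TBL = {1: 23322, 2: 1831, 3: 66, 4: 36, 5: 4, 6: 3,
                  7: 1, 8: 3, 9: 1, 12: 1, 13: 1}

sample_hist = pronunciation_histogram(load_lexicon(SAMPLE))
print(totals(sample_hist))     # (3, 5)
print(totals(VOCABULARY_TBL))  # (25269, 27429)
```

The second result matches the totals stated for vocabulary.tbl: each headword with n pronunciations contributes n entries.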