Top-Level Documentation for HUB-4 Mandarin Transcript Data
----------------------------------------------------------

This distribution contains files of completed transcription data for
the training set of the 1997 DARPA HUB-4 Mandarin Benchmark, together
with supporting files and documentation for using standard SGML
utilities to access the transcript data.

The transcript files themselves have been created in a manner that
directly supports the use of a standard SGML parsing utility to
extract the transcription text and associated information in each
file.  The supporting files provide SGML Document Type Declarations
(DTDs) that fully describe the structure and content of the
transcript files; the DTDs are provided as input to an SGML parser
along with the transcript files in order to verify the format of each
transcript file and extract its contents.

A separate documentation file, "hub4sgml.doc", explains how to obtain
and use an SGML parser utility, and what can be done with each of the
DTD files.  Detailed information about the transcription conventions,
word segmentation principles and SGML structure of the transcripts is
provided as HTML documents.

The remainder of the current documentation file (h4m_tran.doc)
provides overview information about the transcripts and how they
relate to the associated acoustic data (which have been published
separately on CD-ROM).

DATA SOURCES FOR THE 1997 HUB-4 MANDARIN COLLECTION
---------------------------------------------------

This collection will ultimately include materials that have been
recorded from broadcasts by the following sources:

    Voice of America (VOA) -- United States Information Agency radio
    CCTV -- People's Republic of China television
    KAZN -- commercial radio based in Los Angeles, CA
Of these three sources, the first two comprise the bulk of the
collection, and will be represented in roughly equal amounts; only a
relatively small sample of KAZN recordings will be included, owing to
the relatively high proportion of unusable material (commercials,
local traffic reports loaded with California place names, etc).

The following table indicates the relative amounts of data from each
source, in terms of number of files, number of hours of broadcast
recordings (i.e. in the speech files published on CD-ROM), and number
of hours of actual transcribed speech data (i.e. time bounded by turn
tags in the transcripts).

               No. of     Hours        Hours
    Source     Files      Recorded     Transcribed
    =============================================
    CCTV         25         13.0         11.7
    KAZN          9          4.5          2.7
    VOA          24         24.0         15.8
    ---------------------------------------------
    Total:       58         41.5         30.2

(The apparent low yield of VOA recordings was due to the presence of
some non-news program segments in the acoustic files, as well as the
presence of repeated news content across files recorded on the same
date; when a given speaker read the same news text more than once in
the recorded collection, only one of those readings was transcribed.)

ORGANIZATION OF DATA FILES
--------------------------

The names of individual transcript files indicate the language,
source and date of the broadcast, as follows:

    1st character:  language (i.e. "m" for Mandarin)
    2nd character:  source ("v" for VOA, "c" for CCTV, "k" for KAZN)
    3rd-7th chars:  date of broadcast (YYMDD; e.g. "97418")
    8th character:  if present, one of "a", "b", "c", "d"

The seven- or eight-character name matches the name of the
corresponding acoustic data file, which the LDC is providing via
CD-ROM.
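The file-ID layout above can be unpacked mechanically.  The sketch
below is a minimal, illustrative Python example (the function name,
dictionary names and returned field names are my own); the "A"-"C"
month-letter codes and the KAZN "0" serial-number convention it
handles are explained in the paragraphs that follow.

```python
# Minimal sketch of unpacking a HUB-4 Mandarin file ID.  Only the
# field layout comes from this documentation; all names here are
# illustrative.  Months October-December are coded "A"-"C", and KAZN
# files carry "0" plus an arbitrary serial number in the date field
# (both conventions are described below in this document).

SOURCES = {"v": "VOA", "c": "CCTV", "k": "KAZN"}
MONTHS = {str(d): d for d in range(1, 10)}
MONTHS.update({"A": 10, "B": 11, "C": 12})

def parse_fileid(fileid):
    """Split a 7- or 8-character file ID into labeled fields."""
    if len(fileid) not in (7, 8) or fileid[0] != "m":
        raise ValueError("not a HUB-4 Mandarin file ID: %r" % fileid)
    info = {"language": "Mandarin",
            "source": SOURCES[fileid[1]],
            "year": int(fileid[2:4]),
            "part": fileid[7] if len(fileid) == 8 else None}
    if fileid[4] == "0":                    # undated KAZN recording
        info["serial"] = fileid[4:7]        # arbitrary, not a date
    else:
        info["month"] = MONTHS[fileid[4]]
        info["day"] = int(fileid[5:7])
    return info
```

For example, "mv97418" names a VOA broadcast from April 18, 1997,
while "mv97418a" would be the first of several files from that date.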
The transcript and speech files are distinguished by the 3-character
"extension" to the file name, as follows:

    fileid.sph : speech file on CD-ROM
    fileid.sgm : transcript file (in SGML format)

The "date of broadcast" field of the file names uses the letters "A",
"B" and "C" to represent the months October, November and December,
respectively; the digits 1-9 represent January through September in
the normal way.

Note that the KAZN files have a "0" for the month field -- this is
because the recordings received by the LDC from KAZN were not labeled
with date of broadcast, and we have not recovered the broadcast date
from the content of the recordings.  For this material, we
arbitrarily assigned a unique 3-digit number to each 30-minute
recording, starting at "001", and used this in place of the "MDD"
portion of the broadcast date field of the file name.  It is likely
that the sequence implied by these arbitrary numbers DOES NOT
correspond to the actual time sequence of the broadcasts.

In several cases, two or more VOA recordings were made on the same
day (or a single large recording was broken up into two or more
smaller files).  In these cases, the first seven characters of the
file names (the language, source and date fields) are identical, and
an eighth character is added to distinguish the individual files.
This eighth character is one of "a", "b", "c" or "d", and represents
the actual broadcast sequence of the files (i.e. the "a" file will
have been broadcast prior to the "b" file, and so on).

CONTENT OF TRANSCRIPT FILES
---------------------------

The content of the transcript files can be categorized into two types
of character data:

    (1) SGML (ASCII-encoded) markup data
    (2) Mandarin (GB-encoded) text data (with some ASCII notations)

These two types of data never occur together on the same line -- that
is, each line of a transcript file contains either SGML markup or
transcription text, but never both.
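Because markup and text never share a line, a transcript can be
partitioned with a simple per-line test.  The sketch below is my own
illustration; the assumption that markup lines begin with "<" is
mine, and the authoritative tag syntax is defined by the DTDs (see
"hub4sgml.doc").

```python
# Hedged sketch: split a transcript into markup lines and text lines,
# relying on the guarantee that the two never share a line.  The
# leading-"<" test for markup is an assumption; the DTDs define the
# actual tag syntax.

def split_lines(lines):
    """Partition lines into (markup, text) lists."""
    markup, text = [], []
    for line in lines:
        if line.lstrip().startswith("<"):
            markup.append(line)
        else:
            text.append(line)
    return markup, text
```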
The SGML markup provides division of the text data into a
hierarchical structure of "sections" (defined on the basis of topic),
"turns" (defined on the basis of change of speaker), and "overlap"
(regions where two people are speaking at once).  For sections, the
markup indicates the type of section (one of: "nontranscribed",
"filler" or "report"); for turns, it indicates the gender and a
unique identifying string for each speaker.  It also establishes the
timing information, in units of seconds, for correlating the
transcription text to the acoustic data.  The HTML document file
"sgmlspec.html" provides more detail about the structure and meaning
of the markup content.

With regard to identification of speakers in the SGML turn tags, we
have sought to identify speakers by name wherever possible; the
speaker's given name is provided in ASCII (pinyin) form as a single
attribute token within the turn tag (e.g. "speaker=Zeng_Yucheng").
Every speaker whose given name was not determinable from the recorded
broadcast was assigned an anonymous but uniquely indexed string, such
as "spkr_21" or "reporter_37".

In applying these anonymous labels to turns within a file, the
transcribers were instructed as follows: make sure that a given label
is not applied to more than one distinct voice, and try as far as
possible to apply the same label every time the same voice is heard.
No attempt was made to correlate the identity of anonymous voices
across files.  Owing to the nature of the task, it is likely that a
single speaker will appear in different files, or at different points
in the same file, and be identified with different labels (this may
affect some cases of named speakers as well).  While it is also
possible that some mistakes have been made in applying the same label
to different speakers, this type of identification error should be
quite rare.
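As an illustration of using the speaker attribute, here is a small
regex-based sketch.  The tag shape it assumes (<turn ...
speaker=NAME ...>) is inferred only from the "speaker=Zeng_Yucheng"
example above; for real processing, use an SGML parser with the
supplied DTDs, which define the authoritative grammar.

```python
import re

# Hedged sketch: collect speaker labels from turn tags in markup
# lines.  The assumed tag shape <turn ... speaker=NAME ...> is an
# inference from the "speaker=Zeng_Yucheng" example; consult the DTDs
# for the real syntax.

TURN_RE = re.compile(r"<turn\b[^>]*\bspeaker=(\w+)", re.IGNORECASE)

def speakers_in(markup_lines):
    """Return the speaker label of each turn tag, in document order."""
    return [m.group(1)
            for line in markup_lines
            for m in TURN_RE.finditer(line)]
```

A list like this makes it easy to see how often each (possibly
anonymous) label recurs within a single file.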
The Mandarin text data consist of 16-bit GB-encoded characters,
together with space, new-line and punctuation characters; the
punctuation consists of only the following: period, comma and
question mark (using the ASCII codes 0x2e, 0x2c and 0x3f,
respectively).  Spaces are used to indicate word segmentation, which
has been done manually by the Mandarin transcribers, in accordance
with principles described in the HTML document files
"ma_segmentation.html" and "ma_principles.html".

In addition to the Mandarin text and word separators, there is a
small set of curly-brace bracketed tokens to indicate non-speech
sounds made by a speaker (e.g. "{laugh}", "{cough}", etc), and a
small set of "token classifier" characters, which immediately precede
a word token and identify that token as falling into one of the
following categories:

    Character   Token Category
    ---------------------------------------------------------------
        %       non-lexeme (e.g. filled pause or hesitation sound)
        ^       proper name (i.e. a person's given name or surname)
        +       mispronounced word (correct orthography of intended
                word is given)

The curly braces and token classifier characters can be interpreted
as SGML markup by the parser utility, when the appropriate DTD file
is used.  An alternative DTD can also be used to parse the
transcripts while leaving these characters intact (unprocessed).
Please refer to "hub4sgml.doc" for further information and examples
of usage.

The transcripts also include one special notation in double square
brackets: "[[NS]]".  This is used to identify a region between two
consecutive time stamps in which there is no speech.  This typically
occurs within a turn when the speaker pauses for a significant
period of time (two seconds or more), during which there is music,
background noise or silence.

Hyphens are used to indicate word fragments; the hyphen may occur
either at the beginning or the end of the fragment.  (A word-initial
hyphen indicates that noise or transmission problems during the
broadcast obscured or eliminated the beginning of the word.)
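For readers who prefer the non-SGML route (i.e. the alternative DTD
that leaves these characters intact), the text-level notations lend
themselves to a simple tokenizer.  This sketch is my own; the
category names in its output are illustrative, while the "{...}",
"%"/"^"/"+" and "[[NS]]" conventions come from the documentation
above.

```python
# Hedged sketch: label each whitespace-separated token of one line of
# transcription text according to the conventions described above.
# The category strings are illustrative, not an official vocabulary.

CLASSIFIERS = {"%": "non-lexeme", "^": "proper-name", "+": "mispronounced"}

def tokenize_line(line):
    """Return (category, token) pairs for one line of text data."""
    out = []
    for tok in line.split():
        if tok == "[[NS]]":
            out.append(("no-speech", tok))
        elif tok.startswith("{") and tok.endswith("}"):
            out.append(("non-speech-sound", tok[1:-1]))
        elif tok[0] in CLASSIFIERS:
            out.append((CLASSIFIERS[tok[0]], tok[1:]))
        else:
            out.append(("word", tok))
    return out
```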
Hyphens are not given any special treatment by the SGML parser, and are passed through to the parser output unmodified. In some cases, the word fragment is rendered as one or more 16-bit GB characters, while in other cases, it appears as one or more ASCII characters representing a pinyin transcription of the fragment.
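Telling the two renderings apart programmatically is straightforward
at the byte level: GB-2312 encodes each Mandarin character as two
bytes with the high bit set, whereas ASCII (pinyin) bytes stay below
0x80.  A minimal sketch, with a function name of my own choosing:

```python
# Hedged sketch: classify a raw word token as GB-encoded Mandarin or
# pure ASCII (e.g. a pinyin-rendered fragment).  GB-2312 uses two
# bytes per character, both with the high bit set; ASCII bytes are
# always below 0x80.

def is_gb_token(raw):
    """True if the token (bytes) contains any high-bit (GB) byte."""
    return any(b >= 0x80 for b in raw)
```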