Top-Level Documentation for HUB-4 Spanish Transcript Data
---------------------------------------------------------

This distribution contains files of completed transcription data for
the training set of the 1997 DARPA HUB-4 Spanish Benchmark, together
with supporting files and documentation for using standard SGML
utilities to access the transcript data.

The transcript files themselves have been created in a manner that
directly supports the use of a standard SGML parsing utility to
extract the transcription text and associated information in each
file.  The supporting files provide SGML Document Type Declarations
(DTDs) that fully describe the structure and content of the
transcript files; the DTDs are provided as input to an SGML parser
along with the transcript files in order to verify the format of each
transcript file and extract its contents.

A separate documentation file, "hub4sgml.doc", explains how to obtain
and use an SGML parser utility, and what can be done with each of the
DTD files.  Detailed information about the transcription conventions,
word segmentation principles and SGML structure of the transcripts is
provided as HTML documents.

The remainder of the current documentation file (h4s_tran.doc)
provides overview information about the transcripts and how they
relate to the associated acoustic data (which have been published
separately on CD-ROM).


DATA SOURCES FOR THE 1997 HUB-4 SPANISH COLLECTION
--------------------------------------------------

This collection will ultimately include materials that have been
recorded from broadcasts by the following sources:

    Voice of America (VOA) -- United States Information Agency radio
    ECO
    UNIVISION

Of these three sources, the latter two originate in Mexico, but will
tend to include speakers from other regions in Latin America; in the
VOA broadcasts, Cuban (or other Caribbean) speakers will dominate.
The following table indicates the relative amounts of data from each
source, in terms of number of files, number of hours of broadcast
recordings (i.e. in the speech files published on CD-ROM), and number
of hours of actual transcribed speech data (i.e. time bounded by turn
tags in the transcripts):

              No. of    Hours       Hours
    Source    Files     Recorded    Transcribed
    =========================================
    ECO         30       17.5         7.8
    UNI         24       12.0         7.3
    VOA         27       15.5        17.6
    -----------------------------------------
    Total:      81       45.0        32.7

(The apparent low yield for these sources was due primarily to the
presence of significant non-news segments in the acoustic files.)


ORGANIZATION OF DATA FILES
--------------------------

The names of individual transcript files indicate the language,
source and date of the broadcast, as follows:

    1st character:  language (i.e. "s" for Spanish)
    2nd character:  source ("v" for VOA, "e" for ECO, "u" for Univision)
    3rd-7th chars:  date of broadcast (YYMDD; e.g. "97418")
    8th character:  if present, one of "a", "b", "c", "d"

The seven- or eight-character name matches the name of the
corresponding acoustic data file, which the LDC is providing via
CD-ROM.  The transcript and speech files are distinguished by the
3-character "extension" to the file name, as follows:

    fileid.sph : speech file on CD-ROM
    fileid.sgm : transcript file (in SGML format)

The "date of broadcast" field of the file names will use the letters
"A", "B" and "C" to represent the months October, November and
December, respectively; the digits 1-9 represent January through
September in the normal way.

In several cases, two or more VOA recordings were made on the same
day (or a single large recording was broken up into two or more
smaller files).  In these cases, the first seven characters of the
file names (the language, source and date fields) are identical, and
an eighth character is added to distinguish the individual files.
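As an illustration of the naming scheme, the Python sketch below
decodes a fileid into its component fields.  The function name and
the returned dictionary are our own conventions, not part of the
distribution; only the field layout and the month-letter codes come
from the description above.

```python
# Sketch: decode a HUB-4 Spanish fileid (e.g. "sv97418" or "sv97A18a")
# into its component fields, following the naming conventions above.

LANGUAGES = {"s": "Spanish"}
SOURCES = {"v": "VOA", "e": "ECO", "u": "Univision"}
# Months 1-9 are January-September; "A", "B", "C" are October-December.
MONTHS = {str(d): d for d in range(1, 10)}
MONTHS.update({"A": 10, "B": 11, "C": 12})

def decode_fileid(fileid):
    """Split a 7- or 8-character fileid into language, source, date
    and (optional) broadcast-sequence letter."""
    if len(fileid) not in (7, 8):
        raise ValueError("fileid must be 7 or 8 characters: %r" % fileid)
    return {
        "language": LANGUAGES[fileid[0]],
        "source":   SOURCES[fileid[1]],
        "year":     1900 + int(fileid[2:4]),
        "month":    MONTHS[fileid[4].upper()],
        "day":      int(fileid[5:7]),
        "sequence": fileid[7] if len(fileid) == 8 else None,
    }
```

For example, decoding "sv97418" yields a Spanish VOA broadcast
recorded on 18 April 1997, with no sequence letter.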
This eighth character will be one of "a", "b", "c" or "d", and
represents the actual broadcast sequence of the files (i.e. the "a"
file will have been broadcast prior to the "b" file, and so on).


CONTENT OF TRANSCRIPT FILES
---------------------------

The content of the transcript files can be categorized into two types
of character data:

    (1) SGML (ASCII-encoded) markup data
    (2) Spanish (ISOLatin1-encoded) text data (with some ASCII notations)

These two types of data never occur together on the same line -- that
is, each line of a transcript file contains either SGML markup or
transcription text, but never both.

The SGML markup divides the text data into a hierarchical structure
of "sections" (defined on the basis of topic), "turns" (defined on
the basis of change of speaker), and "overlap" (regions where two
people are speaking at once).  For sections, the markup indicates the
type of section (one of "nontranscribed", "filler" or "report"); for
turns, it indicates the gender and a unique identifying string for
each speaker.  It also establishes the timing information, in units
of seconds, for correlating the transcription text to the acoustic
data.  The HTML document file "sgmlspec.html" provides more detail
about the structure and meaning of the markup content.

With regard to identification of speakers in the SGML turn tags, we
have sought to identify speakers by name wherever possible; the
speaker's given name is provided in ASCII form as a single attribute
token within the turn tag (e.g. "speaker=Joaquin_del_Olmo").  Every
speaker whose name was not determinable from the recorded broadcast
was assigned an anonymous but uniquely indexed string, such as
"spkr_21" or "reporter_37".  In applying these anonymous labels to
turns within a file, the transcribers were instructed to make sure
that a given label was not applied to more than one distinct voice,
and to apply, as far as possible, the same label every time the same
voice was heard.
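For quick inspection (as opposed to full DTD-based parsing), turn
tags can also be scanned with a regular expression.  The sketch below
assumes tags of the form shown; the tag name "turn" and the attribute
names "speaker", "stime" and "etime" are illustrative guesses only --
the real names are defined by the DTDs and documented in
"sgmlspec.html", which should be consulted before relying on this.

```python
import re

# Sketch: pull speaker and timing information out of turn tags.
# Tag and attribute names here are assumptions, not the documented
# HUB-4 markup; adjust to match the DTDs.
TURN_RE = re.compile(
    r'<turn\s+speaker=(?P<speaker>\S+)\s+'
    r'stime=(?P<stime>[\d.]+)\s+etime=(?P<etime>[\d.]+)>',
    re.IGNORECASE)

def find_turns(sgml_text):
    """Return (speaker, start_seconds, end_seconds) for each turn tag."""
    return [(m.group("speaker"),
             float(m.group("stime")),
             float(m.group("etime")))
            for m in TURN_RE.finditer(sgml_text)]
```

A DTD-driven SGML parser remains the supported access method; this
regex shortcut does no validation and will miss any tag whose
attribute order or names differ from the assumed form.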
No attempt was made to correlate the identity of anonymous voices
across files.  Owing to the nature of the task, it is likely that a
single speaker will appear in different files, or at different points
in the same file, and be identified with different labels (this may
affect some cases of named speakers as well).  While it is also
possible that some mistakes have been made in applying the same label
to different speakers, this type of identification error should be
quite rare.

Speaker information (as provided in the SGML turn tags) includes sex
and dialect classification of the speakers.  The dialect categories
applied to speakers of Spanish are as follows:

    Attribute label    Meaning
    ----------------------------------------------
    Coastal            Caribbean, Lowland
    Interior           Mainland, Highland
    Peninsular         Spain
    Non-native         (but still speaking Spanish)

In addition, for turns in which the speaker did not utter any
Spanish, the dialect attribute is given as:

    Not-Spanish

The Spanish text data consist of 8-bit ISOLatin1-encoded characters,
together with space, new-line and punctuation characters; the
punctuation consists of only the following: period, comma and
question mark (using the ASCII codes 0x2e, 0x2c and 0x3f,
respectively).

In addition to the Spanish text and word separators, there is a small
set of curly-brace bracketed tokens to indicate non-speech sounds
made by a speaker (e.g. "{laugh}", "{cough}", etc.), and a small set
of "token classifier" characters, which immediately precede a word
token and identify that token as falling into one of the following
categories:

    Character    Token Category
    ------------------------------------------------------------------
    %            non-lexeme (e.g. filled pause or hesitation sound)
    ^            proper name (i.e. a person's given name or surname)
    +            mispronounced word (correct orthography of intended
                 word is given)
    _            alphabet letter (an initial or part of an acronym)

The curly braces and token classifier characters (except for the
underscore) can be interpreted as SGML markup by the parser utility,
when the appropriate DTD file is used.
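For users who prefer to handle the text layer without an SGML parser,
the following Python sketch separates the curly-brace non-speech
tokens and token-classifier characters from the plain words.  The
function and variable names are our own; only the bracket notation
and the four classifier characters come from the conventions above.

```python
import re

# Sketch: split a line of transcription text into (classifier, word)
# pairs, dropping curly-brace non-speech tokens such as {laugh}.
NONSPEECH_RE = re.compile(r'\{[^}]*\}')   # e.g. {laugh}, {cough}
CLASSIFIERS = "%^+_"                      # non-lexeme, name, mispronounced, letter

def classify_tokens(line):
    """Return (classifier_or_None, word) for each word token in line."""
    line = NONSPEECH_RE.sub(" ", line)
    pairs = []
    for token in line.split():
        if token[0] in CLASSIFIERS and len(token) > 1:
            pairs.append((token[0], token[1:]))
        else:
            pairs.append((None, token))
    return pairs
```

This is a plain-text alternative to letting the DTD-aware parser
interpret these characters as markup, and it deliberately leaves
punctuation attached to words.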
An alternative DTD can also be used to parse the transcripts while
leaving these characters intact (unprocessed).  Please refer to
"hub4sgml.doc" for further information and examples of usage.

The transcripts also include one special notation in double square
brackets: "[[NS]]".  This is used to identify a region between two
consecutive time stamps in which there is no speech.  This typically
occurs within a Turn when the speaker pauses for a significant period
of time (two seconds or more), during which there is music,
background noise or silence.

Hyphens are used to indicate word fragments; the hyphen may occur
either at the beginning or the end of the fragment.  (A word-initial
hyphen indicates that noise or transmission problems during the
broadcast obscured or eliminated the beginning of the word.)  Hyphens
are not given any special treatment by the SGML parser, and are
passed through to the parser output unmodified.
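Since "[[NS]]" markers and fragment hyphens pass through the parser
unmodified, downstream tools may want to filter them.  A minimal
sketch, assuming the word list has already been extracted from the
parser output (the helper name is our own):

```python
# Sketch: drop "[[NS]]" no-speech markers from a word list and flag
# hyphen-marked word fragments (word-initial or word-final hyphen).

def clean_words(words):
    """Remove [[NS]] markers; return (word, is_fragment) pairs."""
    out = []
    for w in words:
        if w == "[[NS]]":
            continue
        is_fragment = w.startswith("-") or w.endswith("-")
        out.append((w, is_fragment))
    return out
```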