Documentation for the MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE) Corpus
------------------------------------------------------

This distribution contains segment-aligned English translations of the
1997 DARPA HUB4-NE Mandarin transcripts. The original transcripts are
available as a separate LDC publication (LDC98T24), as is the original
audio (LDC98S73). The Mandarin side of these aligned transcripts is
identical to the original transcripts, aside from segmentation. The
original transcript segmentation was suitable for speech recognition,
but does not support machine translation or machine translation
evaluation. The resegmentation is detailed below. Sources for the
original audio are listed and described in the original dataset
releases.

This dataset consists of 376K words of English text and 517K characters
of Mandarin text. The English was produced by translators with no access
to the original audio. The translators were given specific guidelines
for translation, which are included in this distribution. Six percent of
the source data was translated four times in order to support
experiments in translation evaluation. Each translator's version of
these documents is marked with underscores followed by a numeric
identifier, 1-4.

Resegmentation to support translation
-------------------------------------

Humans can perform speech recognition in small chunks of only a dozen or
so words at a time. Thus the original transcripts could be aligned with
the accompanying audio without regard for sentence boundaries, simply by
inserting timestamps into the transcripts, each indicating that a
particular point in the audio signal falls between a pair of words.
Translation, however, is performed at least one whole sentence at a
time. Asking translators to show how points align across the same text
in different languages is an expensive prospect, and could furthermore
bias the translations themselves.
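
The mismatch between timestamp placement and sentence boundaries can be
made concrete with a small sketch. It is an illustration only, not the
procedure or tooling used to build this corpus: the input format (a flat
list mixing words and float timestamps), the function name resegment,
and the punctuation set are all assumptions made for the example. It
keeps a timestamp only when it falls at a sentence boundary and drops
the sentence-internal ones.

```python
# Toy illustration only: the item format below is hypothetical and is NOT
# the markup used in the LDC transcripts.
SENTENCE_END = {".", "?", "!", "。", "？", "！"}


def resegment(items):
    """Group words into segments of at least one sentence.

    `items` mixes words (str) and timestamps in seconds (float). A timestamp
    is kept only if it falls at a sentence boundary, i.e. before any word of
    the current segment or right after sentence-final punctuation; all other
    timestamps are dropped. Returns a list of (start, end, words) tuples;
    `start` or `end` is None when no boundary timestamp is available.
    """
    segments = []
    words = []
    start = None
    for item in items:
        if isinstance(item, float):
            at_boundary = not words or words[-1] in SENTENCE_END
            if not at_boundary:
                continue  # sentence-internal timestamp: disregard it
            if words:
                segments.append((start, item, words))
                words = []
            start = item
        else:
            words.append(item)
    if words:
        # Input ended without a closing boundary timestamp.
        segments.append((start, None, words))
    return segments


items = [0.0, "the", "talks", 1.2, "resumed", "today", ".", 2.5,
         "no", 3.1, "agreement", "yet", ".", 4.0]
for start, end, words in resegment(items):
    print(start, end, " ".join(words))
```

In this sketch a segment can hold more than one sentence when no
timestamp happens to fall at the sentence break, so segments are at
least one sentence long but not necessarily exactly one.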

It is not uncommon to ask translators to translate *segments*, chunks of
at least one sentence in length. To enable alignment of the resulting
translations with the original audio signal, sentence-internal
timestamps indicating alignment to the original audio must be
disregarded. Prior to translation, all such timestamps were removed,
yielding segments much larger than those in the original speech corpus,
but which humans can translate in order and without difficulty.

An Example File (from Spanish corpus of same type)
--------------------------------------------------

A C