GALE Phase 3 Chinese Broadcast Conversation Transcripts - Part 1

Linguistic Data Consortium

Authors: Meghan Glenn, Haejoong Lee, Stephanie Strassel, Kazuaki Maeda

1 Introduction

This release comprises part 1 of the GALE P3 Chinese Broadcast Conversation Transcripts. The transcripts in this release were created by LDC to support the GALE Program sponsored by DARPA. An annotation tool, XTrans, was developed at LDC to support the transcription task. The corresponding audio data is released separately.

2 Data Sources

The broadcast conversation recordings selected for transcription feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Mainland China, Anhui Province; Beijing TV, a national television station in Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; Hubei TV, a regional television station in Mainland China, Hubei Province; and Phoenix TV, a Hong Kong-based satellite television station.

3 Data Profile

Language           Data-type  Genre  Files  Tokens   Time(seconds)
----------------------------------------------------------------
Chinese(Mandarin)  text       BC     217    1556904  402958.86

BC stands for broadcast conversation; BN stands for broadcast news.

There may be overlap between BN and BC content in a particular audio file. Our classification of a source program as BN or BC is meant to reflect the dominant genre.

The token count is based on Chinese characters.

4 Transcription Annotation

4.1 Annotation Process and Guidelines

Data go through one or more layers of annotation, depending on the type of transcription performed (described below). Regardless of the transcription method or data genre, data are either transcribed in-house or outsourced; the choice depends on timeline, volume and available resources. In either case, the following guidelines are applied consistently.
Quick transcripts (QTR): quick, (near-)verbatim, time-aligned transcripts plus speaker ID with minimal additional markup; created by LDC and/or professional transcription agencies. These transcripts do not include SU (sentence-unit) annotations. Data are either fully outsourced for one pass of transcription, with no additional annotation performed, or fully transcribed in-house in one pass.

Quick rich transcripts (QRTR): quick, (near-)verbatim, time-aligned transcripts with minimal markup, plus speaker ID and SU (sentence-unit) annotations; created by LDC and/or professional transcription agencies. First-pass annotation is either outsourced or completed in-house. A second pass (and in some cases a very quick third pass) is required for this type of data, regardless of where the first pass was performed; these later passes are always completed in-house by more senior annotators to ensure quality.

A file name containing QTR indicates that the transcript was produced in the quick transcription style, while a file name containing QRTR indicates the richer, more careful transcription style.

Copies of the transcription guidelines are included in the docs directory.

4.2 Annotation Tool

Transcription annotation is done with XTrans, a next-generation multi-platform, multilingual, multi-channel transcription tool developed by LDC to support manual transcription and annotation of audio recordings. Designed with input from experienced human transcribers working with real-world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks, including (virtual) segmentation of audio into smaller units such as turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript such as topics.
The tool can be downloaded free of charge from the following link: https://www.ldc.upenn.edu/language-resources/tools/xtrans

5 Data format

5.1 Transcription file name conventions

Transcription files are named as follows.

__ARB__
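
As noted in section 4.1, the transcription style can be read from the QTR or QRTR marker in a file name. The following Python sketch shows one way this check might be done. It relies only on the presence of that marker and assumes nothing else about the naming pattern; the function name `transcription_style` and the sample file names are hypothetical illustrations, not part of this release.

```python
import re
from typing import Optional

# Match QRTR or QTR as a standalone marker, case-insensitively.
# The lookarounds keep the marker from matching inside a longer run of letters.
_STYLE_RE = re.compile(r"(?<![A-Za-z])(QRTR|QTR)(?![A-Za-z])", re.IGNORECASE)


def transcription_style(filename: str) -> Optional[str]:
    """Return "quick rich" for QRTR, "quick" for QTR, or None if no marker is found."""
    match = _STYLE_RE.search(filename)
    if match is None:
        return None
    return "quick rich" if match.group(1).upper() == "QRTR" else "quick"


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the function.
    for name in ("SOURCE_PROGRAM_20070101.qrtr.tdf",
                 "SOURCE_PROGRAM_20070101.qtr.tdf",
                 "notes.txt"):
        print(name, "->", transcription_style(name))
```

The check is kept deliberately loose so that it does not depend on where in the name the marker appears.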