Title: VAST Chinese Transcription
Authors: Jennifer Tracey, Stephanie Strassel, Neil Kuster

1. Introduction

This corpus consists of transcripts and audio for Mandarin Chinese data harvested from the Internet for the Video Annotation for Speech Technologies (VAST) project. The aim of the VAST project was to collect and annotate data in several languages to support the development of speech technologies such as speech activity detection, language identification, speaker identification, and speech recognition. The collection was designed to ensure that the audio covers a wide range of speakers, communication domains, noise environments, and data sources. The data included in this corpus comprises the subset of files selected for transcription from the larger pool of Chinese files collected during the project.

2. Data Profile

The data included in this corpus consists of approximately 29 hours of audio. There are 566 audio files with an average duration of 186 seconds. A list of files and their durations (filelist.txt) can be found in the docs/ directory of this corpus.

3. Directory Structure and Data Formats

data/ - directory containing data files
  transcript/ - subdirectory containing transcription files
  audio/ - subdirectory containing audio files
docs/ - directory containing documentation
  README.txt - this file
  filelist.txt - list of audio files and durations
  checksums.txt - list of audio files and their checksums
  Cmn_CTS_transcription_guidelines_V1.5.pdf - transcription guidelines

All audio files are in .flac format. Transcription files are in .tdf format, a standard tab delimited format with the following column headings:

file - name of the source file annotated/transcribed
channel - all audio is single channel, so channel is always 0
start - start time of segment
end - end time of segment
speaker - speakerID is speaker1, speaker2, etc.
speakerType - gender label associated with speakerID: male, female, child or unknown
speakerDialect - dialect label associated with speakerID (not part of transcription specification, so labels may not be reliable)
transcript - contains transcript text
section - not used (always "0")
turn - not used (always "0")
segment - sequential numbering of segments in each file
sectionType - not used (always "conversational")
suType - not part of transcription specification, so labels ("statement", "question", or "incomplete") may not be reliable

4. Source Data

All audio transcribed is extracted from amateur video content harvested from the web. Videos were selected for harvesting by "data scouts" who were native speakers of Mandarin Chinese. The scouts searched for video content in their language that met the following criteria:

- contains speech in Mandarin
- multi-party, informal speech preferred over monologues, telephone-style dialogues, or interviews (but the less-preferred styles were allowed, and should be expected to occur in the transcribed data)
- topic is unrestricted, with variety in topics preferred

Scouting was performed using a custom user interface called VScout. The VScout toolkit is a Firefox add-on consisting of an annotation form displayed on the left side of the browser window. Annotators use the browser in the usual way to search for and view videos that are suitable for inclusion in the corpus. Once they find a video, they answer questions in the annotation form to provide information about the video, including URL, language, number of speakers, sound conditions, and speaker overlap. The URL recorded in the database via the annotation form is then used to harvest the video and extract the audio for transcription.

5. Transcription

5.1 Process and Guidelines

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC.
A targeted second pass was used to check for frequent or egregious errors, correct use of transcription conventions, and to add marking for proper names. An automated quality control pass was used to check for adherence to transcript conventions (no unexpected symbols used), consistency of formatting, and structural integrity in the .tdf files.

The transcription was performed in accordance with a Quick-Rich Transcription style, using guidelines developed for use on telephone speech in another program. The annotators were told to ignore references in the guidelines to 2-channel audio, as all audio in this corpus was single-channel. Notation for proper names (^name) was not part of the transcription guidelines but was added in the second pass described above.

5.2 Transcription Tool

The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. XTrans is available from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.

6. Contact Information

If you have questions about this data release, please contact the following personnel at LDC.

Jennifer Tracey - VAST Project Manager
Stephanie Strassel - VAST PI
Christopher Caruso - VAST Technical Lead

------
README created by Jennifer Tracey on April 5, 2018
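
As an illustration of the .tdf layout described in Section 3, the following is a minimal Python sketch for reading segment rows. It is not part of the corpus tooling. It rests on several assumptions that this README does not confirm: that the thirteen columns appear in the order listed in Section 3, that start and end are decimal seconds, that a header row (if present) begins with "file" or "file;", and that lines beginning with ";;" are comments. The names parse_tdf_lines and total_speech_seconds are hypothetical helpers introduced only for this example.

```python
from typing import Dict, Iterable, Iterator, List

# Column order as listed in Section 3 (assumed to match the file layout).
TDF_COLUMNS: List[str] = [
    "file", "channel", "start", "end", "speaker", "speakerType",
    "speakerDialect", "transcript", "section", "turn", "segment",
    "sectionType", "suType",
]


def parse_tdf_lines(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per segment row of a .tdf file."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Assumption: blank lines and ';;' comment lines carry no segment data.
        if not line.strip() or line.startswith(";;"):
            continue
        fields = line.split("\t")
        # Assumption: a header row, if present, starts with "file" or "file;<type>".
        if fields[0] == "file" or fields[0].startswith("file;"):
            continue
        if len(fields) != len(TDF_COLUMNS):
            raise ValueError(
                f"expected {len(TDF_COLUMNS)} fields, got {len(fields)}"
            )
        row: Dict[str, object] = dict(zip(TDF_COLUMNS, fields))
        # Assumption: start and end times are decimal seconds.
        row["start"] = float(fields[2])
        row["end"] = float(fields[3])
        yield row


def total_speech_seconds(rows: Iterable[Dict[str, object]]) -> float:
    """Sum segment durations (end - start) over all rows."""
    return sum(float(r["end"]) - float(r["start"]) for r in rows)
```

A typical use would be to open a file under data/transcript/ as text (UTF-8 encoding is assumed here), pass the file object to parse_tdf_lines, and filter or aggregate the resulting rows. Because the README notes that the speakerDialect and suType labels may not be reliable, downstream code should probably not depend on them.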