HUB5 Mandarin Telephone Speech and Transcripts Second Edition

Item Name:	HUB5 Mandarin Telephone Speech and Transcripts Second Edition
Author(s):	Linguistic Data Consortium
LDC Catalog No.:	LDC2018S18
ISBN:	1-58563-867-6
ISLRN:	299-779-903-540-2
DOI:	https://doi.org/10.35111/4js2-xd38
Release Date:	December 17, 2018
Member Year(s):	2018
DCMI Type(s):	Sound, Text
Sample Type:	ulaw
Sample Rate:	8000
Data Source(s):	telephone conversations
Project(s):	EARS, GALE, Hub5-LVCSR
Application(s):	speech recognition
Language(s):	Mandarin Chinese
Language ID(s):	cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2018S18 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Linguistic Data Consortium. HUB5 Mandarin Telephone Speech and Transcripts Second Edition LDC2018S18. Web Download. Philadelphia: Linguistic Data Consortium, 2018.
Related Works:
isVersionOf
	LDC98S69 HUB5 Mandarin Telephone Speech Corpus
	LDC98T26 HUB5 Mandarin Transcripts
isPartOf
	LDC2025S04 BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio
isPartWith
	LDC96S34 CALLHOME Mandarin Chinese Speech
	LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
	LDC98S69 HUB5 Mandarin Telephone Speech Corpus
	LDC2002S12 2001 HUB5 Mandarin Evaluation
	LDC2018S09 CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
isAnnotationOf
	LDC2025S04 BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio
isOutcomeOf
	LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
	LDC2018S09 CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
hasContinuation
	LDC98S70 HUB5 Spanish Telephone Speech Corpus
	LDC98T27 HUB5 Spanish Transcripts
isSimilarWith
	LDC2002S09 2000 HUB5 English Evaluation Speech
	LDC2002S10 1998 HUB5 English Evaluation
	LDC2002S12 2001 HUB5 Mandarin Evaluation
	LDC2002S13 2001 HUB5 English Evaluation
	LDC2002S22 1997 HUB5 Arabic Evaluation
	LDC2002S23 1997 HUB5 English Evaluation
	LDC2002S24 1997 HUB5 German Evaluation
	LDC2002S25 1997 HUB5 Spanish Evaluation
	LDC2002T39 1997 HUB5 Arabic Transcripts
	LDC2002T43 2000 HUB5 English Evaluation Transcripts
	LDC2003T01 2001 HUB5 Mandarin Transcripts
	LDC2003T02 1998 HUB5 English Transcripts
	LDC2003T03 1997 HUB5 German Transcripts
	LDC2003T04 1997 HUB5 Spanish Transcripts

Introduction

HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC in two data sets, HUB5 Mandarin Telephone Speech Corpus (LDC98S69) and HUB5 Mandarin Transcripts (LDC98T26). This second edition merges the speech and transcript releases, updates the audio format and adds Pinyin transcripts, forced alignment and updated documentation and metadata.

Data

This release consists of (1) approximately 19 hours of Mandarin speech from 42 unscripted telephone conversations between native speakers of Mandarin from CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55), which has also been released in a second, updated edition (LDC2018S09) and (2) associated transcripts of contiguous 5-30 minute segments from those telephone conversations.

Audio data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations lasted up to 30 minutes.

The audio data was recorded as 8 kHz, u-law encoded stereo SPH files, with one end of the phone call on each channel. In this release, the files were converted to WAV format, and information from the original SPH headers is included with the corpus. SPH files are not included in this second edition.
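As an illustration of the channel layout and encoding described above, the following Python sketch decodes interleaved two-channel u-law bytes into signed 16-bit linear samples and separates the two ends of a call. It is a minimal sketch, not the procedure used to build the corpus: the function names are hypothetical, the decoding follows the standard G.711 u-law expansion, and the input is assumed to be the raw sample payload (already extracted from the WAV container) with the two channels interleaved byte by byte. Which channel carries which speaker is not assumed here.

```python
def ulaw_to_linear(code: int) -> int:
    """Decode one 8-bit G.711 u-law code to a signed 16-bit linear sample."""
    code = ~code & 0xFF              # u-law codes are stored bit-inverted
    sign = code & 0x80               # top bit: sign
    exponent = (code >> 4) & 0x07    # next three bits: segment number
    mantissa = code & 0x0F           # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_and_decode(interleaved: bytes) -> tuple[list[int], list[int]]:
    """Split interleaved stereo u-law bytes into two decoded channels.

    Assumes one byte per sample and the order A, B, A, B, ...
    """
    channel_a = [ulaw_to_linear(b) for b in interleaved[0::2]]
    channel_b = [ulaw_to_linear(b) for b in interleaved[1::2]]
    return channel_a, channel_b


if __name__ == "__main__":
    # Two synthetic stereo frames: silence on one side, extremes on the other.
    frames = bytes([0xFF, 0x00, 0xFF, 0x80])
    side_a, side_b = split_and_decode(frames)
    print(side_a, side_b)
```

Keeping the two sides in separate sample lists mirrors the one-speaker-per-channel layout, so each side of a conversation can be processed without interference from the other.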

Completed calls passed through two human audits. The first audit verified that the participants spoke the target language and checked the quality of the recordings. The second audit was conducted by a native speaker familiar with Mainland and Taiwan Mandarin dialects, who classified each conversation under one of those two categories. Audit information is available in the corpus documentation.

Transcripts were created manually by native Mandarin speakers in the GB2312 encoding. This release adds Pinyin versions of the transcripts in UTF-8 and includes the original transcripts converted to UTF-8. For forced alignment, the audio files were converted to linear-PCM encoding, and the speaker channels were split into separate files to avoid overlapping speech. The alignments are provided as tab-separated files and as TextGrid files, both in UTF-8.
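For readers who still work with first-edition transcripts, the GB2312-to-UTF-8 conversion mentioned above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the conversion used for this release: the function name is hypothetical, and the input is assumed to be strict GB2312. Text that uses characters outside GB2312 would raise a UnicodeDecodeError and may need a broader codec.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Re-encode GB2312 bytes as UTF-8 (hypothetical helper)."""
    text = raw.decode("gb2312")    # fails loudly on bytes that are not valid GB2312
    return text.encode("utf-8")


if __name__ == "__main__":
    # Build a small GB2312 sample in memory instead of reading a corpus file.
    legacy = "你好".encode("gb2312")
    converted = gb2312_to_utf8(legacy)
    print(converted.decode("utf-8"))
```

Decoding strictly, rather than with an error handler that silently drops bytes, makes any encoding mismatch visible before it can corrupt a transcript.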
