1997 Mandarin Broadcast News Transcripts (HUB4-NE)

Item Name:	1997 Mandarin Broadcast News Transcripts (HUB4-NE)
Author(s):	Shudong Huang, Jing Liu, Xuling Wu, Lei Wu, Yongmin Yan, Zhoakai Qin
LDC Catalog No.:	LDC98T24
ISBN:	1-58563-126-4
ISLRN:	915-625-485-899-5
DOI:	https://doi.org/10.35111/qrcj-k950
Member Year(s):	1998
DCMI Type(s):	Text
Data Source(s):	broadcast news
Project(s):	Hub4, GALE, EARS
Application(s):	speech recognition
Language(s):	Mandarin Chinese
Language ID(s):	cmn
License(s):	1997 Mandarin Broadcast News Agreement
Online Documentation:	LDC98T24 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Huang, Shudong, et al. 1997 Mandarin Broadcast News Transcripts (HUB4-NE) LDC98T24. Web Download. Philadelphia: Linguistic Data Consortium, 1998.
Related Works:
isAnnotationOf:	LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
hasAnnotation:	LDC2015S05 Mandarin Chinese Phonetic Segmentation and Tone
hasOutcome:	LDC2007T19 MITRE 1997 Mandarin Broadcast News Speech Translations (HUB-4NE)
hasContinuation:	LDC98T28 1997 English Broadcast News Transcripts (HUB4); LDC98T29 1997 Spanish Broadcast News Transcripts (HUB4-NE); LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material; LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
isSimilarWith:	LDC97T22 1996 English Broadcast News Transcripts (HUB4); LDC2000S86 1998 HUB4 Broadcast News Evaluation English Test Material; LDC2000S88 1999 HUB4 Broadcast News Evaluation English Test Material

Introduction

This collection consists of transcripts for 30 hours of Mandarin Chinese broadcast news recordings from three sources: Voice of America (VOA), China Central TV (CCTV), and KAZN-AM, a commercial radio station based in Los Angeles, CA.

Of these three sources, the first two make up the bulk of the collection and are represented in roughly equal amounts. Only a small sample of KAZN-AM recordings is included, owing to the high proportion of unusable material in that source (e.g., commercials, local traffic reports).

Corresponding audio files are released as 1997 Mandarin Broadcast News Speech (HUB4-NE) LDC98S73.

Data

The transcripts were created by native speakers of Mandarin working at LDC. They are in GB-encoded form with SGML tags to identify story boundaries, speaker turn boundaries and phrasal pauses. The tags include time stamps to align the text with the speech data. Word segmentation (white-space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish HUB4 collections.
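As a rough illustration of working with this kind of markup, the sketch below decodes GB-encoded bytes and pulls speaker turns, their time stamps, and the whitespace-segmented words out of SGML-style tags. The tag and attribute names (`section`, `turn`, `time`, `speaker`, `startTime`, `endTime`, `sec`) and the sample text are assumptions made for the example; the DTD shipped with the corpus is the authoritative description of the markup. The `gb2312` codec name is Python's; the exact GB variant of the files should be checked against the documentation.

```python
import re

# Hypothetical sample: tag names, attribute names, and text are assumed for
# illustration and are not copied from the corpus or its DTD.
SAMPLE = (
    '<section type="report" startTime="0.000" endTime="12.500">\n'
    '<turn speaker="anchor1" startTime="0.000" endTime="12.500">\n'
    '各位 听众 大家 好\n'
    '<time sec="5.250">\n'
    '欢迎 收听 新闻\n'
    '</turn>\n'
    '</section>\n'
).encode("gb2312")

TURN_RE = re.compile(r'<turn\s+([^>]*)>(.*?)</turn>', re.S)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"?([^"\s>]+)"?')
TAG_RE = re.compile(r'<[^>]+>')


def parse_turns(raw: bytes):
    """Decode GB-encoded bytes and return one dict per speaker turn."""
    text = raw.decode("gb2312", errors="replace")
    turns = []
    for match in TURN_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        # Drop any inline tags (e.g. pause markers), then split on the
        # whitespace that marks word boundaries.
        words = TAG_RE.sub(" ", match.group(2)).split()
        turns.append({
            "speaker": attrs.get("speaker"),
            "start": float(attrs["startTime"]),
            "end": float(attrs["endTime"]),
            "words": words,
        })
    return turns


turns = parse_turns(SAMPLE)
for turn in turns:
    print(turn["speaker"], turn["start"], turn["end"], len(turn["words"]))
```

A regular-expression pass is used here only because the sample is tiny and flat; for real files, an SGML-aware parser driven by the supplied DTD would be the more robust choice.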

Updates

There are no updates at this time.

Additional Licensing Instructions

This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.