1997 Mandarin Broadcast News Transcripts (HUB4-NE)


Item Name: 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
Authors: Shudong Huang, Jing Liu, Xuling Wu, Lei Wu, Yongmin Yan, and Zhoakai Qin
LDC Catalog No.: LDC98T24
ISBN: 1-58563-126-4
Data Type: text
Data Source(s): broadcast news
Project(s): EARS, GALE, Hub4
Application(s): speech recognition
Language(s): Mandarin Chinese
Distribution: Web Download
Member fee: $0 for 1998 members
Non-member Fee: N/A (Members Only)
Reduced-License Fee: N/A
Extra-Copy Fee: N/A
Member License: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Shudong Huang, et al. 1998. 1997 Mandarin Broadcast News Transcripts (HUB4-NE). Linguistic Data Consortium, Philadelphia.

Introduction

This collection consists of transcripts of 30 hours of Mandarin Chinese broadcast news recordings from the following sources: Voice of America (VOA), China Central TV (CCTV) and KAZN-AM, a commercial radio station based in Los Angeles, CA.

Of these three sources, the first two make up the bulk of the collection and are represented in roughly equal amounts. Only a small sample of KAZN-AM recordings is included, owing to the high proportion of unusable material in that source (e.g., commercials, local traffic reports).

Corresponding audio files are released as 1997 Mandarin Broadcast News Speech (HUB4-NE) LDC98S73.

Data

The transcripts were created by native speakers of Mandarin working at LDC. They are in GB-encoded form with SGML tags to identify story boundaries, speaker turn boundaries and phrasal pauses. The tags include time stamps to align the text with the speech data. Word segmentation (white-space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish HUB4 collections.
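As a rough illustration of how a transcript in this format might be read, the Python sketch below decodes GB-encoded bytes, separates SGML tag lines from text lines, and splits the text on white space to recover the word segmentation. The sample text, the tag names (section, turn, time) and the attribute names (startTime, endTime, sec, speaker) are assumptions made for illustration only, as is the function name parse_transcript; the DTD shipped with the corpus is the authority on the actual markup.

```python
import re

# Hypothetical sample: the tag and attribute names here are illustrative
# assumptions, not taken from the corpus DTD.
SAMPLE = (
    '<section type="report" startTime="0.00" endTime="5.20">\n'
    '<turn speaker="spk1" startTime="0.00" endTime="5.20">\n'
    '今天 的 新闻\n'
    '<time sec="2.75">\n'
    '到此 结束\n'
    '</turn>\n'
    '</section>\n'
).encode("gb2312")

# A line consisting of a single opening or closing tag.
TAG = re.compile(r"<(/?)(\w+)([^>]*)>")
# name="value" or name=value pairs inside a tag.
ATTR = re.compile(r'(\w+)="?([^"\s>]*)"?')


def parse_transcript(raw, encoding="gb2312"):
    """Return a list of (kind, name, data) events from GB-encoded bytes.

    kind is "open", "close" or "text". For tags, name is the tag name and
    data is a dict of attributes; for text, name is None and data is the
    list of white-space-separated words on that line.
    """
    events = []
    for line in raw.decode(encoding).splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG.fullmatch(line)
        if match:
            closing, name, attrs = match.groups()
            kind = "close" if closing else "open"
            events.append((kind, name, dict(ATTR.findall(attrs))))
        else:
            events.append(("text", None, line.split()))
    return events


events = parse_transcript(SAMPLE)
```

The sketch assumes each tag sits on its own line; if tags and text share a line in the real files, the line-based matching would need to be replaced with a scan over the whole decoded string.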

Updates

There are no updates at this time.

Copyright

Portions © 1997 China Central TV, © 1997 MultiCultural Broadcasting Corporation, © 1997, 1998 Trustees of the University of Pennsylvania

Pricing

The Reduced Licensing Fee for this corpus is US$100.