LDC98S73 - Speech data
LDC98T24 - Transcripts
This collection consists of 30 hours of recorded broadcasts and transcripts that have been drawn from the following sources:
Voice of America (VOA): United States Information Agency Radio
People's Republic of China Television (CCTV)
Commercial radio based in Los Angeles, CA (KAZN-AM)
Of these three sources, the first two make up the bulk of the collection and are represented in roughly equal amounts. Only a small sample of KAZN-AM recordings is included, owing to the high proportion of unusable material (commercials, local traffic reports loaded with California place names, etc.).
The transcripts were created by native speakers of Mandarin working at the LDC; they are in GB-encoded form, with SGML tagging to identify story boundaries, speaker turn boundaries and phrasal pauses; these tags include time stamps to align the text with the speech data. Word segmentation (white-space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish HUB4 collections.
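As a rough illustration of how a GB-encoded, time-stamped, word-segmented transcript might be read, the following Python sketch decodes a short sample and splits each speaker turn into words. The tag name `turn`, the attribute names `speaker`, `startTime` and `endTime`, and the sample text are all assumptions made for this example; the DTD shipped with the corpus defines the actual markup. The `gb2312` codec is likewise assumed to match the "GB-encoded" description.

```python
import re

# Hypothetical sample: the tag and attribute names below are assumptions
# for illustration, not taken from the corpus DTD.
SAMPLE = (
    '<turn speaker="A" startTime="0.00" endTime="3.25">\n'
    '今天 天气 很 好\n'
    '</turn>\n'
).encode("gb2312")

# Matches one speaker turn: group 1 is the attribute string, group 2 the text.
TURN_RE = re.compile(r'<turn\s+([^>]*)>(.*?)</turn>', re.DOTALL)
# Matches name="value" attribute pairs inside a tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_turns(raw: bytes, encoding: str = "gb2312"):
    """Decode GB-encoded bytes and return one dict per speaker turn."""
    text = raw.decode(encoding)
    turns = []
    for attrs, body in TURN_RE.findall(text):
        fields = dict(ATTR_RE.findall(attrs))
        turns.append({
            "speaker": fields.get("speaker"),
            "start": float(fields["startTime"]),
            "end": float(fields["endTime"]),
            # Words are already separated by white space in the transcript.
            "words": body.split(),
        })
    return turns


turns = parse_turns(SAMPLE)
```

Because word segmentation is already marked with white space, a plain `split()` is enough to recover the word list. A full SGML parser driven by the provided DTD would be the more robust choice for real use.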
Updates
There are no updates at this time.
Copyright
Portions © 1997 China Central TV, © 1997 MultiCultural Broadcasting Corporation, © 1997, 1998 Trustees of the University of Pennsylvania
The Reduced Licensing Fee for this corpus is US$100.