2003 NIST Rich Transcription Evaluation Data

Item Name:	2003 NIST Rich Transcription Evaluation Data
Author(s):	Jonathan G. Fiscus, George R. Doddington, Audrey Le, Greg Sanders, Mark Przybocki, David Pallett
LDC Catalog No.:	LDC2007S10
ISBN:	1-58563-446-8
ISLRN:	951-213-258-921-8
DOI:	https://doi.org/10.35111/v8j8-m006
Release Date:	August 17, 2007
Member Year(s):	2007
DCMI Type(s):	Sound
Data Source(s):	telephone speech, broadcast news
Language(s):	English, Egyptian Arabic, Standard Arabic, Mandarin Chinese
Language ID(s):	eng, arz, arb, cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2007S10 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Fiscus, Jonathan G., et al. 2003 NIST Rich Transcription Evaluation Data LDC2007S10. Web Download. Philadelphia: Linguistic Data Consortium, 2007.
Related Works:
isAnnotationOf:	LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect; LDC97S45 CALLHOME Egyptian Arabic Speech; LDC2001S13 Switchboard Cellular Part 1 Audio; LDC2004S13 Fisher English Training Speech Part 1 Speech
hasOutcome:	LDC2011T06 Broadcast News Lattices
isSimilarWith:	LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech; LDC2005S16 RT-04 MDE Training Data Speech; LDC2007S11 2004 Spring NIST Rich Transcription (RT-04S) Development Data; LDC2007S12 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data; LDC2011S06 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set

Introduction

2003 NIST Rich Transcription Evaluation Data contains the test material used in the 2003 Rich Transcription Spring and Fall evaluations administered by the Speech Group of the National Institute of Standards and Technology (NIST). The Spring evaluation (RT-03S), conducted in March-April 2003, focused on Speech-To-Text (STT) tasks for broadcast news speech and conversational telephone speech in three languages: English, Mandarin Chinese and Arabic. That evaluation also included one Metadata Extraction (MDE) task: speaker diarization for broadcast news speech and conversational telephone speech in English. The Fall evaluation (RT-03F), conducted in October 2003, focused on MDE tasks, including speaker diarization, speaker-attributed STT, SU (sentence/semantic unit) detection and disfluency detection, for broadcast news speech and conversational telephone speech in English. For complete information about the evaluations, see the Rich Transcription Evaluation website.

Data

The broadcast news (BN) datasets were selected from TDT-4 sources collected in February 2001. The evaluation excerpts were transcribed to the nearest story boundary. The English BN dataset is approximately three hours long and is composed of 30-minute excerpts from six different broadcasts. The Mandarin Chinese BN dataset is approximately one hour long and consists of 12-minute excerpts from five different broadcasts. The Arabic BN dataset is also approximately one hour long and contains 30-minute excerpts from two different broadcasts.

The conversational telephone speech (CTS) datasets consist of material from various LDC telephone speech collections. All evaluation excerpts were transcribed to the nearest turn. The English CTS dataset is approximately six hours long and is composed of 5-minute excerpts from 72 different conversations: 36 from the Switchboard Cellular collection and 36 from the Fisher collection. The Mandarin Chinese CTS dataset is approximately one hour long and consists of 5-minute excerpts from 12 different conversations from the CallFriend Mandarin Chinese data. The Arabic CTS dataset is also approximately one hour long and contains 5-minute excerpts from 12 different conversations from the CallHome Egyptian Arabic data.
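The excerpt counts and lengths quoted above can be cross-checked against the stated dataset durations. The following is a minimal sketch that only tallies the nominal figures from this section; the names DATASETS and nominal_hours are illustrative, and the real durations will differ slightly because excerpts were transcribed to the nearest story boundary or turn.

```python
# (number of excerpts, nominal excerpt length in minutes), as stated in this section
DATASETS = {
    ("BN", "English"): (6, 30),
    ("BN", "Mandarin Chinese"): (5, 12),
    ("BN", "Arabic"): (2, 30),
    ("CTS", "English"): (72, 5),
    ("CTS", "Mandarin Chinese"): (12, 5),
    ("CTS", "Arabic"): (12, 5),
}


def nominal_hours(excerpts: int, minutes_each: int) -> float:
    """Nominal duration in hours for a set of equal-length excerpts."""
    return excerpts * minutes_each / 60


# Nominal hours per dataset and for the whole evaluation set
HOURS = {key: nominal_hours(*value) for key, value in DATASETS.items()}
TOTAL_HOURS = sum(HOURS.values())

if __name__ == "__main__":
    for (source, language), hours in HOURS.items():
        print(f"{source:3s} {language:17s} {hours:4.1f} h")
    print(f"Total nominal duration: {TOTAL_HOURS:.1f} h")
```

The nominal figures agree with the approximate durations given above and add up to about 13 hours of evaluation audio.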

No manual (human-annotated) segmentations were provided. Sites were required to generate their own segmentations automatically.

Unlike the BN audio files, for which the full broadcasts were provided, the CTS audio files contain only the evaluation excerpts. Each CTS audio excerpt is a SPHERE-headered, two-channel, interleaved, 8-bit mu-law file.
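To illustrate what a SPHERE-headered, two-channel interleaved 8-bit mu-law file looks like to a program, here is a minimal Python sketch that parses the text header, splits the interleaved bytes into two channels, and expands each byte to a linear sample using the standard G.711 mu-law expansion. It is not part of the corpus documentation. The function names are hypothetical, and the sketch assumes a plain NIST_1A header whose channel_count and sample_coding fields are present and whose audio data are not additionally compressed; check the headers of the actual files before relying on it.

```python
from typing import Dict, List, Tuple, Union

FieldValue = Union[int, float, str]


def parse_sphere_header(raw: bytes) -> Tuple[Dict[str, FieldValue], int]:
    """Parse a NIST SPHERE header; return (fields, header size in bytes)."""
    first, second, _ = raw.split(b"\n", 2)
    if first != b"NIST_1A":
        raise ValueError("not a SPHERE file: missing NIST_1A magic line")
    header_size = int(second.strip())  # second line holds the header length
    fields: Dict[str, FieldValue] = {}
    text = raw[:header_size].decode("ascii", errors="replace")
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # field name, type flag, value
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # string fields such as "-s4 ulaw"
            fields[name] = value
    return fields, header_size


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample (G.711)."""
    u = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def read_two_channel_ulaw(raw: bytes) -> Tuple[Dict[str, FieldValue], List[int], List[int]]:
    """Return (header fields, channel 1 samples, channel 2 samples)."""
    fields, header_size = parse_sphere_header(raw)
    # Assumption: these header field names and values match the files at hand.
    if fields.get("channel_count") != 2:
        raise ValueError("expected a two-channel file")
    if fields.get("sample_coding") != "ulaw":
        raise ValueError("expected uncompressed mu-law samples")
    payload = raw[header_size:]
    # Interleaved layout: one byte for channel 1, then one byte for channel 2.
    channel_1 = [ulaw_to_linear(b) for b in payload[0::2]]
    channel_2 = [ulaw_to_linear(b) for b in payload[1::2]]
    return fields, channel_1, channel_2
```

To use the sketch, read one excerpt file in binary mode and pass its bytes to read_two_channel_ulaw; each returned list then holds the samples for one side of the conversation.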

Samples

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Copyright

Portions © 2001 American Broadcasting Company, © 2001 Cable News Network, LP, LLLP, © 2001 China Broadcasting System (Taiwan), © 2001 China Central TV, © 2001 China National Radio, © 2001 China Television System (Taiwan), © 2001 National Broadcasting Company, © 2001 Nile TV, © 2001 Public Radio International, © 1996-2005, 2007 Trustees of the University of Pennsylvania

