2006 NIST Spoken Term Detection Evaluation Set


Introduction

2006 NIST Spoken Term Detection Evaluation Set, Linguistic Data Consortium (LDC) catalog number LDC2011S03 and ISBN 1-58563-584-7, was compiled by researchers at NIST (National Institute of Standards and Technology). It contains Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation. The STD initiative is designed to facilitate research and development of technology for retrieving information from archives of speech data. Its goals are to explore promising new ideas in spoken term detection, develop advanced technology incorporating these ideas, measure the performance of this technology and establish a community for the exchange of research results and technical insights.

The 2006 STD task was to find all of the occurrences of a specified "term" (a sequence of one or more words) in a given corpus of speech data. The evaluation was intended to develop technology for rapidly searching very large quantities of audio data. Although the evaluation used modest amounts of data, it was structured to simulate the very large data situation and to make it possible to extrapolate the speed measurements to much larger data sets. Therefore, systems were implemented in two phases: indexing and searching. In the indexing phase, the system processes the speech data without knowledge of the terms. In the searching phase, the system uses the terms, the index, and optionally the audio to detect term occurrences.
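The two-phase structure can be illustrated with a minimal sketch. It is not the method of any evaluated system: it assumes a hypothetical speech recognizer has already produced time-stamped word hypotheses as (file_id, start_sec, end_sec, word) tuples, and the names build_index and search are illustrative only. The sketch searches the index alone and never returns to the audio.

```python
from collections import defaultdict


def build_index(hypotheses):
    """Indexing phase: runs once, with no knowledge of the search terms.

    hypotheses: iterable of (file_id, start_sec, end_sec, word) tuples,
    assumed to come from a hypothetical recognizer.
    """
    sequences = defaultdict(list)
    for file_id, start, end, word in hypotheses:
        sequences[file_id].append((start, end, word.lower()))

    postings = defaultdict(list)
    for file_id, words in sequences.items():
        words.sort()  # order each file's words by start time
        for pos, (_, _, word) in enumerate(words):
            postings[word].append((file_id, pos))
    return {"sequences": dict(sequences), "postings": dict(postings)}


def search(index, term):
    """Searching phase: uses only the term and the prebuilt index."""
    words = term.lower().split()
    if not words:
        return []
    hits = []
    for file_id, pos in index["postings"].get(words[0], []):
        window = index["sequences"][file_id][pos:pos + len(words)]
        if [w for _, _, w in window] == words:
            # Report the file plus the begin and end time of the occurrence.
            hits.append((file_id, window[0][0], window[-1][1]))
    return hits
```

Because build_index never sees the terms, its cost is paid once per archive, while search touches only the postings for the first word of each term. This separation is what allows search speed to be measured independently of indexing time.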

Data

As with the development corpus, the evaluation corpus consists of roughly 18 hours of speech in three data genres: broadcast news (BNews), conversational telephone speech (CTS) and conference room meetings (CONFMTG). The broadcast news material was collected in 2001 by LDC's broadcast collection system from the following sources: ABC (English), China Broadcasting System (Chinese), China Central TV (Chinese), China National Radio (Chinese), China Television System (Chinese), CNN (English), MSNBC/NBC (English), Nile TV (Arabic), Public Radio International (English) and Voice of America (Arabic, Chinese, English). The CTS data was taken from the Switchboard data sets (e.g., Switchboard-2 Phase 1 LDC98S75, Switchboard-2 Phase 2 LDC99S79) and the Fisher corpora (e.g., Fisher English Training Speech Part 1 LDC2004S13), also collected by LDC. The conference room meeting material consists of goal-oriented, small group roundtable meetings and was collected in 2001, 2004 and 2005 by NIST, the International Computer Science Institute (Berkeley, California), Carnegie Mellon University (Pittsburgh, PA) and Virginia Polytechnic Institute and State University (Blacksburg, VA) as part of the AMI corpus project. This evaluation corpus includes scoring software, which uses the inputs described in the STD Evaluation plan to complete the evaluation of a system.

Each BNews recording is a 1-channel, PCM-encoded, 16 kHz, SPHERE-formatted file. CTS recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted files. The CONFMTG files contain a single recorded channel.
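The channel count, sample rate and encoding of a recording can be checked by reading its SPHERE header. The following is a minimal sketch, not part of the distributed scoring software. It assumes the standard SPHERE layout: an ASCII header that begins with the line NIST_1A, followed by a line giving the header size in bytes, then "name -type value" lines terminated by end_head. The function name parse_sphere_header is illustrative.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file.

    data: the leading bytes of the file (at least the full header).
    Returns a dict mapping header field names to values.
    """
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("not a SPHERE file: missing NIST_1A line")

    # The second 8-byte line holds the total header size (commonly 1024).
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3 or line.startswith(";"):
            continue  # skip blank, comment or malformed lines
        name, type_code, value = parts
        if type_code == "-i":      # integer field
            fields[name] = int(value)
        elif type_code == "-r":    # real-valued field
            fields[name] = float(value)
        else:                      # string field, e.g. -s3
            fields[name] = value
    return fields
```

For a file on disk, passing the first few kilobytes is enough, for example parse_sphere_header(open(path, "rb").read(4096)), where path is whatever recording is being inspected. A BNews file would be expected to report a channel_count of 1 and a sample_rate of 16000, and a CTS file a channel_count of 2 and a sample_rate of 8000.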

The indices directory contains files useful for replicating the evaluation. The System Input Experimental Control Files define the full extent of the excerpts to be processed by the systems; the Scoring Input Experimental Control Files define the extent of scorable material in the test data and contain excerpts defining the evaluable regions of the recordings; and the System Input Term Lists define the terms to be searched by a system.

Please see file.tbl for the directory structure of this publication, as well as a complete list of files.

Please go to data for a listing of data files.

Other documentation files are:

The scoring software and documentation may be found in the software directory.

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2011S03.

Content Copyright

Portions © 2001 American Broadcasting Corporation, Cable News Network, LP, LLP, China Broadcasting System, China Central TV, China National Radio, China Television System, Nile TV, National Broadcasting Company, Public Radio International, © 1998, 1999, 2004, 2011 Trustees of the University of Pennsylvania

Contact: ldc@ldc.upenn.edu
© 2011 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.