NIST Meeting Pilot Corpus Transcripts and Metadata

Item Name: NIST Meeting Pilot Corpus Transcripts and Metadata
Author(s): John S. Garofolo, Martial Michel, Vincent M. Stanford, Elham Tabassi, Jonathan G. Fiscus, Christophe D. Laprun, Nicolas Pratz, Jerome Lard, Stephanie Strassel
LDC Catalog No.: LDC2004T13
ISBN: 1-58563-303-8
ISLRN: 682-718-319-529-5
Release Date: July 12, 2004
Member Year(s): 2004
DCMI Type(s): Text
Data Source(s): meeting speech, microphone conversation
Project(s): NIST Automatic Meeting Recognition
Application(s): automatic content extraction, discourse analysis, information retrieval, language modeling, speaker identification, speaker verification, speech recognition
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2004T13 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Garofolo, John S., et al. NIST Meeting Pilot Corpus Transcripts and Metadata LDC2004T13. Web Download. Philadelphia: Linguistic Data Consortium, 2004.


NIST Meeting Pilot Corpus Transcripts and Metadata was produced by the Linguistic Data Consortium (LDC) under catalog number LDC2004T13 and ISBN 1-58563-303-8.

This corpus contains the full speech transcripts created by the Linguistic Data Consortium for the NIST Automatic Meeting Recognition Project as well as a metadata database with useful information about the meeting forums, topics, participants and recording conditions and equipment. The corresponding speech files are available as the NIST Meeting Pilot Corpus Speech, while the video files will be published later as NIST Meeting Pilot Corpus Video.

For more information, documentation, and updates made after the release of this corpus, please consult the NIST project website for the corpus.


The data for the NIST Automatic Meeting Recognition Project was collected at the NIST Gaithersburg, MD Meeting Data Collection Laboratory and includes 19 meetings (comprising about 15 hours of data) recorded between November 2001 and December 2003.

The full transcriptions included in this release were created using a "quick" transcription procedure. They contain approximately 151K words, with a vocabulary of about 6K unique words. During collection of the pilot corpus, a variety of information about the subjects and the recording setup was recorded manually and stored in a relational database. A fully updated online version of the database is available from the NIST project website.


There are no updates available at this time.
