NIST Meeting Pilot Corpus Transcripts and Metadata
| Item Name: | NIST Meeting Pilot Corpus Transcripts and Metadata |
|---|---|
| Author(s): | John S. Garofolo, Martial Michel, Vincent M. Stanford, Elham Tabassi, Jonathan G. Fiscus, Christophe D. Laprun, Nicolas Pratz, Jerome Lard, Stephanie Strassel |
| LDC Catalog No.: | LDC2004T13 |
| ISBN: | 1-58563-303-8 |
| ISLRN: | 682-718-319-529-5 |
| DOI: | https://doi.org/10.35111/dahz-tn26 |
| Release Date: | July 12, 2004 |
| Member Year(s): | 2004 |
| DCMI Type(s): | Text |
| Data Source(s): | meeting speech, microphone conversation |
| Project(s): | NIST Automatic Meeting Recognition |
| Application(s): | automatic content extraction, discourse analysis, information retrieval, language modeling, speaker identification, speaker verification, speech recognition |
| Language(s): | English |
| Language ID(s): | eng |
| License(s): | LDC User Agreement for Non-Members |
| Online Documentation: | LDC2004T13 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Garofolo, John S., et al. NIST Meeting Pilot Corpus Transcripts and Metadata LDC2004T13. Web Download. Philadelphia: Linguistic Data Consortium, 2004. |
Introduction
NIST Meeting Pilot Corpus Transcripts and Metadata was produced by the Linguistic Data Consortium (LDC) and contains the full speech transcripts created by LDC from about 15 hours of speech as well as a metadata database with useful information about the meeting forums, topics, participants, recording conditions, and equipment. The corresponding speech files are available as the NIST Meeting Pilot Corpus Speech (LDC2004S09). These recordings and transcripts were made for the NIST Automatic Meeting Recognition Project.
Huge efforts are being expended on mining information from newswire, news broadcasts, and conversational speech; however, little has been done to address such applications in the more challenging and equally important meeting domain. Meetings have several important properties not found in other domains: they vary widely in formality and vocabulary, are highly interactive across multiple participants, are captured with distant microphones and overlapping camera views, and require multi-media information integration.
The development of smart meeting room core technologies that can automatically recognize and extract important information from multi-media sensor inputs will provide an invaluable resource for a variety of business, academic, and governmental applications.
Data
The data for the NIST Automatic Meeting Recognition Project was collected at the NIST Meeting Data Collection Laboratory in Gaithersburg, MD, and includes 19 meetings recorded between November 2001 and December 2003.
The Pilot Corpus contains a total of 15:09:24 (hh:mm:ss) of exploitable data. A total of 61 subjects participated in these meetings. The following is a breakdown of participants by origin and sex:
| | # Male Instances | # Unique Males | # Female Instances | # Unique Females | Total Participant Instances | Total Unique Participants |
|---|---|---|---|---|---|---|
| Native | 54 | 30 | 33 | 15 | 87 | 45 |
| Non-Native | 18 | 11 | 10 | 5 | 28 | 16 |
| Total | 72 | 41 | 43 | 20 | 115 | 61 |
The full transcriptions included in this release were created using a "quick" transcription procedure and are stored in TXT format. They contain approximately 151,000 word tokens and roughly 6,000 unique words. A variety of information about the subjects and the recording setup was recorded manually during the collection of the Pilot Corpus and stored in a relational database. An HTML snapshot of the database, taken on June 15, 2004, is included under the "metadata" directory.
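As a rough sanity check on those counts, the sketch below tallies word tokens and unique words across the plain-text transcripts. The directory name `transcripts/` is a placeholder, not the release's actual layout, and the tokenization assumes each file contains only transcript text; real files may carry speaker labels or timing markup that should be stripped first.

```python
# Minimal sketch: count word tokens and unique words over the "quick"
# transcription TXT files. The "transcripts" directory name is a
# placeholder; point it at the transcript location inside the release.
import re
from pathlib import Path


def count_words(transcript_dir: str = "transcripts") -> tuple[int, int]:
    total = 0
    vocab: set[str] = set()
    for txt in sorted(Path(transcript_dir).rglob("*.txt")):
        text = txt.read_text(encoding="utf-8", errors="replace")
        # Lowercase alphabetic tokens (apostrophes kept for contractions).
        # Speaker labels or timestamps, if present, should be filtered out
        # before counting.
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        vocab.update(tokens)
    return total, len(vocab)


if __name__ == "__main__":
    tokens, unique = count_words()
    print(f"{tokens} word tokens, {unique} unique words")
```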
Samples
Please view the following sample:
Updates
There are no updates available at this time.