File: readme.txt

NIST Meeting Pilot Corpus Transcripts and Metadata

Authors: John Garofolo (john.garofolo@nist.gov)
         Martial Michel (martial.michel@nist.gov)
         Vincent Stanford (vincent.stanford@nist.gov)
         Elham Tabassi (elham.tabassi@nist.gov)
         Jonathan Fiscus (jonathan.fiscus@nist.gov)
         Christophe D. Laprun (christophe.laprun@nist.gov)
         Nicolas Pratz (nicolas.pratz@nist.gov)
         Jerome Lard

Project: NIST Automatic Meeting Recognition Project

Project URL: http://www.nist.gov/speech/test_beds/mr_proj/

Applications: automatic content extraction, discourse analysis,
              information retrieval, language modeling, speaker
              identification, speaker verification, speech recognition

Corpus Structure and Data Attributes: See Sections 2 - 4

1.0 Overview:
-------------

This release contains the full speech transcripts created by the
Linguistic Data Consortium for the NIST Pilot Meeting Corpus as well as a
metadata database with useful information about the meeting forums,
topics, participants and recording conditions and equipment. This release
(transcripts and metadata database) totals about 2MB. The recorded video
and audio data for the corpus are available separately from the
Linguistic Data Consortium.

The NIST Pilot Meeting corpus was collected at the NIST Gaithersburg, MD
Meeting Data Collection Laboratory and includes 19 meetings (comprising
about 15 hours of data) recorded between November 2001 and December 2003.

Additional information, documentation, and updates pertaining to this
data are available at:

   http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/

This website contains additional information regarding the collection of
the corpus as well as any updates made after this release. Please
therefore check this website before working with the source data.

Each meeting was recorded using two wireless "personal" mics attached to
each meeting participant: a close-talking noise-canceling boom mic and an
omni-directional lapel mic.
Each meeting was also recorded using 3 omni-directional table mics and a
4-channel directional table mic covering 360 degrees (each channel is
contained in a separate file). In addition, each meeting was recorded
using five Sony EVI-D30 NTSC cameras: one on each of the 4 walls facing
the central conference table and a 5th camera which was used in a
scenario-dependent manner (focused on a chosen individual, presenter,
whiteboard, or conference table). Information about the data collection
setup and microphones is located on the above website.

The data files are organized into files/directories identifying the date
and time each meeting was recorded (see 4.0 for file naming details). The
directory structure in this release is as follows:

   data/
      qtr/
         NIST_20011115-1050
         NIST_20011211-1054
         NIST_20020111-1012
         NIST_20020213-1012
         NIST_20020214-1148
         NIST_20020304-1352
         NIST_20020305-1007
         NIST_20020627-1010
         NIST_20020731-1409
         NIST_20020815-1316
         NIST_20020904-1322
         NIST_20020911-1033
         NIST_20021003-1416
         NIST_20030623-1409
         NIST_20030702-1419
         NIST_20030729-1519
         NIST_20030925-1517
         NIST_20031204-1125
         NIST_20031215-1412
      metadata/

The pilot corpus is described in:

   Garofolo, J.S., Laprun, C.D., Michel, M., Stanford, V.M., Tabassi, E.,
   The NIST Meeting Room Pilot Corpus, Proc. LREC 2004, Lisbon, Portugal.

This document is included under the 'docs' directory as:
LREC04-NIST_MR_Paper.pdf

2.0 Quick Transcriptions:
-------------------------

The full transcriptions included in this release were created using a
"quick" transcription procedure which is documented in the file
"MeetingDataQTRSpec-V1.3.pdf" in the "docs" directory. Note that while
these transcriptions should be useful for system training, they have not
undergone the strenuous quality control procedures which are employed in
producing reference transcriptions for evaluation purposes. A subset of
this data has been re-transcribed for the RT-02 and RT-04S evaluations.
Those transcripts will be made available at a future date under a
separate evaluation dataset release.

3.0 Data Collection Metadata:
-----------------------------

A variety of information about the subjects and recording setup was
manually recorded during the collection of the pilot corpus. This
information was stored in a relational database. An HTML snapshot of the
database, taken on June 15th, 2004, has been included here under the
"metadata" directory. A fully-updated online version of the database is
available from:

   http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/recordings/index.html

4.0 File Naming Conventions:
----------------------------

In this distribution, each meeting was assigned a consistent unique
identifier. The naming convention uses a simple meeting identifier
consisting of the collection site's name and date and time of recording
(in 24-hour format). Each file in this corpus contains a meeting id,
which is used in all file names pertaining to a meeting. Filenames within
this corpus are constructed by concatenating the meeting ID with a
microphone type identifier along with the original site subject id as
follows:

   MEETING_FILE :== <MEETING_ID>_<DEV_ID>_<SUBJECT_ID>.[txt|html]

   where,

      MEETING_ID :== <RECORDING_LOCATION>_<RECORDING_START_TIME>

      where,

         RECORDING_LOCATION :== "NIST"
         RECORDING_START_TIME :== <YYYYMMDD>-<HHMM>

      DEV_ID :== "NONE"
      SUBJECT_ID :== "NONE"

Note: the files for each meeting are stored under a directory named
"NIST_<RECORDING_START_TIME>"

5.0 Content of 'data' directory:
--------------------------------

5.1 Directory 'qtr':
--------------------

Directory content total size: 1.4M

5.1.1 Directory 'NIST_20011115-1050':
-------------------------------------

Meeting duration: 17:52
Directory content total size: 29K

Filename: NIST_20011115-1050_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20011115-1050
File Size: 28K

5.1.2 Directory 'NIST_20011211-1054':
-------------------------------------

Meeting duration: 34:49
Directory content total size: 57K

Filename: NIST_20011211-1054_NONE_NONE.txt
File Type: Quick
Transcript Text for meeting NIST_20011211-1054
File Size: 56K

5.1.3 Directory 'NIST_20020111-1012':
-------------------------------------

Meeting duration: 27:53
Directory content total size: 53K

Filename: NIST_20020111-1012_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020111-1012
File Size: 52K

5.1.4 Directory 'NIST_20020213-1012':
-------------------------------------

Meeting duration: 1:09:07
Directory content total size: 121K

Filename: NIST_20020213-1012_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020213-1012
File Size: 120K

5.1.5 Directory 'NIST_20020214-1148':
-------------------------------------

Meeting duration: 54:08
Directory content total size: 101K

Filename: NIST_20020214-1148_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020214-1148
File Size: 100K

5.1.6 Directory 'NIST_20020304-1352':
-------------------------------------

Meeting duration: 50:54
Directory content total size: 69K

Filename: NIST_20020304-1352_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020304-1352
File Size: 68K

5.1.7 Directory 'NIST_20020305-1007':
-------------------------------------

Meeting duration: 53:10
Directory content total size: 81K

Filename: NIST_20020305-1007_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020305-1007
File Size: 80K

5.1.8 Directory 'NIST_20020627-1010':
-------------------------------------

Meeting duration: 40:38
Directory content total size: 65K

Filename: NIST_20020627-1010_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020627-1010
File Size: 64K

5.1.9 Directory 'NIST_20020731-1409':
-------------------------------------

Meeting duration: 1:00:25
Directory content total size: 97K

Filename: NIST_20020731-1409_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020731-1409
File Size: 96K

5.1.10 Directory 'NIST_20020815-1316':
--------------------------------------

Meeting duration: 55:46
Directory content total size: 73K

Filename: NIST_20020815-1316_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020815-1316
File Size: 72K

5.1.11 Directory 'NIST_20020904-1322':
--------------------------------------

Meeting duration: 38:10
Directory content total size: 53K

Filename: NIST_20020904-1322_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020904-1322
File Size: 52K

5.1.12 Directory 'NIST_20020911-1033':
--------------------------------------

Meeting duration: 35:59
Directory content total size: 41K

Filename: NIST_20020911-1033_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20020911-1033
File Size: 40K

5.1.13 Directory 'NIST_20021003-1416':
--------------------------------------

Meeting duration: 58:16
Directory content total size: 69K

Filename: NIST_20021003-1416_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20021003-1416
File Size: 68K

5.1.14 Directory 'NIST_20030623-1409':
--------------------------------------

Meeting duration: 59:56
Directory content total size: 85K

Filename: NIST_20030623-1409_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20030623-1409
File Size: 84K

5.1.15 Directory 'NIST_20030702-1419':
--------------------------------------

Meeting duration: 1:05:31
Directory content total size: 109K

Filename: NIST_20030702-1419_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20030702-1419
File Size: 108K

5.1.16 Directory 'NIST_20030729-1519':
--------------------------------------

Meeting duration: 23:34
Directory content total size: 45K

Filename: NIST_20030729-1519_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20030729-1519
File Size: 44K

5.1.17 Directory 'NIST_20030925-1517':
--------------------------------------

Meeting duration: 40:08
Directory content total size: 65K

Filename: NIST_20030925-1517_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20030925-1517
File Size: 64K

5.1.18 Directory 'NIST_20031204-1125':
--------------------------------------

Meeting duration: 52:57
Directory content total size: 81K

Filename: NIST_20031204-1125_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20031204-1125
File Size: 80K

5.1.19 Directory 'NIST_20031215-1412':
--------------------------------------

Meeting duration: 1:10:11
Directory content total size: 109K

Filename: NIST_20031215-1412_NONE_NONE.txt
File Type: Quick Transcript Text for meeting NIST_20031215-1412
File Size: 108K

5.2 Directory 'metadata':
-------------------------

Directory content total size: 419K

Filename: index.html
File Type: HTML index file; contains a table listing all available
           meeting metadata
File Size: 12K

Filename: NIST_20011115-1050_NONE_NONE.html
File Type: HTML file related to meeting NIST_20011115-1050
File Size: 12K

Filename: NIST_20011211-1054_NONE_NONE.html
File Type: HTML file related to meeting NIST_20011211-1054
File Size: 8K

Filename: NIST_20020111-1012_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020111-1012
File Size: 12K

Filename: NIST_20020213-1012_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020213-1012
File Size: 12K

Filename: NIST_20020214-1148_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020214-1148
File Size: 12K

Filename: NIST_20020304-1352_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020304-1352
File Size: 12K

Filename: NIST_20020305-1007_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020305-1007
File Size: 12K

Filename: NIST_20020627-1010_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020627-1010
File Size: 12K

Filename: NIST_20020731-1409_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020731-1409
File Size: 12K

Filename: NIST_20020815-1316_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020815-1316
File Size: 12K

Filename: NIST_20020904-1322_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020904-1322
File Size: 12K

Filename: NIST_20020911-1033_NONE_NONE.html
File Type: HTML file related to meeting NIST_20020911-1033
File Size: 12K

Filename: NIST_20021003-1416_NONE_NONE.html
File Type: HTML file related to meeting NIST_20021003-1416
File Size: 12K

Filename: NIST_20030623-1409_NONE_NONE.html
File Type: HTML file related to meeting NIST_20030623-1409
File Size: 12K

Filename: NIST_20030702-1419_NONE_NONE.html
File Type: HTML file related to meeting NIST_20030702-1419
File Size: 8K

Filename: NIST_20030729-1519_NONE_NONE.html
File Type: HTML file related to meeting NIST_20030729-1519
File Size: 12K

Filename: NIST_20030925-1517_NONE_NONE.html
File Type: HTML file related to meeting NIST_20030925-1517
File Size: 12K

Filename: NIST_20031204-1125_NONE_NONE.html
File Type: HTML file related to meeting NIST_20031204-1125
File Size: 12K

Filename: NIST_20031215-1412_NONE_NONE.html
File Type: HTML file related to meeting NIST_20031215-1412
File Size: 12K

5.2.1 Directory 'image':
------------------------

Directory content total size: 169K

Filename: favicon.jpg
File Type: JPEG image file in support of HTML metadata
File Size: 4K

Filename: link.gif
File Type: GIF image file in support of HTML metadata
File Size: 4K

Filename: mr_room_layout.v0.jpg
File Type: JPEG image file in support of HTML metadata
File Size: 80K

Filename: mr_room_layout.v1.jpg
File Type: JPEG image file in support of HTML metadata
File Size: 76K

Filename: nist_webid.gif
File Type: GIF image file in support of HTML metadata
File Size: 4K

5.2.2 Directory 'styles':
-------------------------

Directory content total size: 17K

Filename: main.css
File Type: Style Sheet file in support of HTML metadata
File Size: 4K

Filename: metadata.css
File Type: Style Sheet file in support of HTML metadata
File Size: 4K

Filename: print.css
File Type: Style Sheet file in support of HTML metadata
File Size: 4K

Filename: rowhiliter.js
File Type: JavaScript in support of HTML index file
File Size: 4K

6.0 Notes:
----------

The audio
portion of this corpus is being used in the Rich Transcription evaluation
series of Speech-to-Text Transcription and metadata extraction task
evaluations. Therefore, care should be taken when subsetting this data
for combination audio/video experiments.

Two of the meetings in this corpus, NIST_20020214-1148 and
NIST_20020305-1007, were used in the RT-02 Speech-to-Text Transcription
and Speaker Segmentation evaluations. This data also constitutes a
portion of the RT-04S development test set.

Two of the meetings in this corpus, NIST_20030623-1409 and
NIST_20030925-1517, were used in the RT-04S Speech-to-Text Transcription
and Speaker Segmentation evaluations. This data constitutes a portion of
the RT-04S evaluation test set.

As of this date (June 17, 2004), this data has yet to be partitioned for
use in video extraction development and evaluation. Research sites
working with this data in both video and audio recognition tasks should
be sure to use only the proper datasets in their training and
development.

Questions should be addressed to john.garofolo@nist.gov and
martial.michel@nist.gov.
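
As an illustration of the file naming convention in Section 4.0, the
sketch below splits a file name such as NIST_20011115-1050_NONE_NONE.txt
into its meeting ID, recording start time, device ID, subject ID and
extension. It is not part of the corpus tools: the function name
parse_meeting_file, the MeetingFile record and the regular expression are
assumptions inferred from the file names listed in Section 5.0.

```python
import re
from datetime import datetime
from typing import NamedTuple


class MeetingFile(NamedTuple):
    """Fields recovered from one corpus file name (hypothetical record)."""
    meeting_id: str
    start: datetime
    dev_id: str
    subject_id: str
    extension: str


# Assumed pattern, inferred from the names listed in Section 5.0:
#   NIST_YYYYMMDD-HHMM_<device id>_<subject id>.txt or .html
_NAME_RE = re.compile(
    r"^(?P<site>NIST)_(?P<stamp>\d{8}-\d{4})"
    r"_(?P<dev>[^_.]+)_(?P<subj>[^_.]+)\.(?P<ext>txt|html)$"
)


def parse_meeting_file(name: str) -> MeetingFile:
    """Split a transcript or metadata file name into its components."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a meeting file name: {name!r}")
    # The recording start time is a date plus a 24-hour clock time.
    start = datetime.strptime(m["stamp"], "%Y%m%d-%H%M")
    meeting_id = f"{m['site']}_{m['stamp']}"
    return MeetingFile(meeting_id, start, m["dev"], m["subj"], m["ext"])
```

For example, parse_meeting_file("NIST_20011115-1050_NONE_NONE.txt")
yields the meeting ID "NIST_20011115-1050", which is also the name of the
directory holding that meeting's transcript. Names that do not follow the
convention, such as index.html, raise ValueError.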