Broadcast News Lattices


Item Name: Broadcast News Lattices
Authors: Geoffrey Zweig, Damianos Karakos and Patrick Nguyen
LDC Catalog No.: LDC2011T06
ISBN: 1-58563-578-2
Release Date: Apr 15, 2011
Data Type: text
Data Source(s): broadcast news
Application(s): speech recognition
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2011 members
Non-member Fee: US $1000.00
Reduced-License Fee: US $500.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Geoffrey Zweig, Damianos Karakos and Patrick Nguyen. 2011. Broadcast News Lattices. Linguistic Data Consortium, Philadelphia.

Introduction

Broadcast News Lattices, Linguistic Data Consortium (LDC) catalog number LDC2011T06 and ISBN 1-58563-578-2, was developed by researchers at Microsoft and Johns Hopkins University (JHU) for the Johns Hopkins 2010 Summer Workshop on Speech Recognition with Conditional Random Fields. The lattices were generated using the IBM Attila speech recognition toolkit and were derived from transcripts of approximately 400 hours of English broadcast news recordings. They are intended to be used for training and decoding with Microsoft's segmental CRF toolkit for speech recognition, SCARF.

The goal of the JHU 2010 workshop was to advance the state of the art in core speech recognition by developing new kinds of features for use in a Segmental Conditional Random Field (SCRF). The SCRF approach generalizes Conditional Random Fields to operate at the segment level, rather than at the traditional frame level. Every segment is labeled directly with a word. Features are then extracted, each of which measures some form of consistency between the underlying audio and the word hypothesis for a segment. These are combined in a log-linear model to produce the posterior probability of a word sequence given the audio.
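To make the log-linear combination concrete, here is a toy sketch in Python. It is not the SCARF implementation: a real SCRF also sums over segmentations and normalizes over all paths in a lattice, whereas this example enumerates a handful of hypotheses explicitly. The function name, the feature values, and the weights are invented for illustration.

```python
import math


def log_linear_posterior(features, weights):
    """Return P(hypothesis | audio) under a log-linear model.

    features: dict mapping each hypothesis to its list of feature values
    weights:  list of weights, one per feature
    """
    # Score of a hypothesis = weighted sum of its features.
    scores = [sum(w * f for w, f in zip(weights, feats))
              for feats in features.values()]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {hyp: e / total for hyp, e in zip(features, exps)}


# Hypothetical feature values for two competing word sequences.
toy_features = {
    "A KID WITH A CRUSH": [1.0, 0.5],
    "A KIT WITH A CRASH": [0.2, 0.1],
}
toy_weights = [2.0, 1.0]
posterior = log_linear_posterior(toy_features, toy_weights)
```

The hypothesis whose features agree better with the audio receives the larger share of the probability mass, and the posteriors sum to one.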

Data

Broadcast News Lattices consists of training and test material, the source data for which was taken from various corpora distributed by LDC.

Training Data

The training lattices total 152251 and were derived from the following data sets:

1996 English Broadcast News Speech LDC97S44 and 1996 English Broadcast News Transcripts (HUB4) LDC97T22 (104 hours)

1997 English Broadcast News Speech (HUB4) LDC98S71 and 1997 English Broadcast News Transcripts (HUB4) LDC98T28 (97 hours)

TDT4 Multilingual Broadcast News Speech Corpus LDC2005S11 and TDT4 Multilingual Text and Annotations LDC2005T16 (300 hours)

The lattices can be related to the original audio files via the file train.db.gz which lists for each segment a tag-name, segment number, the original audio file, channel (always 0), start time, and end time (in seconds). A sample line is as follows:

19960510_NPR_ATC#Ailene_Leblanc 0001 19960510_NPR_ATC.sph 0 76.767 89.404

This sample line corresponds to the release lattice labeled:

19960510_NPR_ATC#Ailene_Leblanc@0001.dc
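As a minimal sketch of how this index might be used, the following Python parses one line of train.db.gz into its six fields and derives the corresponding lattice label. The field order follows the description above. The class and function names are illustrative, the whitespace-delimited layout is an assumption based on the sample line, and the `tag@segment.dc` naming pattern is inferred from the single example shown.

```python
import gzip
from dataclasses import dataclass


@dataclass
class Segment:
    tag: str          # tag-name, e.g. show and speaker
    number: str       # segment number, kept as text to preserve leading zeros
    audio_file: str   # original audio file
    channel: int      # always 0 according to the documentation
    start: float      # start time in seconds
    end: float        # end time in seconds

    def lattice_label(self, suffix="dc"):
        # Assumed naming pattern: <tag>@<segment number>.<suffix>
        return f"{self.tag}@{self.number}.{suffix}"


def parse_db_line(line):
    """Parse one whitespace-delimited line of the db file."""
    tag, number, audio_file, channel, start, end = line.split()
    return Segment(tag, number, audio_file, int(channel), float(start), float(end))


def read_db(path):
    """Yield a Segment for every non-empty line of a gzipped db file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_db_line(line)


seg = parse_db_line(
    "19960510_NPR_ATC#Ailene_Leblanc 0001 19960510_NPR_ATC.sph 0 76.767 89.404"
)
print(seg.lattice_label())
```

Calling `read_db("train.db.gz")` would iterate over the whole index in the same way, assuming the file is in the working directory.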

The file train.Bdc contains denominator lattices. The file train.Bnc has the numerator lattices containing the subset of paths consistent with the training transcriptions. The file train.Btr consists of the transcriptions. The file train.Bbase contains the baseline (one-best) word detections from the Attila system.

The lattices were generated from an acoustic model that included LDA+MLLT, VTLN, fMLLR-based SAT training, fMMI and mMMI discriminative training, and MLLR.

The lattices are annotated with a field indicating the results of a second confirmatory decoding made with an independent speech recognizer. When there was a correspondence between a lattice link and the 1-best secondary output, the link was annotated with +1. Silence links are denoted with 0 and all others with -1. Correspondence was computed by finding the midpoint of a lattice link and comparing the link label with that of the word in the secondary decoding at that position. Thus, there are some cases where the same word shifted slightly in time receives a different confirmation score.
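The midpoint rule can be sketched as follows. This is an illustration of the description above, not the code used to build the corpus. The representation of the secondary decoding as `(start, end, word)` tuples, the `~SIL` silence label, the function names, and the choice to test for silence before looking up the midpoint are all assumptions.

```python
SILENCE = "~SIL"  # assumed silence label


def word_at(decoding, t):
    """Return the word of the secondary 1-best whose span covers time t, else None."""
    for start, end, word in decoding:
        if start <= t <= end:
            return word
    return None


def confirm(link_start, link_end, link_label, decoding):
    """Return +1 for a confirmed link, 0 for silence, and -1 otherwise."""
    if link_label == SILENCE:
        return 0
    midpoint = (link_start + link_end) / 2.0
    return 1 if word_at(decoding, midpoint) == link_label else -1


# Hypothetical secondary 1-best output, with times in frames.
secondary = [(0, 10, "A"), (11, 30, "KID")]

confirmed = confirm(8, 30, "KID", secondary)      # midpoint 19 falls inside KID
shifted = confirm(0, 18, "KID", secondary)        # midpoint 9 falls inside A
silence = confirm(31, 40, SILENCE, secondary)
```

The second call shows the effect noted above: the same word, shifted in time so that its midpoint lands on a neighboring word, receives a different confirmation score.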

Test Data

The test lattices are derived from the English broadcast news material in 2003 NIST Rich Transcription Evaluation Data LDC2007S10. Bbase and Bdc files are provided, along with the db file rt03.db.gz to link the segments to times in the original waveform. Scoring scripts may be obtained from the NIST Rich Transcription website.

SCARF Toolkit

The SCARF toolkit is available for download from the SCARF website.

Related Publications

A full description of the lattice generation process can be found in Zweig et al., Speech Recognition with Segmental Conditional Random Fields: Final Report from the 2010 JHU Summer Workshop, MSR Technical Report MSR-TR-2010-173.

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2011T06.

Samples

Source Denominator Lattices
20010206_1830_1900_ABC_WNT#aaron_brown@0001.base
# baseline
2 A 5 HALF 20 CENTURY 56 AGO 95 LORRAINE 132 WAGNER 175 WAS 207 A 219 KID 239 WITH 263 A 270 CRUSH 300 THE 376 OBJECT 416 OF 446 HER 458 AFFECTION 497 AND 565 HER 583 CONSIDERABLE 637 ATTENTION 716 WAS 817 A 826 HUNKY 847 YOUNG 880 ACTOR 909 NAMED 934 RONALD 960 REAGAN 995 1012

20010206_1830_1900_ABC_WNT#aaron_brown@0001.dc
1 2 confirm=0
3 5 A confirm=1
6 31 HALF confirm=1
32 77 CENTURY confirm=1
78 110 AGO confirm=1
111 151 LORRAINE confirm=1
111 151 LORAINE confirm=-1
152 196 WAGNER confirm=1
197 212 WAS confirm=1
197 215 WAS confirm=1
213 221 THE confirm=-1
216 219 A confirm=-1
220 253 KIT confirm=-1
220 254 KIT confirm=-1
220 255 KID confirm=-1
222 253 KIT confirm=-1
222 254 KIT confirm=-1
222 255 KID confirm=1
254 265 WITH confirm=1
254 267 WITH confirm=1
255 265 WITH confirm=1
255 267 WITH confirm=1
256 265 WITH confirm=1
256 267 WITH confirm=1
266 272 THE confirm=-1
268 270 A confirm=-1
271 327 CRUSH confirm=-1
271 327 CRASH confirm=-1
273 327 CRUSH confirm=-1
328 360 ~SIL confirm=0
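A minimal Python sketch for reading the links in the .dc sample above follows. The layout it assumes (one link per line, consisting of a start position, an end position, an optional word label, and a `confirm=` field) is inferred from this sample alone and is not a specification of the file format. The pattern and function names are illustrative.

```python
import re

# start, end, optional label, confirmation score (assumed layout)
LINK = re.compile(r"^(\d+)\s+(\d+)(?:\s+(\S+))?\s+confirm=(-?\d+)$")


def parse_link(line):
    """Parse one lattice link into (start, end, label, confirm).

    The label is None when the line carries no word, as in the
    first link of the sample.
    """
    match = LINK.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized link line: {line!r}")
    start, end, label, score = match.groups()
    return int(start), int(end), label, int(score)


sample_links = [
    "1 2 confirm=0",
    "111 151 LORRAINE confirm=1",
    "111 151 LORAINE confirm=-1",
    "328 360 ~SIL confirm=0",
]
parsed = [parse_link(line) for line in sample_links]
```

Grouping the parsed tuples by their start and end positions shows the competing hypotheses for a span, such as LORRAINE and LORAINE at 111 to 151.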

Content Copyright

Portions © 1996-1998, 2000-2001 American Broadcasting Company, Inc., © 1996-1998, 2000-2001 Cable News Network LP, LLLP, © 2000-2001 National Broadcasting Company, © 1996-1998 National Public Radio, Inc., © 1996-1998 National Satellite Cable Corporation, © 1996-1998, 2005, 2007, 2011 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.