
1997 Spanish Broadcast News Transcripts (HUB4-NE)

Item Name:	1997 Spanish Broadcast News Transcripts (HUB4-NE)
Author(s):	Elisa Munoz, Jennifer Alabiso, David Graff
LDC Catalog No.:	LDC98T29
ISBN:	1-58563-128-0
ISLRN:	873-191-836-513-0
DOI:	https://doi.org/10.35111/1b28-g771
Member Year(s):	1998
DCMI Type(s):	Text
Data Source(s):	broadcast news
Project(s):	Hub4
Application(s):	speech recognition
Language(s):	Spanish
Language ID(s):	spa
Online Documentation:	LDC98T29 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Munoz, Elisa, Jennifer Alabiso, and David Graff. 1997 Spanish Broadcast News Transcripts (HUB4-NE) LDC98T29. Web Download. Philadelphia: Linguistic Data Consortium, 1998.
Related Works:
	isAnnotationOf: LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
	isContinuationOf: LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE); LDC98T28 1997 English Broadcast News Transcripts (HUB4)
	hasContinuation: LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material; LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
	isSimilarWith: LDC97T22 1996 English Broadcast News Transcripts (HUB4); LDC2000S86 1998 HUB4 Broadcast News Evaluation English Test Material; LDC2000S88 1999 HUB4 Broadcast News Evaluation English Test Material

Introduction

This corpus contains a portion of the acoustic data designated as the training set for the 1997 DARPA HUB4 Spanish Benchmark. It contains speech and transcripts of 30 hours of broadcast news from the following sources: Televisa, Univision and VOA.

The corresponding speech data is released as 1997 Spanish Broadcast News Speech (HUB4-NE) (LDC98S74).

Data

All acoustic files are in NIST SPHERE format, without compression. The sample data are 16-bit linear PCM at a 16 kHz sampling rate, single channel. Most files contain 30 minutes of recorded material, and some contain approximately 60 or 120 minutes. This sampling format requires roughly two megabytes (MB) per minute of recording, so file sizes are typically around 60 MB, with some files reaching about 120 or 240 MB.
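The size figures above follow from the sampling format. As a rough check, the sketch below computes the uncompressed PCM payload for each recording length. It assumes 1 MB = 1,000,000 bytes and ignores the SPHERE header; the function name `pcm_megabytes` is illustrative only and not part of any LDC or NIST tooling.

```python
# Sampling format as stated in the corpus description.
SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM
CHANNELS = 1              # single channel


def pcm_megabytes(minutes: float) -> float:
    """Approximate uncompressed PCM size in MB (assumes 1 MB = 10**6 bytes).

    The file header is ignored, so real files are slightly larger.
    """
    total_bytes = minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return total_bytes / 1_000_000


if __name__ == "__main__":
    for minutes in (1, 30, 60, 120):
        print(f"{minutes:>3} min: about {pcm_megabytes(minutes):.1f} MB")
```

The exact values come out slightly under the rounded figures in the description (just under 2 MB per minute), which is consistent with the text's "roughly" and "around".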

The transcripts are in SGML format, using the same markup conventions that have been applied to the other 1997 Broadcast News speech corpora (in English and Mandarin).
