1997 Spanish Broadcast News Speech (HUB4-NE)

Item Name: 1997 Spanish Broadcast News Speech (HUB4-NE)
Author(s): Linguistic Data Consortium
LDC Catalog No.: LDC98S74
ISBN: 1-58563-127-2
ISLRN: 684-931-706-325-2
DOI: https://doi.org/10.35111/mw6a-ab44
Member Year(s): 1998
DCMI Type(s): Sound
Sample Type: 1-channel pcm
Sample Rate: 16000
Data Source(s): broadcast news
Project(s): Hub4
Application(s): speech recognition
Language(s): Spanish
Language ID(s): spa
Online Documentation: LDC98S74 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Linguistic Data Consortium. 1997 Spanish Broadcast News Speech (HUB4-NE) LDC98S74. Web Download. Philadelphia: Linguistic Data Consortium, 1998.
Related Works: LDC98S74 (speech data), LDC98T29 (transcripts)


This corpus contains a portion of the acoustic data designated as the training set for the 1997 DARPA HUB4 Spanish Benchmark. It contains speech and transcripts of 30 hours of broadcast news from the following sources: Televisa, Univision and VOA.


All acoustic files are in uncompressed NIST SPHERE format. The sample data are 16-bit linear PCM, single channel, sampled at 16 kHz. Most files contain approximately 30 minutes of recorded material, and some contain approximately 60 or 120 minutes. The sampling format requires roughly two megabytes (MB) per minute of recording, so file sizes are typically around 60 MB, with some files reaching 120 or 240 MB.
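The size figures above follow directly from the sampling format, and the same parameters can be read back from the ASCII header at the start of each SPHERE file. The Python sketch below is illustrative only and is not part of the corpus distribution. It assumes the usual SPHERE header layout (a `NIST_1A` line, a header-size line, `name -type value` fields, then `end_head`) and the standard field names `sample_rate` and `sample_count`. The function names and the example header bytes are hypothetical.

```python
def bytes_per_minute(sample_rate=16000, bytes_per_sample=2, channels=1):
    """Uncompressed PCM storage needed for one minute of audio."""
    return sample_rate * bytes_per_sample * channels * 60


def parse_sphere_header(raw):
    """Parse the ASCII header at the start of a SPHERE file into a dict."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if len(lines) < 2 or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    # The second line gives the total header size in bytes (commonly 1024).
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_code, value = parts
        if type_code == "-i":      # integer field
            fields[name] = int(value)
        elif type_code == "-r":    # real-valued field
            fields[name] = float(value)
        else:                      # string field, e.g. "-s3 pcm"
            fields[name] = value
    return fields


def duration_minutes(fields):
    """Recording length implied by the header's sample count and rate."""
    return fields["sample_count"] / fields["sample_rate"] / 60


# Hypothetical header for a 30-minute file; real headers may carry more fields.
example_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_count -i 28800000\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
)

fields = parse_sphere_header(example_header)
per_minute = bytes_per_minute(fields["sample_rate"],
                              fields["sample_n_bytes"],
                              fields["channel_count"])
print(per_minute)                # 1920000 bytes, i.e. roughly 2 MB per minute
print(duration_minutes(fields))  # 30.0
```

For a real file, the header bytes could be obtained by opening the file in binary mode and reading the first 1024 bytes, assuming the header size is the common 1024. At 1,920,000 bytes per minute, a 30-minute file comes to about 57.6 MB, consistent with the "around 60 MB" figure above.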

The transcripts are in SGML format and use the same markup conventions that have been applied to the other 1997 Broadcast News speech corpora (in English and Mandarin). They are transmitted by FTP and are not included on the CD-ROMs with the speech data.


There are no updates at this time.

Additional Licensing Instructions

This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.
