1996 CSR HUB4 Language Model
Item Name: | 1996 CSR HUB4 Language Model |
Author(s): | Robert MacIntyre |
LDC Catalog No.: | LDC98T31 |
ISBN: | 1-58563-122-1 |
ISLRN: | 905-430-625-113-0 |
DOI: | https://doi.org/10.35111/jvpt-9682 |
Member Year(s): | 1998 |
DCMI Type(s): | Text |
Data Source(s): | broadcast news |
Project(s): | Hub4 |
Application(s): | speech recognition |
Language(s): | English |
Language ID(s): | eng |
License(s): | 1996 CSR Hub-4 Language Model Agreement |
Online Documentation: | LDC98T31 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | MacIntyre, Robert. 1996 CSR HUB4 Language Model LDC98T31. Web Download. Philadelphia: Linguistic Data Consortium, 1998. |
Introduction
This corpus contains data from transcribed news broadcasts, designated for use in the baseline language model (LM) for the 1996 CSR HUB4 Evaluation.
Data
The LDC obtained the bulk of the data from broadcast news CD-ROMs produced by Primary Source Media, Inc. This portion covers the period from January 1992 to April 1996 and contains approximately one gigabyte of uncompressed data. This release also includes about 36 megabytes of material received on floppy disks, covering the period from late May through June 1996, in a somewhat different format from the bulk of the data.
The text data are presented in two forms: (1) a relatively unprocessed ("raw" or "sentence-tagged") form and (2) a fully processed ("conditioned," "verbalized-punctuation") form. The "raw" form includes the header and footer information accompanying the articles, such as network, show name, headline, copyright, credits and so forth; the text and ancillary data are presented in a fairly consistent (though simple) SGML format. The "processed" form contains only the text content of the articles, together with SGML tags to mark the boundaries of articles, paragraphs and sentences; the text content has been modified by replacing numeric strings (dates, dollar amounts, quantities) with orthographic strings (e.g. "nineteen ninety six"), replacing abbreviations ("Inc.," "Ltd.," "Corp.," etc.) with corresponding full-word forms and replacing punctuation characters with corresponding word tokens (e.g. "," becomes "COMMA"). This release also includes an archive of the tools used to create the "processed" form of the data.
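The conditioning described above can be illustrated with a minimal Python sketch. It is not the tool archive distributed with this release. The function name `condition`, the abbreviation table, and every punctuation token other than "COMMA" are assumptions made for illustration, and the expansion of numeric strings is left out entirely.

```python
# Hypothetical lookup tables, for illustration only. The tools shipped with
# the corpus define their own mappings; only "," -> "COMMA" is taken from
# the description above.
ABBREVIATIONS = {
    "Inc.": "Incorporated",
    "Ltd.": "Limited",
    "Corp.": "Corporation",
}
PUNCTUATION = {
    ",": "COMMA",
    ".": "PERIOD",          # assumed token name
    "?": "QUESTION-MARK",   # assumed token name
}


def condition(sentence: str) -> str:
    """Return a verbalized-punctuation version of a whitespace-split sentence."""
    tokens = []
    for word in sentence.split():
        trailing = []
        # Peel punctuation off the end of the word until what remains is a
        # known abbreviation or no longer ends in a punctuation character.
        while word and word not in ABBREVIATIONS and word[-1] in PUNCTUATION:
            trailing.append(PUNCTUATION[word[-1]])
            word = word[:-1]
        if word:
            # Expand the abbreviation if there is one, else keep the word.
            tokens.append(ABBREVIATIONS.get(word, word))
        # Punctuation was collected right to left, so restore its order.
        tokens.extend(reversed(trailing))
    return " ".join(tokens)


if __name__ == "__main__":
    print(condition("Acme Corp., based in Ohio, grew."))
```

Run as a script, this prints "Acme Corporation COMMA based in Ohio COMMA grew PERIOD". The sketch treats a period attached to a known abbreviation as part of the abbreviation, so a sentence-final "Inc." yields no separate period token; a real conditioning pipeline would need a rule for that ambiguity.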
Updates
There are no updates at this time.
Additional Licensing Instructions
This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.