North American News Text, Complete

Item Name:	North American News Text, Complete
Author(s):	David Graff
LDC Catalog No.:	LDC2008T15
ISBN:	1-58563-483-2
ISLRN:	273-098-424-167-4
DOI:	https://doi.org/10.35111/mvpj-r744
Release Date:	August 19, 2008
Member Year(s):	2008
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	natural language processing, linguistic analysis, machine learning, language modeling, information retrieval, information extraction, information detection, parsing
Language(s):	English
Language ID(s):	eng
License(s):	North American News Text, Complete
Online Documentation:	LDC2008T15 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Graff, David. North American News Text, Complete LDC2008T15. Web Download. Philadelphia: Linguistic Data Consortium, 2008.
Related Works:	isSameAs LDC95T21 North American News Text Corpus; hasVersion LDC2008T16 North American News Text, General Release; hasAnnotation LDC2008T13 BLLIP North American News Text, Complete; hasContinuation LDC98T30 North American News Text Supplement

Introduction

North American News Text, Complete is a collection of English news text from the Los Angeles Times, Washington Post, New York Times, Reuters and the Wall Street Journal. This corpus was originally released in 1995 as the North American News Text Corpus (LDC95T21) and is reissued to complement the release of the Brown Laboratory for Linguistic Information Processing (BLLIP) North American News Text sets (LDC2008T13, LDC2008T14), which consist of Penn Treebank-style parsing of that news text.

North American News Text is reissued in two versions: North American News Text, Complete LDC2008T15, the members-only original version, now available as a 2008 Membership Year corpus; and North American News Text, General Release LDC2008T16 (which does not include text from the Wall Street Journal), available to nonmembers for the first time. Both publications have been restructured so that their directory structure is identical to that of the BLLIP releases.

Data

The table below contains a breakdown of the sources, epochs and word counts for the data in the North American News Text releases:

Source	Dates	# Words (millions)
Los Angeles Times & Washington Post	May, 1994 - August 1997	52
New York Times News & Syndicate	July, 1994 - December 1996	173
Reuters News Service (General and Financial)	April, 1994 - December 1996	85
Wall Street Journal (not in General Release)	July, 1994 - December 1996	40
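The word counts in the table can be added up to estimate the size of each release. The short sketch below does only that arithmetic; the dictionary keys are shortened source names chosen for illustration, and the totals are approximate because the table figures are rounded to millions.

```python
# Word counts in millions, copied from the table above.
WORDS_MILLIONS = {
    "Los Angeles Times & Washington Post": 52,
    "New York Times News & Syndicate": 173,
    "Reuters News Service": 85,
    "Wall Street Journal": 40,
}

# The Complete release contains all four sources.
complete_total = sum(WORDS_MILLIONS.values())

# The General Release omits the Wall Street Journal material.
general_total = complete_total - WORDS_MILLIONS["Wall Street Journal"]

print(f"Complete: about {complete_total} million words")
print(f"General Release: about {general_total} million words")
```

Under these figures the Complete release comes to roughly 350 million words and the General Release to roughly 310 million.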

The New York Times and the Los Angeles Times/Washington Post services include a range of other newspaper sources in their syndicated newswires. The Los Angeles Times/Washington Post material in this corpus includes some news text from the following sources:

Newsday
The Baltimore Sun
The Hartford Courant

The New York Times material in this corpus contains some data from the following sources, although New York Times articles predominate:

Bloomberg Business News
The Boston Globe
Los Angeles Daily News
Fort Worth Star-Telegram
Newsweek
Cox News Service
The Arizona Republic
Seattle Post-Intelligencer
San Francisco Examiner
Houston Chronicle
San Francisco Chronicle
Economist Newspaper Ltd.
Hearst Newspapers

The text content of each data file (after decompression with the GNU gunzip utility) consists of plain ASCII character data with SGML tags to indicate article boundaries and the organization of information within each article.
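As a minimal sketch of the decompression step, the data files can also be read directly from Python rather than unpacked on disk first. This assumes the files are gzip-compressed, as the reference to the GNU utility suggests; the function name `read_data_file` is hypothetical and not part of the corpus documentation.

```python
import gzip


def read_data_file(path):
    """Return the decompressed contents of one data file as a string."""
    # The documentation describes the data as plain ASCII. errors="replace"
    # is a defensive assumption so that a stray non-ASCII byte does not
    # abort the read.
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        return handle.read()
```

Reading a whole file into memory keeps the sketch short; for very large files a line-by-line loop over the same handle may be preferable.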

There are differences among the five primary newswire sources in terms of the number and types of SGML tags used in the text, but the following tag structure is common to all data sets:

<DOC>    -- start of a new article
...      -- some variety of "header" tags appears here
<TEXT>   -- start of the text content of the article
<p>      -- all paragraph boundaries are marked by this tag
...      -- text data as it is provided by the newswire service
</TEXT>  -- end of text content of the article
...      -- some variety of "trailer" tags appears here
</DOC>   -- end of article

In general, the differences in format among the various newswire sources will be found in the SGML tags that appear between <DOC> and <TEXT>, and those that appear between </TEXT> and </DOC>. The actual text content of articles (the region between <TEXT> and </TEXT>) is consistent in format across sources, except for some uses of the SGML "&..;" notation to represent special characters in the data. For example, "&MD;" is used in the "latwp" material to represent the "em-dash" character, which is typically used to separate the "dateline" from the opening sentence in the first paragraph of each article. There may also be differences in how quotation marks are rendered.
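The common tag structure lends itself to a simple extraction routine. The sketch below is one possible approach, not an official tool. It assumes the article, text, and paragraph tags are spelled `<DOC>`, `<TEXT>`, and `<p>` with no attributes, and it replaces "&MD;" with a double hyphen as an arbitrary choice. The function name `iter_articles`, the `<DOCID>` header tag, and the sample text are made up for illustration.

```python
import re

# Assumed tag spellings; matching is case-insensitive to be tolerant.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)
PARA_RE = re.compile(r"<p>", re.IGNORECASE)


def iter_articles(sgml):
    """Yield one list of paragraph strings per article found in `sgml`."""
    for doc in DOC_RE.finditer(sgml):
        text = TEXT_RE.search(doc.group(1))
        if text is None:
            continue  # no text region found; skipped in this sketch
        # Replace the "&MD;" entity; "--" is an arbitrary stand-in.
        body = text.group(1).replace("&MD;", "--")
        # Split on paragraph markers and collapse internal whitespace.
        paragraphs = [" ".join(part.split()) for part in PARA_RE.split(body)]
        yield [p for p in paragraphs if p]


# Invented sample; the header tag shown here is hypothetical.
sample = """<DOC>
<DOCID> example-0001 </DOCID>
<TEXT>
<p>
WASHINGTON &MD; First paragraph
of the article.
<p>
Second paragraph.
</TEXT>
</DOC>
"""

for paragraphs in iter_articles(sample):
    print(paragraphs)
```

Header and trailer tags are ignored here because they differ among sources; a source-specific parser would be needed to use them.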

As this re-release is intended to complement the BLLIP North American News Text releases, the directory structure of this corpus is identical to that of the BLLIP publications.

Additional Licensing Instructions

This 'members-only' corpus is available to current LDC members who can request the data at the listed reduced-license fee.

Copyright

Portions ©1994-1996 Dow Jones & Company, Inc., © 1994-1997 Los Angeles Times-Washington Post News Service, Inc., © 1994-1996 New York Times, © 1994-1996 Reuters America, Inc., © 1995-1997, 2008 Trustees of the University of Pennsylvania
