North American News Text Corpus

Item Name: North American News Text Corpus
Author(s): David Graff
LDC Catalog No.: LDC95T21
ISBN: 1-58563-053-5
ISLRN: 667-148-284-023-7
Member Year(s): 1995, 1996, 1997
DCMI Type(s): Text
Data Source(s): newswire
Project(s): TIDES, MUC, Hub4, GALE, EARS
Application(s): language modeling, information retrieval
Language(s): English
Language ID(s): eng
License(s): North American News Text Agreement
Online Documentation: LDC95T21 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Graff, David. North American News Text Corpus LDC95T21. Web Download. Philadelphia: Linguistic Data Consortium, 1995.

North American News Text Corpus is composed of English newswire text formatted using TIPSTER-style SGML markup from the following sources:

  • Los Angeles Times/Washington Post Service 05/94-08/97 - 52 million words
  • New York Times News 07/94-12/96 - 173 million words
  • Reuters News Service 04/94-12/96 - 85 million words
  • Wall Street Journal 07/94-12/96 - 40 million words
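
As a rough illustration of working with TIPSTER-style SGML, the sketch below splits a string of concatenated documents into (identifier, text) pairs. The element names DOC, DOCNO and TEXT, the inline paragraph tag, and the sample record are assumptions made for illustration; they are not taken from the corpus documentation, which should be consulted for the tags each source actually uses.

```python
import re

# Assumed TIPSTER-style element names; the tags actually used by each
# source in the corpus may differ, so check the corpus documentation.
DOC_RE = re.compile(r"<DOC\b[^>]*>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def iter_documents(sgml):
    """Yield (docno, text) pairs from a string of concatenated <DOC> elements."""
    for match in DOC_RE.finditer(sgml):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        # Replace inline markup (e.g. paragraph tags) with spaces,
        # then collapse runs of whitespace.
        plain = TAG_RE.sub(" ", text.group(1)) if text else ""
        yield (docno.group(1) if docno else None, " ".join(plain.split()))


# Made-up sample record, not real corpus content.
sample = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<TEXT>
<p>
First paragraph of a made-up article.
<p>
Second paragraph.
</TEXT>
</DOC>
"""

docs = list(iter_documents(sample))
```

Regular expressions are used here rather than an XML parser because SGML of this style may contain unclosed tags, which a strict XML parser would reject.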

The New York Times and Los Angeles Times/Washington Post services also include a range of other newspaper sources in their syndicated newswires. In addition to its two predominant sources, the Los Angeles Times/Washington Post material includes smaller amounts from the following sources:

  • Newsday
  • The Baltimore Sun
  • The Hartford Courant

New York Times articles predominate in the New York Times material, which also contains smaller amounts from the following sources:

  • Bloomberg Business News
  • The Boston Globe
  • Los Angeles Daily News
  • Fort Worth Star-Telegram
  • Newsweek
  • Cox News Service
  • The Arizona Republic
  • Seattle Post-Intelligencer
  • San Francisco Examiner
  • Houston Chronicle
  • San Francisco Chronicle
  • Economist Newspaper Ltd.
  • Hearst Newspapers

These newswire services also include small numbers of articles from a larger set of miscellaneous sources. The ones listed above appear with some frequency on a daily basis.

Additional Licensing Instructions

This 'members-only' corpus is available to current LDC members, who can request the data at the listed reduced-license fee.
