BLLIP 1987-89 WSJ Corpus Release 1

Item Name: BLLIP 1987-89 WSJ Corpus Release 1
Author(s): Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, Mark Johnson
LDC Catalog No.: LDC2000T43
ISBN: 1-58563-165-5
ISLRN: 233-420-716-637-7
Member Year(s): 2000
DCMI Type(s): Text
Data Source(s): newswire
Project(s): TIDES, GALE
Application(s): tagging, parsing, natural language processing
Language(s): English
Language ID(s): eng
License(s): BLLIP 1987-89 WSJ Corpus Release 1 License Agreement
Online Documentation: LDC2000T43 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Charniak, Eugene, et al. BLLIP 1987-89 WSJ Corpus Release 1 LDC2000T43. DVD. Philadelphia: Linguistic Data Consortium, 2000.


Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style parsing of the three-year Wall Street Journal (WSJ) collection from the ACL/DCI corpus, approximately 30 million words. The parsing and part-of-speech (POS) tagging for the entire archive were done using statistical methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP.

This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
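As an illustration of what "Treebank-style" output looks like in practice, the following minimal sketch reads one bracketed parse and lists its words with their POS tags. The example sentence, the function names `parse_bracketed` and `tagged_words`, and the assumption of one complete bracketed tree per string are all hypothetical; they are not taken from the corpus files or their documentation.

```python
import re

def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Tokens are "(", ")", or any run of non-space, non-parenthesis characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]          # constituent label or POS tag
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                         # consume the closing ")"
        return (label, children)

    return read_node()

def tagged_words(node):
    """Collect (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs

# Invented example sentence, not drawn from the corpus.
example = "(S (NP (DT The) (NN market)) (VP (VBD rose)) (. .))"
tree = parse_bracketed(example)
print(tagged_words(tree))
```

The same nested structure can be walked to extract constituents rather than tags; real files may contain several trees or extra wrapper brackets, which this sketch does not handle.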


The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via FTP; they give the correspondence between the 2,499 PTB filenames and the matching WSJ DOCNO strings in TIPSTER.


There are no updates at this time.
