BLLIP 1987-89 WSJ Corpus Release 1

Item Name: BLLIP 1987-89 WSJ Corpus Release 1
Authors: Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, and Mark Johnson
LDC Catalog No.: LDC2000T43
ISBN: 1-58563-165-5
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): natural language processing, parsing, tagging
Language(s): English
Language ID(s): eng
Distribution: 1 DVD
Member Fee: $0 for 2000 members
Non-member Fee: US $300.00
Reduced-License Fee: US $200.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Eugene Charniak, et al.
BLLIP 1987-89 WSJ Corpus Release 1
Linguistic Data Consortium, Philadelphia


Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style parsing of the three-year Wall Street Journal (WSJ) collection from the ACL/DCI corpus, approximately 30 million words in total. The parsing and part-of-speech (POS) tagging for the entire archive were done using statistical methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP.

This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
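
For readers unfamiliar with Treebank-style output, the following is a minimal sketch of how a bracketed parse can be read to recover word/POS-tag pairs. The sample sentence is invented for illustration and is not taken from the corpus; the function names (parse_tree, tagged_words) are hypothetical, and the exact file layout of this release is not assumed.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Tokens are "(", ")", or any run of characters without spaces or brackets.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # skip the opening "("
        label = ""
        # Some Treebank files use an unlabeled outer bracket; allow that.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return read_node()


def tagged_words(node):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = node
    # A preterminal node has exactly one child, and that child is a word.
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Invented sample in Treebank bracket notation (an assumption, not corpus data).
SAMPLE = "(S (NP (DT The) (NN market)) (VP (VBD rose)) (. .))"

if __name__ == "__main__":
    for word, tag in tagged_words(parse_tree(SAMPLE)):
        print(f"{word}\t{tag}")
```

The sketch keeps the tree as plain tuples so that it has no dependencies; a real workflow would more likely rely on an existing Treebank reader.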


The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via ftp; they give the correspondence between the 2,499 PTB filenames and the matching WSJ DOCNO strings in TIPSTER.


There are no updates at this time.


Portions © 1987-1989 Dow Jones & Company, Inc., © 2000 Trustees of the University of Pennsylvania