BLLIP 1987-89 WSJ Corpus Release 1

Item Name: BLLIP 1987-89 WSJ Corpus Release 1
Author(s): Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, Mark Johnson
LDC Catalog No.: LDC2000T43
ISBN: 1-58563-165-5
ISLRN: 233-420-716-637-7
Member Year(s): 2000
DCMI Type(s): Text
Data Source(s): newswire
Project(s): TIDES, GALE
Application(s): tagging, parsing, natural language processing
Language(s): English
Language ID(s): eng
License(s): BLLIP 1987-89 WSJ Corpus Release 1 License Agreement
Online Documentation: LDC2000T43 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Charniak, Eugene, et al. BLLIP 1987-89 WSJ Corpus Release 1 LDC2000T43. Web Download. Philadelphia: Linguistic Data Consortium, 2000.

Introduction

The Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style part-of-speech (POS) tagged and parsed version of the three-year Wall Street Journal (WSJ) collection from ACL/DCI (LDC93T1), approximately 30 million words. The annotation was performed using statistical methods developed by BLLIP researchers Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson.
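
As an illustration of what Treebank-style annotation looks like in practice, the following minimal Python sketch reads one bracketed parse and recovers its (word, POS tag) pairs. It assumes the usual Penn Treebank bracketing convention; the sample sentence and the function names parse_tree and tagged_words are made up for this example and are not taken from the corpus or its documentation, and the actual file layout of the release is not described here.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        # Some Treebank files wrap each tree in an unlabeled outer bracket.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def tagged_words(node):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = node
    # A preterminal node has exactly one child, and that child is a word.
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


# Hypothetical sample sentence, not drawn from the corpus.
SAMPLE = ("(S (NP (DT The) (NN company)) "
          "(VP (VBD reported) (NP (JJ higher) (NNS profits))) (. .))")

tree = parse_tree(SAMPLE)
print(tagged_words(tree))
```

The parser keeps the tree as plain nested tuples so that both the POS tags and the phrase structure remain available; an existing Treebank reader from an NLP toolkit could be substituted for it.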

This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.

Data

The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories are distributed in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42), both of which include the raw text for each story.

Updates

There are no updates at this time.
