BLLIP 1987-89 WSJ Corpus Release 1
|Item Name:||BLLIP 1987-89 WSJ Corpus Release 1|
|Author(s):||Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, Mark Johnson|
|LDC Catalog No.:||LDC2000T43|
|Application(s):||tagging, parsing, natural language processing|
|License(s):||BLLIP 1987-89 WSJ Corpus Release 1 License Agreement|
|Online Documentation:||LDC2000T43 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Charniak, Eugene, et al. BLLIP 1987-89 WSJ Corpus Release 1 LDC2000T43. Web Download. Philadelphia: Linguistic Data Consortium, 2000.|
Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style part-of-speech (POS) tagged and parsed version of the three-year Wall Street Journal (WSJ) collection from ACL/DCI (LDC93T1), approximately 30 million words. The annotation was performed using statistically based methods developed by BLLIP researchers Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson.
This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories are distributed in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42), both of which include the raw text for each story.
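For readers unfamiliar with Treebank-style annotation, the following is a minimal sketch of how a bracketed parse might be read to recover its POS-tagged words. The sample sentence is hypothetical and not taken from the corpus, the function names `parse_bracketed` and `tagged_words` are illustrative, and the sketch assumes every node carries a label; the actual file layout of this release may differ, so consult the online documentation before relying on it.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumes every opening parenthesis is followed by a node label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening parenthesis"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing parenthesis
        return (label, children)

    return read_node()


def tagged_words(tree):
    """Yield (word, POS tag) pairs from the preterminal nodes of a tree."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)


# Hypothetical example sentence, not taken from the corpus.
sample = (
    "(S (NP (DT The) (NN company)) "
    "(VP (VBD reported) (NP (JJ higher) (NNS profits))) (. .))"
)
tree = parse_bracketed(sample)
pairs = list(tagged_words(tree))
print(pairs)
```

Each pair couples a word with the tag of the node directly above it, which is how the POS layer and the parse layer coexist in a single bracketed tree.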
There are no updates at this time.