BLLIP 1987-89 WSJ Corpus Release 1
|Item Name:||BLLIP 1987-89 WSJ Corpus Release 1|
|Author(s):||Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, Mark Johnson|
|LDC Catalog No.:||LDC2000T43|
|Application(s):||tagging, parsing, natural language processing|
|License(s):||BLLIP 1987-89 WSJ Corpus Release 1 License Agreement|
|Online Documentation:||LDC2000T43 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Charniak, Eugene, et al. BLLIP 1987-89 WSJ Corpus Release 1 LDC2000T43. DVD. Philadelphia: Linguistic Data Consortium, 2000.|
Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style parsing of the three-year Wall Street Journal (WSJ) collection from the ACL/DCI corpus, approximately 30 million words. The parsing and part-of-speech (POS) tagging for the entire archive were done using statistically based methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP.
This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via FTP; they relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER.