BLLIP 1987-89 WSJ Corpus Release 1
Item Name: | BLLIP 1987-89 WSJ Corpus Release 1 |
Author(s): | Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, Mark Johnson |
LDC Catalog No.: | LDC2000T43 |
ISBN: | 1-58563-165-5 |
ISLRN: | 233-420-716-637-7 |
DOI: | https://doi.org/10.35111/fwew-da58 |
Member Year(s): | 2000 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Project(s): | TIDES, GALE |
Application(s): | tagging, parsing, natural language processing |
Language(s): | English |
Language ID(s): | eng |
License(s): | BLLIP 1987-89 WSJ Corpus Release 1 License Agreement |
Online Documentation: | LDC2000T43 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Charniak, Eugene, et al. BLLIP 1987-89 WSJ Corpus Release 1 LDC2000T43. Web Download. Philadelphia: Linguistic Data Consortium, 2000. |
Introduction
Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style part-of-speech (POS) tagged and parsed version of the three-year Wall Street Journal (WSJ) collection from ACL/DCI (LDC93T1), approximately 30 million words. The annotation was performed using statistically based methods developed by BLLIP researchers Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson.
This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
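As an illustration of what "Treebank-style" tagged and parsed text looks like in practice, the following is a minimal Python sketch that reads one bracketed parse and lists its (word, POS tag) pairs. It assumes the general Penn Treebank bracketing convention, in which literal parentheses in the text are written as -LRB- and -RRB-. The sample sentence and the helper names `parse_bracketed` and `tagged_words` are hypothetical and are not taken from the corpus or its documentation; the actual file layout of this release should be checked against the LDC2000T43 documents.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumes Penn Treebank-style bracketing; this is an illustrative
    reader, not the tooling distributed with the corpus.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Some Treebank files wrap each tree in an unlabeled outer pair.
        if tokens[pos] in ("(", ")"):
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return read_node()


def tagged_words(node):
    """Return (word, tag) pairs in left-to-right order."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


# Hypothetical sample sentence, invented for illustration only.
sample = (
    "(S (NP (DT The) (NN company)) "
    "(VP (VBD reported) (NP (JJ higher) (NNS earnings))) (. .))"
)

if __name__ == "__main__":
    for word, tag in tagged_words(parse_bracketed(sample)):
        print(word, tag)
```

The nested-tuple representation keeps the sketch free of dependencies; a library tree type could be substituted where richer operations are needed.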
Data
The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories are distributed in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42), both of which include the raw text for each story.
Updates
There are no updates at this time.