BLLIP 1987-89 WSJ Corpus Release 1


Item Name: BLLIP 1987-89 WSJ Corpus Release 1
Authors: Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, and Mark Johnson
LDC Catalog No.: LDC2000T43
ISBN: 1-58563-165-5
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): natural language processing, parsing, tagging
Language(s): English
Language ID(s): eng
Distribution: 1 DVD
Member fee: $0 for 2000 members
Non-member Fee: US $300.00
Reduced-License Fee: US $200.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Eugene Charniak, et al. 2000. BLLIP 1987-89 WSJ Corpus Release 1. Linguistic Data Consortium, Philadelphia.

Introduction

Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style parsing of the three-year Wall Street Journal (WSJ) collection from the ACL/DCI corpus, approximately 30 million words. The parsing and part-of-speech (POS) tagging for the entire archive were done using statistical methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP.

This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
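As an illustration of what "Treebank-style" output looks like to a program, the following is a minimal Python sketch that reads one bracketed parse and extracts its (word, POS tag) pairs. The sample sentence, the S1 root label, and the function names are assumptions made for this example and are not taken from the corpus files; the actual file layout should be checked against the corpus documentation.

```python
import re

def parse_tree(text):
    """Parse one Penn Treebank-style bracketed string into nested lists."""
    # Tokens are "(", ")", or any run of characters without spaces or parens.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a bare label or word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ")"
        return node

    return read()

def tagged_words(node):
    """Yield (word, tag) pairs from preterminals such as ['NN', 'index']."""
    if isinstance(node, list):
        if len(node) == 2 and all(isinstance(x, str) for x in node):
            yield (node[1], node[0])
        else:
            for child in node:
                yield from tagged_words(child)

# Hypothetical sample sentence, invented for illustration only.
SAMPLE = ("(S1 (S (NP (DT The) (NN index)) "
          "(VP (VBD rose) (NP (CD 3) (NNS points))) (. .)))")

tree = parse_tree(SAMPLE)
pairs = list(tagged_words(tree))
print(pairs)
```

The nested-list representation keeps the sketch dependency-free; a toolkit with a dedicated tree class would serve equally well for real processing.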

Data

The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC1995T7) and Treebank-3 (LDC1999T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via ftp; they relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER.

Updates

There are no updates at this time.

Copyright

Portions © 1987-1989 Dow Jones & Company, Inc., © 2000 Trustees of the University of Pennsylvania