|Author(s):||Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor|
|LDC Catalog No.:||LDC99T42|
|Data Source(s):||telephone speech, newswire, microphone speech, transcribed speech, varied|
|Application(s):||parsing, natural language processing, tagging|
LDC User Agreement for Non-Members
|Online Documentation:||LDC99T42 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Marcus, Mitchell, et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.|
This release contains the following Treebank-2 material:
- One million words of 1989 Wall Street Journal material annotated in Treebank II style.
- A small sample of ATIS-3 material annotated in Treebank II style.
- A fully tagged version of the Brown Corpus.
and the following new material:
- Switchboard tagged, dysfluency-annotated, and parsed text
- Brown parsed text
The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
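To make the idea of extracting predicate/argument structure from the bracketing concrete, here is a minimal Python sketch. It assumes a labeled-bracket format with functional tags such as NP-SBJ; the example sentence is invented for illustration and is not taken from the corpus, and the helper names (`parse_bracketed`, `find_first`, `leaves`) are hypothetical. It ignores null elements and trace indices, which a real reader of the corpus would need to handle.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        # An unlabeled outer bracket, e.g. "( (S ...))", gets an empty label.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]


def find_first(node, prefix):
    """Depth-first search for the first constituent whose label starts with prefix."""
    if isinstance(node, str):
        return None
    label, children = node
    if label.startswith(prefix):
        return node
    for child in children:
        hit = find_first(child, prefix)
        if hit is not None:
            return hit
    return None


# Invented example sentence (an assumption, not corpus data).
example = "(S (NP-SBJ (DT The) (NN company)) (VP (VBD sold) (NP (DT the) (NN unit))) (. .))"
tree = parse_bracketed(example)

vp = find_first(tree, "VP")
subject = leaves(find_first(tree, "NP-SBJ"))
verb = leaves(find_first(vp, "VB"))
obj = leaves(find_first(vp, "NP"))

print(subject, verb, obj)
```

The sketch reads the subject from the NP-SBJ constituent and the verb and object from inside the VP, which is roughly the kind of simple predicate/argument reading the bracketing is meant to support.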
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which give the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2.
After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please go to addenda for a list of the files available.
As of October 5, 2016, 252 previously missing wsj files from Treebank-2 were added.
As of February 2017, 2,499 "raw" wsj files from Treebank-2 (LDC95T7) were added.
Corpus downloads after these dates include these files.