|Author(s):||Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor|
|LDC Catalog No.:||LDC99T42|
|Data Source(s):||telephone speech, newswire, microphone speech, transcribed speech, varied|
|Application(s):||parsing, natural language processing, tagging|
LDC User Agreement for Non-Members
|Online Documentation:||LDC99T42 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Marcus, Mitchell, et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.|
This CD-ROM contains the following Treebank-2 material:
- One million words of 1989 Wall Street Journal material annotated in Treebank II style.
- A small sample of ATIS-3 material annotated in Treebank II style.
- A fully tagged version of the Brown Corpus.
- Switchboard tagged, dysfluency-annotated, and parsed text.
- Brown parsed text.
The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
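As a rough illustration of what extracting predicate/argument structure from the bracketing can look like, the following minimal sketch reads one bracketed tree and pulls out a subject, verb, and object. The example sentence is invented for illustration and is not taken from the corpus, and the helper names (`parse_tree`, `leaves`, `predicate_argument`) are hypothetical. The sketch assumes the common Treebank II labels `NP-SBJ`, `VP`, `NP`, and `-NONE-`, and it handles only a simple clause, not the full annotation scheme.

```python
# Illustrative sentence (invented, not from the corpus) in Treebank-style bracketing.
EXAMPLE = "(S (NP-SBJ (DT The) (NN company)) (VP (VBD sold) (NP (DT the) (NN unit))) (. .))"


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. An unlabeled outer bracket, as in "( (S ...))",
    is given the empty label "".
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, skipping empty elements tagged -NONE-."""
    label, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            if label != "-NONE-":
                words.append(child)
        else:
            words.extend(leaves(child))
    return words


def predicate_argument(node):
    """Extract subject, verb, and object from a simple clause (assumed shape)."""
    # Unwrap an unlabeled outer bracket if present.
    while node[0] == "" and len(node[1]) == 1 and not isinstance(node[1][0], str):
        node = node[1][0]
    subtrees = [c for c in node[1] if not isinstance(c, str)]
    subjects = [c for c in subtrees if c[0].startswith("NP-SBJ")]
    vps = [c for c in subtrees if c[0] == "VP"]
    if not vps:
        return None
    vp_subtrees = [c for c in vps[0][1] if not isinstance(c, str)]
    verbs = [c for c in vp_subtrees if c[0].startswith("VB")]
    objects = [c for c in vp_subtrees if c[0] == "NP"]
    return {
        "subject": " ".join(leaves(subjects[0])) if subjects else None,
        "predicate": " ".join(leaves(verbs[0])) if verbs else None,
        "object": " ".join(leaves(objects[0])) if objects else None,
    }


print(predicate_argument(parse_tree(EXAMPLE)))
```

For real work on the corpus files, an established reader such as the one in NLTK is likely a better starting point than a hand-written parser; the sketch is only meant to show how function tags like `-SBJ` make the arguments easy to locate.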
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three map files, available in a compressed file via FTP, give the correspondence between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.
After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, see the addenda, which list the files available.