Treebank-3
| Item Name: | Treebank-3 |
| Author(s): | Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor |
| LDC Catalog No.: | LDC99T42 |
| ISBN: | 1-58563-163-9 |
| ISLRN: | 141-282-691-413-2 |
| DOI: | https://doi.org/10.35111/gq1x-j780 |
| Member Year(s): | 1999 |
| DCMI Type(s): | Text |
| Data Source(s): | telephone speech, newswire, microphone speech, transcribed speech, varied |
| Project(s): | TIDES, GALE |
| Application(s): | parsing, natural language processing, tagging |
| Language(s): | English |
| Language ID(s): | eng |
| License(s): | LDC User Agreement for Non-Members |
| Online Documentation: | LDC99T42 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999. |
| Related Works: | View |
Introduction
This release contains the following Treebank-2 material:
- One million words of 1989 Wall Street Journal material annotated in Treebank II style.
- A small sample of ATIS-3 material annotated in Treebank II style.
- A fully tagged version of the Brown Corpus.
and the following new material:
- Switchboard tagged, dysfluency-annotated, and parsed text
- Brown parsed text
The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
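As a rough illustration of how bracketed annotation supports this kind of extraction, the following minimal Python sketch parses one bracketed string and pulls out the subject and the verb phrase. The example sentence is invented for illustration and is not taken from the corpus; the helper names (`parse_brackets`, `leaves`, `find_by_label`) are hypothetical. The sketch assumes a single tree with a labeled root and relies on the `-SBJ` function tag to identify the subject; the distributed files may need extra handling, for example for null elements.

```python
import re

def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumes the root node carries a label, e.g. "(S ...)".
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words

def find_by_label(node, prefix):
    """Yield subtrees whose label starts with the given prefix (pre-order)."""
    if isinstance(node, str):
        return
    if node[0].startswith(prefix):
        yield node
    for child in node[1]:
        yield from find_by_label(child, prefix)

# Invented example sentence, not taken from the corpus.
EXAMPLE = ("(S (NP-SBJ (DT The) (NN board)) "
           "(VP (VBD approved) (NP (DT the) (NN merger))) (. .))")

tree = parse_brackets(EXAMPLE)
subject = " ".join(leaves(next(find_by_label(tree, "NP-SBJ"))))
predicate = " ".join(leaves(next(find_by_label(tree, "VP"))))
print(subject)    # The board
print(predicate)  # approved the merger
```

The nested-tuple representation is chosen only to keep the sketch dependency-free; an established tree library could be used instead.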
Data
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2.
Samples
Please view the following samples:
- Part-of-Speech Tags
- Dysfluency Annotation
- Dysfluency Annotation & Part-of-Speech Tags
- Dysfluency Annotation, Part-of-Speech Tags & Turns Joined
- Syntactic Annotation
- Syntactic Annotation & Part-of-Speech Tags
Updates
After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please go to the addenda for a list of the files available.
As of October 5, 2016, 252 wsj files from Treebank-2 that were previously missing were added.
As of February 2017, 2,499 "raw" wsj files were added from Treebank-2 (LDC95T7).
Corpus downloads after these dates will include these missing files.