Treebank-3
Item Name: | Treebank-3 |
Author(s): | Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor |
LDC Catalog No.: | LDC99T42 |
ISBN: | 1-58563-163-9 |
ISLRN: | 141-282-691-413-2 |
DOI: | https://doi.org/10.35111/gq1x-j780 |
Member Year(s): | 1999 |
DCMI Type(s): | Text |
Data Source(s): | telephone speech, newswire, microphone speech, transcribed speech, varied |
Project(s): | TIDES, GALE |
Application(s): | parsing, natural language processing, tagging |
Language(s): | English |
Language ID(s): | eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC99T42 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999. |
Related Works: | View |
Introduction
This release contains the following Treebank-2 material:
- One million words of 1989 Wall Street Journal material annotated in Treebank II style.
- A small sample of ATIS-3 material annotated in Treebank II style.
- A fully tagged version of the Brown Corpus.
and the following new material:
- Switchboard tagged, dysfluency-annotated, and parsed text.
- Brown parsed text.
The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
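As an illustration of how predicate/argument structure can be read off the bracketing, the following is a minimal sketch, not part of the release. It assumes the usual Treebank parenthesized format, in which each constituent is written as `(LABEL child ...)` and the subject noun phrase carries the `NP-SBJ` function tag. The example sentence and the helper names `parse_bracketed`, `find_first`, and `words` are hypothetical and are not taken from the corpus or its tools.

```python
# Illustrative sentence in Treebank II style bracketing (not from the corpus).
EXAMPLE = "(S (NP-SBJ (DT The) (NN company)) (VP (VBD reported) (NP (NNS earnings))))"


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. An unlabeled outer bracket, as in "( (S ...) )",
    is tolerated and gets the empty string as its label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        if tokens[pos] in ("(", ")"):
            label = ""  # unlabeled wrapper bracket
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return read_node()


def find_first(node, prefix):
    """Return the first constituent (depth-first) whose label starts with prefix."""
    label, children = node
    if label.startswith(prefix):
        return node
    for child in children:
        if isinstance(child, tuple):
            found = find_first(child, prefix)
            if found is not None:
                return found
    return None


def words(node):
    """Collect the words under a constituent, left to right."""
    out = []
    for child in node[1]:
        if isinstance(child, tuple):
            out.extend(words(child))
        else:
            out.append(child)
    return out


if __name__ == "__main__":
    tree = parse_bracketed(EXAMPLE)
    print("subject:", words(find_first(tree, "NP-SBJ")))
    print("predicate:", words(find_first(tree, "VP")))
```

The sketch ignores details of the real annotation such as null elements and trace indices, which would need separate handling before the extracted arguments could be relied on.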
Data
The Penn Treebank (PTB) project selected 2,499 stories for syntactic annotation from a three-year Wall Street Journal (WSJ) collection of 98,732 stories. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. For users who have licensed Treebank-2, three "map" files are available as an additional download in a compressed file (pennTB_tipster_wsj_map.tar.gz); they provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.
Samples
Please view the following samples:
- Part-of-Speech Tags
- Dysfluency Annotation
- Dysfluency Annotation & Part-of-Speech Tags
- Dysfluency Annotation, Part-of-Speech Tags & Turns Joined
- Syntactic Annotation
- Syntactic Annotation & Part-of-Speech Tags
Updates
After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please see the addenda for a list of the available files.
As of October 5, 2016, 252 wsj files from Treebank-2 that were previously missing were added.
As of February 2017, 2,499 "raw" wsj files were added from Treebank-2 (LDC95T7).
Corpus downloads after these dates include these files.