Treebank-3

Item Name: Treebank-3
Author(s): Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor
LDC Catalog No.: LDC99T42
ISBN: 1-58563-163-9
ISLRN: 141-282-691-413-2
DOI: https://doi.org/10.35111/gq1x-j780
Member Year(s): 1999
DCMI Type(s): Text
Data Source(s): telephone speech, newswire, microphone speech, transcribed speech, varied
Project(s): TIDES, GALE
Application(s): parsing, natural language processing, tagging
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC99T42 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.

Introduction

This release contains the following Treebank-2 material:

  • One million words of 1989 Wall Street Journal material annotated in Treebank II style.
  • A small sample of ATIS-3 material annotated in Treebank II style.
  • A fully tagged version of the Brown Corpus.

and the following new material:

  • Switchboard tagged, dysfluency-annotated, and parsed text.
  • Brown parsed text.

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
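To illustrate how a labeled bracketing lends itself to this kind of extraction, here is a minimal Python sketch. It is not part of the release or its tooling. The sample sentence is invented for illustration, the function names are hypothetical, and the code assumes the commonly documented Treebank II conventions: an NP-SBJ label for subjects, a VP label for the predicate, and -NONE- for empty elements.

```python
import re


def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    An unlabeled outer bracket, as in "( (S ...))", is tolerated and
    yields an empty-string label.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        if tokens[pos] == "(":
            label = ""  # unlabeled wrapper bracket
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, skipping empty elements (-NONE-)."""
    label, children = tree
    if label == "-NONE-":
        return []
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def subject_and_predicate(tree):
    """Return (subject text, predicate text) for a clause node.

    Assumes the subject is the first child labeled NP-SBJ and the
    predicate is the first child labeled VP.
    """
    _, children = tree
    subject = predicate = None
    for child in children:
        if isinstance(child, str):
            continue
        if child[0].startswith("NP-SBJ") and subject is None:
            subject = " ".join(leaves(child))
        elif child[0] == "VP" and predicate is None:
            predicate = " ".join(leaves(child))
    return subject, predicate


# Invented example sentence, not taken from the corpus.
sample = "(S (NP-SBJ (DT The) (NN board)) (VP (VBD approved) (NP (DT the) (NN plan))) (. .))"
print(subject_and_predicate(parse_brackets(sample)))
# prints ('The board', 'approved the plan')
```

The sketch handles one tree per string; reading whole files from the release would additionally require splitting each file into individual trees.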

Data

The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2, provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.

Updates

After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, see the addenda for a list of the files available.

As of October 5, 2016, 252 WSJ files from Treebank-2 that were previously missing were added.

As of February 2017, 2,499 "raw" WSJ files were added from Treebank-2 (LDC95T7).

Corpus downloads after these dates will include these missing files.
