Treebank-3


Item Name: Treebank-3
Authors: Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz and Ann Taylor
LDC Catalog No.: LDC99T42
ISBN: 1-58563-163-9
Data Type: text
Data Source(s): microphone speech, newswire, telephone speech, transcribed speech, varied
Project(s): GALE, TIDES
Application(s): natural language processing, parsing, tagging
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member Fee: $0 for 1999 members
Non-member Fee: US $3150.00
Reduced-License Fee: US $1575.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz and Ann Taylor. 1999. Treebank-3. Linguistic Data Consortium, Philadelphia.

Introduction

This CD-ROM contains the following Treebank-2 material:

  • One million words of 1989 Wall Street Journal material annotated in Treebank II style.
  • A small sample of ATIS-3 material annotated in Treebank II style.
  • A fully tagged version of the Brown Corpus.

and the following new material:

  • Switchboard tagged, dysfluency-annotated, and parsed text.
  • Brown parsed text.

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
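As a rough illustration of how predicate/argument information can be read off the bracketing, the sketch below parses one bracketed tree and pulls out the constituent carrying the -SBJ (subject) function tag. The example sentence is invented for illustration and is not taken from the corpus; the function names (parse_bracketed, leaves, find_by_function) are hypothetical helpers, not part of any tool shipped with this release. It assumes well-formed, balanced brackets and does not treat null elements or coindexation specially.

```python
# Minimal sketch: read a Treebank II style bracketed tree and find the subject.
# EXAMPLE is an invented sentence, not corpus data.
EXAMPLE = "(S (NP-SBJ (DT The) (NN company)) (VP (VBD sold) (NP (DT the) (NN unit))) (. .))"


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        # Some files wrap each tree in an unlabeled outer bracket: "( (S ...))".
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node, left to right."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def find_by_function(node, tag):
    """Yield subtrees whose label carries the given function tag, e.g. 'SBJ'."""
    label, children = node
    if tag in label.split("-")[1:]:
        yield node
    for child in children:
        if not isinstance(child, str):
            yield from find_by_function(child, tag)


tree = parse_bracketed(EXAMPLE)
subject = next(find_by_function(tree, "SBJ"))
print(subject[0], leaves(subject))
```

The same traversal could be pointed at other function tags to recover further arguments; a production reader would also need to handle trace elements and multi-tree files.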

Data

The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three map files, available in a compressed file via FTP, give the correspondence between the 2,499 PTB filenames and the WSJ DOCNO strings in TIPSTER.

Updates

After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. PDF copies of the documentation files are listed in the addenda.

Copyright