|Author(s):||Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz|
|LDC Catalog No.:||LDC95T7|
|Data Source(s):||varied, transcribed speech, newswire, microphone speech|
|Application(s):||parsing, natural language processing, tagging|
|Online Documentation:||LDC95T7 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. Treebank-2 LDC95T7. Web Download. Philadelphia: Linguistic Data Consortium, 1995.|
Original release: LDC Catalog No. LDC94T4B-3.1; NIST Catalog No. NA; LDC release date 4/94 (MY94).
Original Treebank Release
This CD-ROM contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional one million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project.
It also contains the first fully parsed version of the Brown Corpus, which has been completely retagged using the Penn Treebank (PTB) tag set. Tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3, and ATIS are included as well.
In addition, the CD-ROM includes source code for programs that were used by the PTB project in creating portions of the data. Source code is also included for "tgrep," a program that permits the user to search for specific constituents in tree structures. All software is provided "as is." (We have learned since publication that the tgrep source code provided on the CD-ROM is not readily portable, and that compiling tgrep requires modification of the source files. The CD-ROM does include a pre-compiled tgrep binary built for Sun SPARC systems.)
Release 2
The PTB Project Release 2 CD-ROM features the new PTB-2 bracketing style, which is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied, along with a complete style manual explaining the bracketing and new versions of tools for searching and treating bracketed data. This CD-ROM also contains all the annotated text material from the earlier Treebank Preliminary Release, including the Brown Corpus. While these materials have not all been converted to the newer bracketing style, they have been cleaned up to remove problems that had appeared in the earlier release.
The contents of Treebank Release 2 are as follows:
- One million words of 1989 Wall Street Journal material annotated in Treebank-2 style.
- A small sample of ATIS-3 material annotated in Treebank-2 style.
- 300-page style manual for Treebank-2 bracketing, as well as the part-of-speech tagging guidelines.
- The contents of the previous Treebank CD-ROM (Version 0.5), with cleaner versions of the WSJ, Brown Corpus, and ATIS material (annotated in Treebank-1 style).
- Tools for processing Treebank data, including "tgrep," a tree-searching and manipulation package (note that the usability of this release of tgrep is limited: users of Sun SPARC systems should have no problem, but others may find the software difficult or impossible to port).
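For readers who cannot build tgrep, the basic idea of searching bracketed trees for constituents can be sketched in a few lines of Python. This is not tgrep and does not reproduce its pattern language; it is a minimal illustration that assumes every constituent is written as a labeled, parenthesized unit. The sample sentence is invented for the example rather than taken from the corpus, and the names `Node`, `parse`, `find`, and `words` are hypothetical helpers. Real Treebank files may wrap each sentence in an extra unlabeled pair of parentheses, which this sketch does not handle.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # each item is a Node or a word (str)


def tokenize(text):
    # Pad parentheses with spaces so that split() isolates them as tokens.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one labeled constituent that starts at tokens[pos] == '('.

    Returns the Node and the index of the first token after its ')'.
    """
    assert tokens[pos] == "("
    node = Node(tokens[pos + 1])
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.children.append(child)
        else:
            node.children.append(tokens[pos])
            pos += 1
    return node, pos + 1


def find(node, label):
    """Yield every constituent whose label matches exactly, in tree order."""
    if node.label == label:
        yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from find(child, label)


def words(node):
    """Collect the words dominated by a constituent, left to right."""
    out = []
    for child in node.children:
        out.extend(words(child) if isinstance(child, Node) else [child])
    return out


# Invented sample in a Treebank-2-like style; NP-SBJ marks the subject.
SAMPLE = (
    "(S (NP-SBJ (DT The) (NN board)) "
    "(VP (VBD approved) (NP (DT the) (NN plan))) (. .))"
)

tree, _ = parse(tokenize(SAMPLE))
subjects = [" ".join(words(n)) for n in find(tree, "NP-SBJ")]
noun_phrases = [" ".join(words(n)) for n in find(tree, "NP")]
```

Because `find` matches labels exactly, searching for `NP` does not return `NP-SBJ` constituents. A search tool for real use would typically need to treat the function tag as a separable suffix.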
In addition, the PTB Project has provided some updates, announcements and a discussion forum for users. A file of updates and further information is available via anonymous FTP from ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2.
The PTB project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which give the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2.