English News Text Treebank: Penn Treebank Revised

Item Name: English News Text Treebank: Penn Treebank Revised
Author(s): Ann Bies, Justin Mott, Colin Warner
LDC Catalog No.: LDC2015T13
ISBN: 1-58563-724-6
Release Date: July 15, 2015
Member Year(s): 2015
DCMI Type(s): Text
Data Source(s): newswire
Application(s): parsing, tagging, part of speech tagging, natural language processing
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2015T13 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015.

Introduction

English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the Penn Treebank annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ files.

Data

This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Examples of corpora annotated under these specifications include English Web Treebank (LDC2012T13), OntoNotes (LDC2013T19), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire (LDC2012T02). The English Treebank Supplemental Guidelines are included in this release.

Samples

Please view these treebank and tokenized samples.

Updates

None at this time.
