English News Text Treebank: Penn Treebank Revised
Item Name: | English News Text Treebank: Penn Treebank Revised |
Author(s): | Ann Bies, Justin Mott, Colin Warner |
LDC Catalog No.: | LDC2015T13 |
ISBN: | 1-58563-724-6 |
DOI: | https://doi.org/10.35111/xpjy-at91 |
Release Date: | July 15, 2015 |
Member Year(s): | 2015 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Application(s): | parsing, tagging, part of speech tagging, natural language processing |
Language(s): | English |
Language ID(s): | eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2015T13 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015. |
Related Works: | View |
Introduction
English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the Penn Treebank annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ files.
Data
This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Corpora annotated under these specifications include English Web Treebank (LDC2012T13), OntoNotes (LDC2013T19), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire (LDC2012T02). The English Treebank Supplemental Guidelines are included in this release.
Samples
Please view these treebank and tokenized samples.
Updates
None at this time.