English News Text Treebank: Penn Treebank Revised
| Item Name: | English News Text Treebank: Penn Treebank Revised |
| Author(s): | Ann Bies, Justin Mott, Colin Warner |
| LDC Catalog No.: | LDC2015T13 |
| ISBN: | 1-58563-724-6 |
| DOI: | https://doi.org/10.35111/xpjy-at91 |
| Release Date: | July 15, 2015 |
| Member Year(s): | 2015 |
| DCMI Type(s): | Text |
| Data Source(s): | newswire |
| Application(s): | parsing, tagging, part of speech tagging, natural language processing |
| Language(s): | English |
| Language ID(s): | eng |
| License(s): | LDC User Agreement for Non-Members |
| Online Documentation: | LDC2015T13 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015. |
| Related Works: | View |
Introduction
English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the Penn Treebank annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ files.
Data
This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Examples of treebanks annotated under those specifications include English Web Treebank (LDC2012T13), OntoNotes (LDC2013T19), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire (LDC2012T02). English Treebank Supplemental Guidelines are included in this release.
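As an illustration of how the three annotation layers (tokenization, part-of-speech, and syntactic bracketing) fit together, the following is a minimal Python sketch that reads one tree in the bracketed notation traditionally used for Penn Treebank annotation and recovers the token and part-of-speech layers from it. It assumes the trees follow that bracketed style; the sample sentence and the function names `parse_tree` and `tagged_leaves` are hypothetical and are not taken from the release, whose actual file layout should be checked against the included documentation.

```python
# Illustrative sketch only: assumes Penn Treebank style bracketed trees.
# The sample tree below is invented for this example, not taken from the corpus.

SAMPLE = "( (S (NP-SBJ (DT The) (NN market)) (VP (VBD rose)) (. .)) )"


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Literal parentheses in the text are conventionally escaped
    # (for example -LRB- and -RRB-), so splitting on brackets is safe here.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # The outermost wrapper conventionally has no label.
        if tokens[pos] in ("(", ")"):
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf string)
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    return node()


def tagged_leaves(tree):
    """Return (word, tag) pairs in sentence order.

    Empty elements such as traces would appear here with their own tag;
    this sketch does not filter them out.
    """
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_leaves(child))
    return pairs


if __name__ == "__main__":
    for word, tag in tagged_leaves(parse_tree(SAMPLE)):
        print(word, tag)
```

The nested tuples keep the syntactic layer intact, while `tagged_leaves` flattens it to the tokenization and part-of-speech layers, which is usually enough for tagging experiments.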
Samples
Please view the treebank and tokenized samples.
Updates
None at this time.