English News Text Treebank: Penn Treebank Revised
CatalogID: LDC2015T13
Release date: May 24, 2015
Linguistic Data Consortium
Authors: Ann Bies, Justin Mott, Colin Warner

1. Introduction

This release of the updated Wall Street Journal portion of the Penn
Treebank consists of a combination of automated and manual revisions
of the treebank annotation of the WSJ data.  The data consists of
1,203,648 word-level tokens, in 49,191 sentence-level tokens, in all
2312 of the original Penn Treebank WSJ files.

These revisions are intended specifically to bring the full Wall
Street Journal portion of the Penn Treebank (PTB WSJ) into compliance
with the agreed-upon policies and updates implemented for current
English Treebank annotation specifications at LDC, including LDC
Treebank publications such as the English Web Treebank (LDC2012T13),
OntoNotes (LDC2011T03), and English Translation Treebanks such as
English Translation Treebank: An-Nahar Newswire (LDC2012T02).
Annotation guidelines supplemental to the original Penn Treebank
guidelines are available in this release at
docs/EnglishTreebankSupplementalGuidelines.pdf.

Note that only those updates targeted by the OntoNotes WSJ update were
also targeted here, and it was not possible in the scope of this
project to correct other annotation errors.

The updated treebank annotation on this corpus was completed at LDC in
response to a gift from Google Inc.


2. Annotation

2.1 Tasks and Guidelines

This release includes revised tokenization, part-of-speech, and
syntactic treebank annotation for the Penn Treebank Wall Street
Journal data, implementing targeted updates to the annotation.

Of the total 2312 PTB WSJ files, 584 files were not included in the
OntoNotes release, and these have been updated in this package with
all of the updates implemented for OntoNotes.  The other 1728 files
were included in the OntoNotes release:

OntoNotes Release 5.0. Ralph Weischedel, Martha Palmer, Mitchell
Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann
Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert
Belvin, Ann Houston LDC Catalog No.: LDC2013T19.

These files have also been updated in this package, to the extent
necessary to make the implemented updates consistent across the
data.  Recently improved tree searching and revision methods at LDC
have made it possible to locate and implement the changes more
consistently across the full WSJ corpus.

Note that only those updates targeted by OntoNotes were also targeted
here, and it was not possible in the scope of this project to correct
other annotation errors.

"Sentence" level tokens are separated by line breaks.

"Word" level tokens are separated by white space, and each has a POS
tag.

Bracket representation is as follows:
Round brackets ( and ) in the text are represented as -LRB- and -RRB-
in the .tree files.
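
As an illustration of the bracket representation above, the following
Python sketch converts between the two forms.  The function and
variable names are hypothetical and are not part of this release, and
only the round-bracket mapping stated above is assumed.

```python
# Mapping taken from the statement above; only round brackets are covered.
BRACKET_TO_TREE = {"(": "-LRB-", ")": "-RRB-"}
TREE_TO_BRACKET = {tree: text for text, tree in BRACKET_TO_TREE.items()}


def to_tree_token(token):
    """Return the .tree form of a text token (unchanged if not a bracket)."""
    return BRACKET_TO_TREE.get(token, token)


def to_text_token(token):
    """Return the text form of a .tree token (unchanged if not a bracket)."""
    return TREE_TO_BRACKET.get(token, token)
```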

Trace and empty category indices are indicated on node labels only, as
for all treebanks produced at LDC.
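
Because indices appear on node labels only, a label can be split into
its base label and its index with a simple pattern.  The Python sketch
below is illustrative only: the name split_index is hypothetical, and
it assumes that an index is a trailing hyphen followed by digits, as in
the labels WHADVP-9 and ADVP-TMP-9 in the sample tree in section 6.2.
Any other index notation is not handled.

```python
import re

# Assumption: an index is a final "-<digits>" suffix on a node label.
_INDEX_RE = re.compile(r"^(?P<base>.+?)-(?P<index>\d+)$")


def split_index(label):
    """Return (label_without_index, index), with index None if absent."""
    match = _INDEX_RE.match(label)
    if match is None:
        return label, None
    return match.group("base"), int(match.group("index"))
```

Labels such as NP-SBJ or -NONE- carry no numeric suffix and are
returned unchanged with an index of None.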


2.2 Annotation Process

An extensive series of automated updates was developed at LDC, along
with a careful series of searches targeting updates that were only
possible to complete manually.  Manual revisions were completed at
LDC.

Lead annotators for this project were Justin Mott and Colin Warner.
Additional annotators were John Laury and Jonathan Gress-Wright.

Project tasks:

1. Update the remaining WSJ data (c. 300K words) to OntoNotes
standards, i.e., update tokenization, hyphens/HYPH, small clauses, PRO
and raising predicate verbs, clausal PRN, relative clause scope NP
levels, add NML, remove NX and NAC (except in the current remaining
usage of NAC).  This covers the Treebank IIa and the Reconciliation
revisions as instituted during the GALE program by both LDC and
OntoNotes.

a. Starting point for this section is the Treebank3 release data
(c.300K words that were not revised for OntoNotes).

b. Format, indexing, etc. were updated as well, so that the data
could be used with current annotation tools at LDC.

c. Searches were extensively revised and new searches were developed,
as relevant, to cover all cases, starting with current QC searches
developed at LDC.

d. Extensive automatic revision was possible using the above revision
queries and scripts.  The remainder were manually corrected or
manually adjudicated.

2. Remaining "bugs" and remaining disallowed node labels in the
OntoNotes WSJ section were fixed.  Searches and related methods have
improved significantly since the original OntoNotes revisions, which
allowed us to find and update cases that were missed (and harder to
find) at that time.

a. Starting point for this section was the OntoNotes release data.

b. Format, indexing, etc. were updated as well, so that the data
could be used in current annotation tools at LDC.

c. The same revised searches and new searches as above were used.

d. In the same way as above, extensive automatic revision was possible
using the above revision queries and scripts.  The remainder were
manually corrected or manually adjudicated.

3. Note that the updates above, as with the original revisions in the
OntoNotes data itself, do not include all of the known potential
errors in the PTB WSJ corpus.  In particular, the updates for this
project fixed POS errors only insofar as they interacted directly with
other necessary changes.  The original WSJ POS annotation is known to
include a number of inconsistencies.  However, updating the full WSJ
POS annotation to the current standard was beyond the scope of this
project.


3. Source Data Profile

The Wall Street Journal portion of the OntoNotes release was used for
the files that had already undergone the original OntoNotes update
(1728 files of the total 2312 PTB WSJ files):

OntoNotes Release 5.0. Ralph Weischedel, Martha Palmer, Mitchell
Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann
Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert
Belvin, Ann Houston LDC Catalog No.: LDC2013T19.

Files from the Wall Street Journal portion of the Treebank3 release
were used for the remainder of the WSJ files that had not undergone
the previous OntoNotes update (584 files of the total 2312 PTB WSJ
files):

Treebank-3.  Mitchell P. Marcus, Beatrice Santorini, Mary Ann
Marcinkiewicz and Ann Taylor LDC Catalog No.: LDC99T42.


4. Annotated Data Profile

The annotated data consists of 1,203,648 word-level tokens, in 49,191
sentence-level tokens, in all 2312 of the original Penn Treebank WSJ
files.


5. Data Directory Structure

A listing of all of the files in this release can be found in
docs/file.tbl.  A listing of the data filenames can be found in
docs/file.ids.  Annotation guidelines supplemental to the original
Penn Treebank guidelines can be found at
docs/EnglishTreebankSupplementalGuidelines.pdf.


The data directory structure is as follows:

./data/tokenized_source/<##>/ -- the tokenized WSJ text.  Word-level 
     tokens are separated by white space.  Sentence-level tokens are 
     separated by newlines and preceded by <en=#> delimiters.
./data/penntree/<##>/ -- the treebank annotation files in Penn Treebank 
     bracketed list style.

The data files are distributed in the original PTB 00-24 directories under
both ./data/ subdirectories.
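
The following Python sketch shows one way to walk the two parallel
directory trees described above.  It is not part of the release.  It
assumes that a tokenized source file and its tree file share the same
base name and differ only in extension (.txt and .tree); this naming
correspondence is an assumption and should be checked against
docs/file.tbl.  The function name paired_files is hypothetical.

```python
from pathlib import Path


def paired_files(data_root):
    """Yield (txt_path, tree_path) pairs found under data_root.

    Assumption: each tokenized_source/<##>/NAME.txt corresponds to
    penntree/<##>/NAME.tree.  Source files without a matching tree
    file are skipped.
    """
    root = Path(data_root)
    for txt_path in sorted((root / "tokenized_source").glob("*/*.txt")):
        section = txt_path.parent.name  # the <##> directory name
        tree_path = root / "penntree" / section / (txt_path.stem + ".tree")
        if tree_path.exists():
            yield txt_path, tree_path
```

A typical call would be list(paired_files("./data")).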


6. File Format Description

6.1 *.txt (in ./data/tokenized_source/<##>/)

Tokenized source files.  These contain one sentence per line, with all
token boundaries represented by whitespace.  Additionally, each line
starts with a sentence ID number of the form <en=1>.  Sample:

<en=10> I 'll post highlights from the opinion and dissents when I 'm finished .
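
A line in this format can be split into its sentence ID and its tokens
as in the Python sketch below.  The sketch is illustrative only; the
name parse_source_line is hypothetical, and it assumes that every line
begins with an <en=N> delimiter followed by whitespace-separated
tokens, as in the sample above.

```python
import re

# Assumption: each line starts with "<en=N>", N being the sentence ID.
_LINE_RE = re.compile(r"^<en=(\d+)>\s*(.*)$")

SAMPLE_LINE = (
    "<en=10> I 'll post highlights from the opinion and dissents "
    "when I 'm finished ."
)


def parse_source_line(line):
    """Return (sentence_id, tokens) for one tokenized source line."""
    match = _LINE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("line does not start with an <en=N> delimiter")
    # Tokens are separated by white space, so a plain split suffices.
    return int(match.group(1)), match.group(2).split()


sentence_id, tokens = parse_source_line(SAMPLE_LINE)
```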

6.2 *.tree (in ./data/penntree/<##>/)

Bracketed tree files following the basic form (NODE (TAG token)).  Each
sentence tree is enclosed in an outer pair of unlabeled parentheses.
Sample:

( (S (NP-SBJ (PRP I)) (VP (MD 'll) (VP (VB post) (NP (NP (NNS highlights)) (PP (IN from) (NP (DT the) (NN opinion) (CC and) (NNS dissents)))) (SBAR-TMP (WHADVP-9 (WRB when)) (S (NP-SBJ (PRP I)) (VP (VBP 'm) (ADJP-PRD (JJ finished)) (ADVP-TMP-9 (-NONE- *T*))))))) (. .)) )
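
The bracketed format can be read with a small recursive parser.  The
Python sketch below is a minimal illustration and not a tool shipped
with this release; the names parse_tree and tagged_words are
hypothetical.  It assumes well-formed input containing one tree, and
it treats the unlabeled outer parentheses as a node with an empty
label.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other
# non-space characters (a node label or a word).
_TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")

SAMPLE_TREE = (
    "( (S (NP-SBJ (PRP I)) (VP (MD 'll) (VP (VB post) "
    "(NP (NP (NNS highlights)) (PP (IN from) (NP (DT the) (NN opinion) "
    "(CC and) (NNS dissents)))) (SBAR-TMP (WHADVP-9 (WRB when)) "
    "(S (NP-SBJ (PRP I)) (VP (VBP 'm) (ADJP-PRD (JJ finished)) "
    "(ADVP-TMP-9 (-NONE- *T*))))))) (. .)) )"
)


def parse_tree(text):
    """Parse one bracketed tree into nested lists [label, child, ...]."""
    tokens = _TOKEN_RE.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected an opening bracket")
        pos += 1
        label = ""  # stays empty for the unlabeled outer brackets
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume the closing bracket
        return [label] + children

    return node()


def tagged_words(tree, keep_empty=False):
    """Return (word, tag) pairs, skipping -NONE- leaves by default."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        if label == "-NONE-" and not keep_empty:
            return []
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, list):
            pairs.extend(tagged_words(child, keep_empty))
    return pairs


words = tagged_words(parse_tree(SAMPLE_TREE))
```

Skipping -NONE- leaves by default yields only the tokens that also
appear in the tokenized source line; passing keep_empty=True retains
empty categories such as *T*.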


7. Data Validation

Care is taken to maintain the integrity of the data at each step.


8. DTDs

None.


9. Copyright Information

Portions (c) 1989 Dow Jones & Company, Inc., (c) 1999, 2009, 2011, 2013, 2015
Trustees of the University of Pennsylvania


10. Contact Information

Contact info for key project personnel: 
Ann Bies, Senior Research Coordinator, Linguistic Data Consortium, 
bies@ldc.upenn.edu


11. Update Log

This index was updated on May 26, 2015 by Ann Bies.