English News Text Treebank: Penn Treebank Revised

CatalogID: LDC2015T13
Release date: May 24, 2015
Linguistic Data Consortium

Authors: Ann Bies, Justin Mott, Colin Warner

1.0 Introduction

This release of the updated Wall Street Journal portion of the Penn Treebank consists of a combination of automated and manual revisions of the treebank annotation of the WSJ data. The data consists of 1,203,648 word-level tokens, in 49,191 sentence-level tokens, in all 2312 of the original Penn Treebank WSJ files.

These revisions are intended specifically to bring the full Wall Street Journal portion of the Penn Treebank (PTB WSJ) into compliance with the agreed-upon policies and updates implemented for current English Treebank annotation specifications at LDC, including LDC Treebank publications such as the English Web Treebank (LDC2012T13), OntoNotes (LDC2011T03), and English Translation Treebanks such as English Translation Treebank: An-Nahar Newswire (LDC2012T02). Annotation guidelines supplemental to the original Penn Treebank guidelines are available in this release at docs/EnglishTreebankSupplementalGuidelines.pdf.

Note that only those updates targeted by the OntoNotes WSJ update were also targeted here, and it was not possible in the scope of this project to correct other annotation errors.

The updated treebank annotation on this corpus was completed at LDC in response to a gift from Google Inc.

2. Annotation

2.1 Tasks and Guidelines

This release includes revised tokenization, part-of-speech, and syntactic treebank annotation for the Penn Treebank Wall Street Journal data, implementing targeted updates to the annotation.
Of the total 2312 PTB WSJ files, 584 files were not included in the OntoNotes release, and these have been updated in this package with all of the updates implemented for OntoNotes. The other 1728 files were included in the OntoNotes release (OntoNotes Release 5.0, LDC Catalog No.: LDC2013T19), and these have also been updated in this package, to the extent that was necessary to make the implemented updates consistent across the data. Recently improved tree searching and revision methods at LDC have made it possible to locate and implement the changes more consistently across the full WSJ corpus.

Note that only those updates targeted by OntoNotes were also targeted here, and it was not possible in the scope of this project to correct other annotation errors.

"Sentence" level tokens are separated by line breaks.
"Word" level tokens are separated by white space, and each has a POS tag.

Bracket representation is as follows:
  ( and ) in the text are represented as -LRB- and -RRB- in the .tree files

Trace and empty category indices are indicated on node labels only, as for all treebanks produced at LDC.
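The bracket and index conventions above can be illustrated with a short Python sketch. It is not part of this release: the function names are hypothetical, the escape table covers only the two symbols named above, and treating a trailing "-<digits>" on a node label as its index is an assumption made for illustration (other index notations are not handled).

```python
import re

# Only the two escapes named in this README are mapped (assumption:
# no other bracket symbols need converting for this illustration).
BRACKET_ESCAPES = {"-LRB-": "(", "-RRB-": ")"}


def unescape_token(token):
    """Map -LRB-/-RRB- back to literal parentheses; leave other tokens alone."""
    return BRACKET_ESCAPES.get(token, token)


# Assumption: an index is a trailing "-<digits>" on a node label,
# as in WHADVP-9 or ADVP-TMP-9.
_INDEX = re.compile(r"^(?P<label>.+?)-(?P<index>\d+)$")


def split_index(label):
    """Return (label_without_index, index), with index None if absent."""
    match = _INDEX.match(label)
    if match is None:
        return label, None
    return match.group("label"), int(match.group("index"))


if __name__ == "__main__":
    print(unescape_token("-LRB-"))
    print(split_index("ADVP-TMP-9"))
    print(split_index("NP-SBJ"))
```

Because indices appear on node labels only, a node such as (ADVP-TMP-9 (-NONE- *T*)) can be matched with its antecedent (WHADVP-9 ...) by comparing the index values returned for the two labels.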
2.2 Annotation Process

An extensive series of automated updates was developed at LDC, along with a careful series of searches targeting updates that were only possible to complete manually. Manual revisions were completed at LDC.

Lead annotators for this project were Justin Mott and Colin Warner. Additional annotators were John Laury and Jonathan Gress-Wright.

Project tasks:

1. Update the remaining WSJ (c. 300K words) to OntoNotes standards: i.e., update tokenization, hyphens/HYPH, small clauses, PRO and raising predicate verbs, clausal PRN, relative clause scope NP levels, add NML, remove NX and NAC (except in the current remaining usage of NAC). This covers the Treebank IIa and the Reconciliation revisions as instituted during the GALE program by both LDC and OntoNotes.
   a. Starting point for this section was the Treebank3 release data (c. 300K words that were not revised for OntoNotes).
   b. Format, indexing, etc. were updated as well, so that the data could be used with current annotation tools at LDC.
   c. Searches were extensively revised, and new searches were developed as relevant, to cover all cases, starting with current QC searches developed at LDC.
   d. Extensive automatic revision was possible using the above revision queries and scripts. The remainder were manually corrected or manually adjudicated.

2. Remaining "bugs" and remaining disallowed node labels in the OntoNotes WSJ section were fixed. Searches and related methods, significantly improved since the original OntoNotes revisions, allowed us to find and update cases that were missed (and were harder to find) at that time.
   a. Starting point for this section was the OntoNotes release data.
   b. Format, indexing, etc. were updated as well, so that the data could be used in current annotation tools at LDC.
   c. The same revised searches and new searches as above were used.
   d. In the same way as above, extensive automatic revision was possible using the above revision queries and scripts.
      The remainder were manually corrected or manually adjudicated.

3. Note that the updates above, as with the original revisions in the OntoNotes data itself, do not include all of the known potential errors in the PTB WSJ corpus. In particular, the updates for this project fixed POS errors only insofar as they interacted directly with other necessary changes. The original WSJ POS annotation is known to include a number of inconsistencies. However, updating the full WSJ POS annotation to the current standard was beyond the scope of this project.

3. Source Data Profile

The Wall Street Journal portion of the OntoNotes release was used for the files that had already undergone the original OntoNotes update (1728 files of the total 2312 PTB WSJ files):

  OntoNotes Release 5.0.
  Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, Ann Houston
  LDC Catalog No.: LDC2013T19.

Files from the Wall Street Journal portion of the Treebank3 release were used for the remainder of the WSJ files that had not undergone the previous OntoNotes update (584 files of the total 2312 PTB WSJ files):

  Treebank-3.
  Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz and Ann Taylor
  LDC Catalog No.: LDC99T42.

4. Annotated Data Profile

The annotated data consists of 1,203,648 word-level tokens, in 49,191 sentence-level tokens, in all 2312 of the original Penn Treebank WSJ files.

5. Data Directory Structure

A listing of all of the files in this release can be found in docs/file.tbl.
A listing of the data filenames can be found in docs/file.ids.
Annotation guidelines supplemental to the original Penn Treebank guidelines can be found at docs/EnglishTreebankSupplementalGuidelines.pdf.

The data directory structure is as follows:

  ./data/tokenized_source/<##>/ -- the tokenized WSJ text. Word-level tokens are separated by white space.
      Sentence-level tokens are separated by newlines and preceded by delimiters.

  ./data/penntree/<##>/ -- the treebank annotation files in Penn Treebank bracketed list style.

The data files are distributed in the original PTB 00-24 directories under both ./data/ subdirectories.

6. File Format Description

6.1 *.txt (in ./data/tokenized_source/<##>/)

Tokenized source files. These contain one sentence per line, with all token boundaries represented by whitespace. Additionally, each line starts with a sentence ID number of the form .

Sample:

  I 'll post highlights from the opinion and dissents when I 'm finished .

6.2 *.tree (in ./data/penntree/<##>/)

Bracketed tree files following the basic form (NODE (TAG token)). Each sentence is surrounded by a pair of empty parentheses.

Sample:

  ( (S (NP-SBJ (PRP I)) (VP (MD 'll) (VP (VB post) (NP (NP (NNS highlights)) (PP (IN from) (NP (DT the) (NN opinion) (CC and) (NNS dissents)))) (SBAR-TMP (WHADVP-9 (WRB when)) (S (NP-SBJ (PRP I)) (VP (VBP 'm) (ADJP-PRD (JJ finished)) (ADVP-TMP-9 (-NONE- *T*))))))) (. .)) )

7. Data Validation

Care is taken to maintain the integrity of the data at each step.

8. DTDs

None.

9. Copyright Information

Portions (c) 1989 Dow Jones & Company, Inc., (c) 1999, 2009, 2011, 2013, 2015 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

  Ann Bies, Senior Research Coordinator, Linguistic Data Consortium, bies@ldc.upenn.edu

11. Update Log

This index was updated on May 26, 2015 by Ann Bies.
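As an illustration of the *.tree format described in section 6.2, the following Python sketch reads one bracketed tree and recovers its word-level tokens. It is not part of this release: the names parse_tree and leaves are hypothetical, and the approach relies on the statement in section 2.1 that literal parentheses in the text appear as -LRB- and -RRB-, so every "(" or ")" character in a tree is structural.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    A preterminal such as (PRP I) becomes ("PRP", ["I"]). The outer
    wrapper, which has no label, gets the empty string as its label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = ""
        if tokens[pos] != "(":  # the outer wrapper is followed directly by "("
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after tree")
    return tree


def leaves(tree, keep_empty=False):
    """Return terminals left to right, skipping -NONE- terminals by default."""
    label, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            if keep_empty or label != "-NONE-":
                words.append(child)
        else:
            words.extend(leaves(child, keep_empty))
    return words


# The sample tree from section 6.2.
SAMPLE = (
    "( (S (NP-SBJ (PRP I)) (VP (MD 'll) (VP (VB post) (NP (NP (NNS highlights)) "
    "(PP (IN from) (NP (DT the) (NN opinion) (CC and) (NNS dissents)))) "
    "(SBAR-TMP (WHADVP-9 (WRB when)) (S (NP-SBJ (PRP I)) (VP (VBP 'm) "
    "(ADJP-PRD (JJ finished)) (ADVP-TMP-9 (-NONE- *T*))))))) (. .)) )"
)

if __name__ == "__main__":
    print(" ".join(leaves(parse_tree(SAMPLE))))
```

With empty categories skipped, the recovered tokens match the tokenized sample shown in section 6.1; passing keep_empty=True also returns the *T* terminal.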