PENN Treebank Project README file

INVENTORY and DESCRIPTIONS

parsed/: Parsed Corpora. The quality of the parses varies somewhat according to when they were corrected, which annotator corrected them, and what type of material is in the texts.
atis/ Spring 1991: Air Travel Information System transcripts. The material has a limited number of sentence types. It was corrected once, but not revised.
brown/ November 1991 - November 1992: Brown Corpus. The Brown Corpus is the cleanest and most consistent subset of the parsed corpus, as it was recently revised. It also represents the greatest number of genres.
ibm/ Summer 1992: IBM computer manual extracts. The material was chosen for its limited vocabulary. It was corrected and revised and is very clean and consistent.
muc3/ Winter 1990-1991: 3rd Message Understanding Conference extracts, consisting of material translated into English. Since the files were all revised by the same annotator, the parsing is relatively consistent and clean. The language of the texts is sometimes a little odd.
wbur/ Spring 1991: Transcripts of broadcasts by WBUR, a public radio station in Boston. The material appears to be taken from scripted broadcasts and probably should not be considered as speech data.
wsj/ September 1990-November 1991: '88-'89 Wall Street Journal articles. The material was corrected but not revised. The low-numbered directories contain the project's earliest corrected parses and may not be very consistent or usable. The quality of the parsing also varies according to the annotator who corrected it. Directories 08 and 16 have serious problems.
misc/: Variety of texts provided by the ACL Data Collection Initiative, including Department of Agriculture pamphlets and passages from 19th century literary works. The material was corrected but not revised, and so is not very clean or reliable. There is no t9 file.
tagged/: Tagged Corpora. In general, the quality of the tagged corpora varies less than that of the parsed corpora, so all material should be usable.
atis/ Spring 1991: Air Travel Information System transcripts. The part-of-speech tags were corrected once.
brown/ Fall 1989-Winter 1990: Brown Corpus. As the Brown Corpus texts were the first to have part-of-speech tags corrected, there may be a few inconsistencies in some files.
doe/ Fall 1989: Department of Energy abstracts. Some of the abstracts appear to have been written by non-native speakers. There are two copies corrected by different annotators.
ibm/ Spring 1992: IBM computer manual extracts. The part-of-speech tags were all corrected by one annotator.
muc3/ Winter 1990-1991: 3rd Message Understanding Conference extracts. The part-of-speech tags were corrected as consistently as the peculiarities of the texts allowed.
misc/ Winter 1990: Variety of texts provided by the ACL Data Collection Initiative, including Department of Agriculture pamphlets and passages from 19th century literary works. This directory is organized a little differently from the one in ...parsed/.
source1/: One version of these files.
source2/: Another version of these files.
best/: The best (cleanest) version of these files. These files were made by adjudicating the source1 and source2 files.
wbur/ Spring 1991: Transcripts of broadcasts by WBUR, a public radio station in Boston. The part-of-speech tags were corrected once.
wsj/ Winter-Spring 1990: '88-'89 Wall Street Journal articles. The part-of-speech tags were corrected once.
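For the source1/ and source2/ versions under misc/, the kind of comparison that comes before adjudication can be sketched as follows. This is only an illustration, not the procedure the project used. It assumes that each token in a tagged file is written as word/TAG, and the function names tagged_tokens and disagreements are hypothetical.

```python
def tagged_tokens(text):
    """Return (word, tag) pairs for every word/TAG token in text.

    Assumption: tokens are whitespace-separated and the tag follows
    the last "/" in the token. Anything without a "/" is skipped.
    """
    return [tuple(tok.rsplit("/", 1)) for tok in text.split() if "/" in tok]


def disagreements(version_a, version_b):
    """List the positions where two taggings of the same text differ.

    Each result is (position, word, tag_in_a, tag_in_b).
    """
    diffs = []
    pairs = zip(tagged_tokens(version_a), tagged_tokens(version_b))
    for pos, ((word_a, tag_a), (word_b, tag_b)) in enumerate(pairs):
        if word_a != word_b:
            # The two versions no longer line up word for word.
            raise ValueError(f"token {pos}: {word_a!r} vs {word_b!r}")
        if tag_a != tag_b:
            diffs.append((pos, word_a, tag_a, tag_b))
    return diffs


if __name__ == "__main__":
    first = "The/DT run/NN ended/VBD"
    second = "The/DT run/VB ended/VBD"
    for pos, word, tag_a, tag_b in disagreements(first, second):
        print(pos, word, tag_a, tag_b)
```

Each reported position is a token on which the two annotators chose different tags, which is where a human adjudicator would have to decide.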
combined/: Combined Corpora. These corpora have been automatically created by inserting the part-of-speech tags from a tagged text file (i.e. .pos file) into a parsed text file (i.e. .par file). The tags are inserted as nodes immediately dominating the terminals. The -NONE- node means that there is no part of speech for that terminal symbol. As of this release there were still a few sporadic errors we didn't have time to remove. If you are curious, the files COMBINE.LOG in the parsed/ subdirectories contain a listing of the combination process. Lines beginning with "WARNING" indicate errors.
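The combination step described above can be sketched roughly as follows. This is not the program that produced combined/. It is a minimal illustration which assumes that the parse is a plain bracketed string, that tagged tokens are written word/TAG, and that terminals and tagged words are matched strictly left to right. The names tokenize and combine are hypothetical.

```python
import re


def tokenize(tree):
    """Split a bracketed parse into "(", ")" and symbol tokens."""
    return re.findall(r"\(|\)|[^\s()]+", tree)


def combine(parsed, tagged):
    """Insert tags as nodes immediately dominating the terminals.

    A terminal that does not match the next tagged word (for example
    a null element) is placed under a -NONE- node instead.
    """
    pairs = [tok.rsplit("/", 1) for tok in tagged.split() if "/" in tok]
    out = []
    next_pair = 0
    prev = None
    for tok in tokenize(parsed):
        if tok in ("(", ")") or prev == "(":
            # A bracket, or the label that follows an opening bracket.
            out.append(tok)
        elif next_pair < len(pairs) and pairs[next_pair][0] == tok:
            out.append(f"({pairs[next_pair][1]} {tok})")
            next_pair += 1
        else:
            out.append(f"(-NONE- {tok})")
        prev = tok
    # Re-join the tokens and tidy the spacing around brackets.
    return " ".join(out).replace("( ", "(").replace(" )", ")")


if __name__ == "__main__":
    print(combine("(S (NP John) (VP runs))", "John/NNP runs/VBZ"))
    print(combine("(S (NP *) (VP runs))", "runs/VBZ"))
```

In the second call the terminal * has no entry in the tagged text, so it ends up under a -NONE- node.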
tgrepabl/: Tgrepable Corpora. These are encoded corpora designed for use with 'tgrep'. This directory should be empty on the CD-ROM you have received. These datafiles are part of the tgrep distribution and can only be installed by installing the tgrep distribution. Please read the README file in the tools/tgrep/ directory.
tools/: Source Code for Various Programs. These programs were designed to be run on UNIX machines. There are three packages, all of which are tarred and compressed.
doc/: Documentation. This directory contains information about who the annotators of the Penn Treebank are and what they did, as well as LaTeX files of the Penn Treebank's Guide to Parsing and Guide to Tagging.

The work reported here was primarily funded by DARPA and AFOSR jointly under grant No. AFOSR-90-006, with additional support by DARPA grant No. N0014-85-K0018 and by ARO grant No. DAAL 03-89-C0031 PRI. Seed money was provided by the General Electric Corporation under grant No. J01746000. We gratefully acknowledge this support. David Magerman, Richard Pito and Steven Shapiro deserve our special thanks for their administrative and programming support. We are also grateful to AT&T Bell Labs for permission to use Kenneth Church's PARTS part-of-speech labeller and Donald Hindle's Fidditch parser.