INVENTORY and DESCRIPTIONS
- parsed/
- parsed corpora
The quality of the parses varies somewhat
according to when they were corrected, which
annotator corrected them, and what type of material
is in the texts.
- atis/ Spring 1991
- Air Travel Information System transcripts
The material has a limited number of sentence
types. It was corrected once, but not revised.
- brown/ November 1991 - November 1992
- Brown Corpus
The Brown Corpus is the cleanest and most consistent
subset of the parsed corpus, as it was recently
revised. It also represents the greatest number of
genres.
- ibm/ Summer 1992
- IBM computer manual extracts
The material was chosen for its limited
vocabulary. It was corrected and revised and is
very clean and consistent.
- muc3/ Winter 1990-1991
- 3rd Message Understanding Conference extracts,
consisting of material translated into English.
Since the files were all revised by the same
annotator, the parsing is relatively consistent and
clean. The language of the texts is sometimes a
little odd.
- wbur/ Spring 1991
- Transcripts of broadcasts by WBUR, a public
radio station in Boston.
The material appears to be taken from scripted
broadcasts and probably should not be considered
as speech data.
- wsj/ September 1990-November 1991
- '88-'89 Wall Street Journal articles
The material was corrected but not revised. The
low-numbered directories contain the project's
earliest corrected parses and may not be very
consistent or usable. The quality of the parsing
also varies according to the annotator who
corrected it. Directories 08 and 16 have serious
problems.
- misc/
- Variety of texts provided by the ACL Data
Collection Initiative, including Department of
Agriculture pamphlets and passages from 19th
century literary works.
The material was corrected but not revised, and
so is not very clean or reliable. There is no t9
file.
- tagged/
- Tagged Corpora
In general, the quality of the tagged corpora
varies less than that of the parsed corpora, so all
material should be usable.
- atis/ Spring 1991
- Air Travel Information System transcripts
The part-of-speech tags were corrected once.
- brown/ Fall 1989-Winter 1990
- Brown Corpus
As the Brown Corpus texts were the first to have
part-of-speech tags corrected, there may be a few
inconsistencies in some files.
- doe/ Fall 1989
- Department of Energy abstracts
Some of the abstracts appear to have been written
by non-native speakers. There are two copies
corrected by different annotators.
- ibm/ Spring 1992
- IBM computer manual extracts.
The part-of-speech tags were all corrected by
one annotator.
- muc3/ Winter 1990-1991
- 3rd Message Understanding Conference extracts.
The part-of-speech tags were corrected as
consistently as the peculiarities of the texts
allowed.
- misc/ Winter 1990
- Variety of texts provided by the ACL Data
Collection Initiative, including Department of
Agriculture pamphlets and passages from 19th
century literary works. This directory is
organized a little differently from the one in ...parsed/.
- source1/
- One version of these files.
- source2/
- Another version of these files.
- best/
- The best (cleanest) version of these files.
These files were made by adjudicating the
source1 and source2 files.
- wbur/ Spring 1991
- Transcripts of broadcasts by WBUR, a public
radio station in Boston.
The part-of-speech tags were corrected once.
- wsj/ Winter-Spring 1990
- '88-'89 Wall Street Journal articles.
The part-of-speech tags were corrected once.
- combined/
- Combined Corpora.
These corpora have been automatically created
by inserting the part-of-speech tags from a
tagged text file (i.e. .pos file) into a
parsed text file (i.e. .par file). The tags
are inserted as nodes immediately dominating
the terminals. The -NONE- node means that
there is no part of speech for that terminal
symbol. As of this release there were still a
few sporadic errors we didn't have time to
remove. If you are curious, the files
COMBINE.LOG in the parsed/ subdirectories
contain a listing of the combination process.
Lines beginning with "WARNING" indicate errors.
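The combination step described above can be illustrated with a small
sketch. It is not the program that produced these files. It assumes a
parse is held in memory as a nested list and that the tagged text has
already been split into (word, tag) pairs; the function names
insert_tags and warning_lines, and the sample sentence, are
hypothetical. Each terminal is wrapped in a new node labelled with its
tag, so that the tag immediately dominates the terminal, and a terminal
with no matching tagged word receives -NONE-. The second function keeps
only the log lines that begin with "WARNING".

```python
def insert_tags(tree, tagged):
    """Return a copy of `tree` with each terminal wrapped in a tag node.

    `tree` is a nested list [label, child, child, ...]; a child is either
    another such list (a constituent) or a string (a terminal).
    `tagged` is a list of (word, tag) pairs in sentence order.
    A terminal that does not match the next tagged word gets -NONE-.
    (This in-memory layout is an assumption made for illustration.)
    """
    remaining = list(tagged)

    def walk(node):
        if isinstance(node, str):
            if remaining and remaining[0][0] == node:
                _, tag = remaining.pop(0)
            else:
                tag = "-NONE-"
            # The tag becomes a node immediately dominating the terminal.
            return [tag, node]
        return [node[0]] + [walk(child) for child in node[1:]]

    return walk(tree)


def warning_lines(lines):
    """Return the lines of a combination log that begin with WARNING."""
    return [line.rstrip("\n") for line in lines if line.startswith("WARNING")]


if __name__ == "__main__":
    parse = ["S", ["NP", "Flights"], ["VP", "leave", ["NP", "*"]]]
    tags = [("Flights", "NNS"), ("leave", "VBP")]
    print(insert_tags(parse, tags))
```

To scan a real log, warning_lines can be passed an open file, for
example the COMBINE.LOG in one of the parsed/ subdirectories.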
- tgrepabl/
- Tgrepable Corpora
These are encoded corpora designed for use
with 'tgrep'. This directory should be empty
on the CD-ROM you have received. These
datafiles are part of the tgrep distribution
and can only be installed by installing the
tgrep distribution. Please read the README
file in the tools/tgrep/ directory.
- tools/
- Source Code for Various Programs.
These programs were designed to be run on UNIX
machines. There are three packages, all of them
tarred and compressed.
- doc/
- Documentation
This directory contains information about who
the annotators of the Penn Treebank are and
what they did, as well as LaTeX files of the
Penn Treebank's Guide to Parsing and Guide to
Tagging.
The work reported here was primarily funded by DARPA and AFOSR jointly
under grant No. AFOSR-90-006., with additional support by DARPA grant
No. N0014-85-K0018 and by ARO grant No. DAAL 03-89-C0031 PRI. Seed
money was provided by the General Electric Corporation under grant
No. J01746000. We gratefully acknowledge this support. David
Magerman, Richard Pito and Steven Shapiro deserve our special thanks
for their administrative and programming support. We are also
grateful to AT&T Bell Labs for permission to use Kenneth Church's
PARTS part-of-speech labeller and Donald Hindle's Fidditch parser.