The material on this disk is copyrighted and is subject to the terms and conditions of the TIPSTER User Agreement, which must be signed in order to obtain a copy of the CD-ROM on which this data is to be found.
However, this material is not in the format in which it came from the publisher. It has been automatically analyzed and rewritten in various ways to make it more suitable for use in a research environment. The copyright for these modifications is assigned to the Association for Computational Linguistics.
These disks represent a revision of the first set of disks. There are several files which detail the changes between the previous set of disks and these disks. They are in the files:
The major data structures are illustrated below in a sample of the DOE data.
<DOC>
<DOCNO> DOE1-96-0001 </DOCNO>
<TEXT>
One of the weakest aspects of Prolog is in its access to clauses.
This weakness is lamentable as it makes one of Prolog's greatest strengths,
its ability to treat programs as data and data as programs, difficult to
exploit. This paper proposes modifications to Prolog and shows how they
circumvent important problems in Prolog programming in a practical way.
For example, the proposed modifications permit Prolog programs that perform
efficient database query (join) processing, coroutining, and abstract machine
interpretation. These modifications have been used successfully at UCLA,
and should be easy to implement within any existing Prolog system.
</TEXT>
</DOC>
Every document is bracketed by <DOC> </DOC> tags and has a unique document number, bracketed by <DOCNO> </DOCNO> tags. Each opening tag starts at the first character of a new line, but the matching closing tag may be on the same line or on a later line.
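The tag layout described above can be read with a few regular expressions. What follows is only a minimal sketch in Python, not a tool supplied on these disks: the function name iter_docs is hypothetical, the file is assumed to be already uncompressed and read into a string, and the patterns tolerate optional spaces inside the angle brackets as a precaution.

```python
import re
from typing import Iterator, Tuple


def _tag(name: str, close: bool = False) -> str:
    """Regex source for an opening or closing tag, allowing inner spaces."""
    slash = r"/\s*" if close else ""
    return rf"<\s*{slash}{name}\s*>"


# Opening tags are anchored to the start of a line; closing tags may
# appear on the same line or on a later one, so '.' must span newlines.
FLAGS = re.MULTILINE | re.DOTALL
DOC_RE = re.compile(rf"^{_tag('DOC')}(.*?){_tag('DOC', close=True)}", FLAGS)
DOCNO_RE = re.compile(rf"^{_tag('DOCNO')}(.*?){_tag('DOCNO', close=True)}", FLAGS)
TEXT_RE = re.compile(rf"^{_tag('TEXT')}(.*?){_tag('TEXT', close=True)}", FLAGS)


def iter_docs(data: str) -> Iterator[Tuple[str, str]]:
    """Yield (document number, text) for each DOC element in data."""
    for doc in DOC_RE.finditer(data):
        body = doc.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip documents without a DOCNO field
        text = TEXT_RE.search(body)
        content = text.group(1).strip() if text else ""
        yield docno.group(1).strip(), content
```

Calling list(iter_docs(data)) on the contents of one uncompressed file would give one (DOCNO, TEXT) pair per document. Fields other than DOCNO and TEXT are ignored in this sketch.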
The datasets have all been compressed using the UNIX compress utility, and are stored in chunks of about 1 megabyte each (uncompressed size).
Both as part of the philosophy of leaving the data as close to the original as possible, and because it is impossible to check all the data manually, there are many "errors" in the data. These range from errors in the original data, such as noise in the AP newswires, or other typographical errors, to errors in the reformatting done at the University of Pennsylvania and at NIST. The error-checking has concentrated on allowing readability of the data rather than on correcting content. This means that there have been automated checks for control characters, for correct matching of the beginning and end tags, and for complete DOC and DOCNO fields. The types of "errors" remaining include fragment sentences, strange formatting around tables or other "non-textual" items, misspellings, missing fields (that are generally missing from the data), etc.
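The kinds of automated checks mentioned above (control characters, matching of beginning and end tags, and presence of a DOCNO in every DOC) might look roughly like the following. This is an illustrative Python sketch only, not the checking software actually used at the University of Pennsylvania or NIST; the function name check_data and the message wording are hypothetical, and any uppercase word in angle brackets is assumed to be a tag.

```python
import re
from typing import List

# Assumption: a tag is an uppercase name in angle brackets, optionally
# preceded by '/', with optional spaces inside the brackets.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Z][A-Z0-9]*)\s*>")
# Control characters other than tab, newline, and carriage return.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def check_data(data: str) -> List[str]:
    """Return a list of problem descriptions; an empty list means clean."""
    problems = []

    # 1. Control characters, reported by line number.
    for lineno, line in enumerate(data.split("\n"), start=1):
        if CONTROL_RE.search(line):
            problems.append(f"line {lineno}: control character")

    # 2. Tag matching, using a stack of currently open tags.
    # 3. Every DOC must contain a DOCNO.
    stack = []
    docno_seen = False
    for match in TAG_RE.finditer(data):
        closing = match.group(1) == "/"
        name = match.group(2)
        if not closing:
            if name == "DOC":
                docno_seen = False
            elif name == "DOCNO":
                docno_seen = True
            stack.append(name)
        elif not stack or stack[-1] != name:
            problems.append(f"unmatched </{name}>")
        else:
            stack.pop()
            if name == "DOC" and not docno_seen:
                problems.append("DOC without DOCNO")

    problems.extend(f"unclosed <{name}>" for name in stack)
    return problems
```

Like the checks described above, this sketch looks only at structure and readability; it makes no attempt to detect misspellings, fragment sentences, or other content errors.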
Special thanks should go to Charles Augustine at the University of Pennsylvania, who patiently changed most of the data from unintelligible formats into formats that can be easily processed by simple conversion programs.