A Note to the User

The material on this disk is copyrighted and is subject to the terms and conditions of the TIPSTER User Agreement, which must be signed in order to obtain a copy of the CD-ROM on which this data is to be found.

The material is not, however, in the format in which it came from the publisher. It has been automatically analyzed and rewritten in various ways to make it more suitable for use in a research environment. The copyright for these modifications is assigned to the Association for Computational Linguistics.

These disks are a revision of the first set of disks. The changes between the previous set and these disks are detailed in the following files:

README.doc
A detailed list of changes that were made to disk1, disk2 and disk3, as well as to the qrels files
README.d1
A mapping of old document numbers to new document numbers which were changed in the ziff data on disk one.
README.d2
A mapping of old document numbers to new document numbers which were changed in the ziff data on disk two.
README.tag
A mapping of old tag names to new tag names which were changed on all three disks.

The format uses a labelled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). The SGML DTDs used for verification at NIST are included on the CDs. All five datasets share the same major structures for easier reading, but differ in their minor structures. The philosophy in the formatting both at the University of Pennsylvania and at NIST has been to preserve as much of the original structure as possible, but to provide enough consistency to allow simple decoding of the data.

The major data structures are illustrated below in a sample of the DOE data.

<DOC>
<DOCNO> DOE1-96-0001 </DOCNO>
<TEXT>
One of the weakest aspects of Prolog is in its access to clauses. This weakness is lamentable as it makes one of Prolog's greatest strengths, its ability to treat programs as data and data as programs, difficult to exploit. This paper proposes modifications to Prolog and shows how they circumvent important problems in Prolog programming in a practical way. For example, the proposed modifications permit Prolog programs that perform efficient database query (join) processing, coroutining, and abstract machine interpretation. These modifications have been used successfully at UCLA, and should be easy to implement within any existing Prolog system.
</TEXT>
</DOC>

Every document is bracketed by <DOC> </DOC> tags and has a unique document number, bracketed by <DOCNO> </DOCNO> tags. Each beginning tag starts at the first character of a new line, but its ending tag may be on the same line or on a later line.
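
The bracketing rules above are enough to split an uncompressed chunk into documents. The following Python fragment is a minimal sketch only, not a program supplied on the disks. The names iter_docs, DOC_OPEN, DOC_CLOSE and DOCNO_RE are hypothetical, and the patterns assume (as a precaution, not as something the format guarantees) that a tag may carry optional spaces inside its angle brackets.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical patterns; optional whitespace inside the brackets is tolerated.
DOC_OPEN = re.compile(r"<\s*DOC\s*>")
DOC_CLOSE = re.compile(r"<\s*/DOC\s*>")
DOCNO_RE = re.compile(r"<\s*DOCNO\s*>\s*(.*?)\s*<\s*/DOCNO\s*>", re.DOTALL)


def iter_docs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (document number, raw document text) for each DOC in the input."""
    buf = None  # lines of the document being collected, or None between documents
    for line in lines:
        if DOC_OPEN.match(line):
            # A beginning tag is the first thing on its line.
            buf = [line]
        elif buf is not None:
            buf.append(line)
        if buf is not None and DOC_CLOSE.search(line):
            raw = "".join(buf)
            # The ending DOCNO tag may sit on the same line or a later one,
            # so the search runs over the whole document text.
            found = DOCNO_RE.search(raw)
            yield (found.group(1) if found else "", raw)
            buf = None
```

Passing an open text file (or any list of lines) to iter_docs yields one pair per document; text outside DOC brackets is ignored, and a document without a DOCNO field is reported with an empty document number.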

The datasets have all been compressed using the UNIX compress utility, and are stored in chunks of about 1 megabyte each (uncompressed size).

Both as part of the philosophy of leaving the data as close to the original as possible, and because it is impossible to check all the data manually, there are many "errors" in the data. These range from errors in the original data, such as noise in the AP newswires, or other typographical errors, to errors in the reformatting done at the University of Pennsylvania and at NIST. The error-checking has concentrated on allowing readability of the data rather than on correcting content. This means that there have been automated checks for control characters, for correct matching of the beginning and end tags, and for complete DOC and DOCNO fields. The types of "errors" remaining include fragment sentences, strange formatting around tables or other "non-textual" items, misspellings, missing fields (that are generally missing from the data), etc.
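
The three kinds of automated check mentioned above (control characters, matching of beginning and end tags, and complete DOC and DOCNO fields) can be illustrated with a short Python sketch. This is not the checking software used at the University of Pennsylvania or at NIST; the function check_chunk and its messages are hypothetical, and the tag pattern is a simplification that would also flag a stray "<word>" in running text.

```python
import re
from typing import List

# Hypothetical patterns for this sketch only.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)\s*>")
# Control characters other than tab, line feed and carriage return.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def check_chunk(text: str) -> List[str]:
    """Return a list of problems found in one uncompressed chunk of data."""
    problems = []
    if CONTROL_RE.search(text):
        problems.append("control character found")

    stack = []          # names of tags opened but not yet closed
    docno_seen = False  # whether the current DOC has had a DOCNO field
    for tag in TAG_RE.finditer(text):
        closing = tag.group(1) == "/"
        name = tag.group(2).upper()
        if not closing:
            stack.append(name)
            if name == "DOC":
                docno_seen = False
            elif name == "DOCNO":
                docno_seen = True
        elif stack and stack[-1] == name:
            stack.pop()
            if name == "DOC" and not docno_seen:
                problems.append("DOC without DOCNO")
        else:
            problems.append(f"unexpected closing tag </{name}>")

    problems.extend(f"unclosed tag <{name}>" for name in stack)
    return problems
```

An empty list means the chunk passed all three checks. Content problems such as misspellings or fragment sentences are deliberately out of scope, in line with the emphasis on readability rather than on correcting content.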

Special thanks should go to Charles Augustine at the University of Pennsylvania, who patiently changed most of the data from unintelligible formats into formats that can be easily processed by simple conversion programs.