A Note to the User
The material on this disk is copyrighted and is subject to the
terms and conditions of the TIPSTER User Agreement, which must be
signed in order to obtain a copy of the CD-ROM on which this data is
to be found.
The format of this material is not as it came from the publisher,
however. It has been (automatically) analyzed and rewritten in
various ways, to make it more suitable for use in a research
environment. The copyright for these modifications is assigned
to the Association for Computational Linguistics.
These disks are a revision of the first set of disks. The changes
between the previous set of disks and these disks are detailed in
the following files:
   README.doc - A detailed list of changes that were made to disk1,
                disk2 and disk3, as well as to the qrels files.
   README.d1  - A mapping of old document numbers to new document numbers
                which were changed in the Ziff data on disk one.
   README.d2  - A mapping of old document numbers to new document numbers
                which were changed in the Ziff data on disk two.
   README.tag - A mapping of old tag names to new tag names which were
                changed on all three disks.
The format uses a labelled bracketing, expressed in the style
of SGML (Standard Generalized Markup Language). The SGML DTDs
used for verification at NIST are included on the CDs. All five
datasets share the same major structures, which makes them easier
to read, but differ in their minor structures. The philosophy
in the formatting both at the University of Pennsylvania and at NIST
has been to preserve as much of the original structure as possible,
but to provide enough consistency to allow simple decoding of the data.
The major data structures are illustrated below in a sample of
the DOE data.
<DOC>
<DOCNO> DOE1-96-0001 </DOCNO>
One of the weakest aspects of Prolog is in its access to clauses.
This weakness is lamentable as it makes one of Prolog's greatest strengths,
its ability to treat programs as data and data as programs, difficult to
exploit. This paper proposes modifications to Prolog and shows how they
circumvent important problems in Prolog programming in a practical way.
For example, the proposed modifications permit Prolog programs that perform
efficient database query (join) processing, coroutining, and abstract machine
interpretation. These modifications have been used successfully at UCLA,
and should be easy to implement within any existing Prolog system.
</DOC>
Every document is bracketed by <DOC> </DOC> tags and has a unique
document number, bracketed by <DOCNO> </DOCNO> tags. Each beginning tag
starts as the first character of a new line, but the ending tags could
be on the same line or on later lines.
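
The tag layout described above can be decoded with a few lines of code.
The following Python sketch is only an illustration, not a tool shipped
on the disks. It assumes the tags appear literally as <DOC> and <DOCNO>
without attributes; the names iter_documents and sample are made up for
this example, and the second record in the sample is invented.

```python
import re

# Beginning tags start a line; ending tags may follow on the same or a later line.
DOC_RE = re.compile(r"^<DOC>(.*?)</DOC>", re.MULTILINE | re.DOTALL)
DOCNO_RE = re.compile(r"^<DOCNO>(.*?)</DOCNO>", re.MULTILINE | re.DOTALL)


def iter_documents(text):
    """Yield (docno, body) for each <DOC> ... </DOC> block found in text."""
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        # docno is None when a document has no DOCNO field at all.
        yield (docno.group(1).strip() if docno else None), body


# The second record is invented, to show an ending tag on a later line.
sample = """<DOC>
<DOCNO> DOE1-96-0001 </DOCNO>
One of the weakest aspects of Prolog is in its access to clauses.
</DOC>
<DOC>
<DOCNO>
DOE1-96-0002
</DOCNO>
A second, made-up record.
</DOC>
"""

docnos = [docno for docno, _ in iter_documents(sample)]
print(docnos)
```

A regular expression is enough here only because the major structures
are simple and never nest; the minor structures of each dataset would
need their own handling.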
The datasets have all been compressed using the UNIX compress utility,
and are stored in chunks of about 1 megabyte each (uncompressed size).
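
Files written by the UNIX compress utility can be read back by piping
them through a decompressor. The sketch below is one possible way to do
this from Python and rests on assumptions: it expects GNU gzip to be
installed (gzip -dc can also decode the output of compress), and the
function name read_compressed and the file name in the usage line are
hypothetical.

```python
import subprocess


def read_compressed(path, command=("gzip", "-dc")):
    """Return the decompressed bytes of one chunk.

    The default command is an assumption: GNU gzip, which can decode
    files produced by compress. Pass ("uncompress", "-c") instead if
    that utility is what the local system provides.
    """
    result = subprocess.run(
        [*command, str(path)],
        check=True,            # raise CalledProcessError if the tool fails
        capture_output=True,   # collect stdout (the data) and stderr
    )
    return result.stdout
```

For example, read_compressed("chunk.Z") would return the raw bytes of
one chunk, which is about 1 megabyte and so fits comfortably in memory.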
Both as part of the philosophy of leaving the data as close to the
original as possible, and because it is impossible to check all the
data manually, there are many "errors" in the data. These range from
errors in the original data, such as noise in the AP newswires, or
other typographical errors, to errors in the reformatting done at the
University of Pennsylvania and at NIST. The error-checking has concentrated
on allowing readability of the data rather than on correcting content.
This means that there have been automated checks for control characters,
for correct matching of the beginning and end tags, and for complete
DOC and DOCNO fields. The types of "errors" remaining include
fragment sentences, strange formatting around tables or other "non-textual"
items, misspellings, missing fields (generally because they are also
missing from the original data), etc.
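
The three automated checks named above (control characters, matching
beginning and end tags, complete DOC and DOCNO fields) could look roughly
like the following Python sketch. It is not the program used at NIST or
at the University of Pennsylvania; the function name check_chunk, the
set of characters treated as control characters, and the assumption that
tags carry no attributes are all choices made for this illustration.

```python
import re

# Assumption: tags look like <NAME> or </NAME>, with no attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")
# Assumption: tab, newline and carriage return are allowed; other controls are not.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def check_chunk(text):
    """Return a list of readability problems found in one chunk of data."""
    problems = []
    if CONTROL_RE.search(text):
        problems.append("control character found")

    # Match beginning and end tags with a stack.
    stack = []
    for match in TAG_RE.finditer(text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected </{name}>")
    problems.extend(f"unclosed <{name}>" for name in stack)

    # Every complete DOC needs a non-empty DOCNO.
    for doc in re.findall(r"<DOC>(.*?)</DOC>", text, re.DOTALL):
        docno = re.search(r"<DOCNO>(.*?)</DOCNO>", doc, re.DOTALL)
        if docno is None or not docno.group(1).strip():
            problems.append("DOC without a complete DOCNO")
    return problems
```

Like the checks described above, this looks only at whether the data can
be read, so misspellings and fragment sentences pass through untouched.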
Special thanks should go to Charles Augustine at the University of
Pennsylvania, who patiently changed most of the data from unintelligible formats
into formats that can be easily processed by simple conversion programs.