File name: README-muc6-test-proc
Version:   04 Oct 95
Note:      Updates are identified by a vertical bar ("|") in the left
           margin. THE LATEST UPDATE CONCERNS FTP INSTRUCTIONS FOR
           DEPOSITING FILES AT NRAD. Previous updates concern (1)
           commitment to conducting tests, (2) the designation of a
           "basic" test run, for those who are conducting optional
           tests.
______________________________________________________________________

                     PROCEDURE FOR MUC-6 TESTING

1. TEST PACKAGE

The "input" files (test texts) for the four evaluation tasks will be
available in compressed form for ftp from the Linguistic Data
Consortium host on 2 October. Participants will be notified by email
on 2 October about the exact file names. READ THE TEST PROCEDURE
BEFORE YOU UNCOMPRESS THE "INPUTS."

The texts within each file of test articles have been concatenated
(not tar'ed). Each text is identified by a DOCNO tag containing a
unique 10-digit number, e.g., " 930301-0031. ".

The answer keys and scorer configuration files will be available on
10 October for ftp from the "muc6" account on the NRaD MUC-6 host.
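If your system expects one document per file, you may want to split
each concatenated test file on document boundaries before processing.
The following sketch (in Python) is illustrative only: the file name
is hypothetical, and it assumes the test texts carry the same
SGML-style <DOC> and <DOCNO> markup as the MUC-6 training texts.

    import re

    # Hypothetical file name; substitute the actual test file name
    # announced by email on 2 October.
    with open("muc6-test-texts") as fh:
        data =

    # Each article is assumed to be bracketed by <DOC>...</DOC> tags,
    # as in the training texts.
    for doc in re.findall(r"<DOC>.*?</DOC>", data, re.DOTALL):
        # Use the unique DOCNO as the per-document file name.
        match ="<DOCNO>\s*(\S+?)\.?\s*</DOCNO>", doc)
        docno = if match else "unknown"
        with open(docno, "w") as out:
            out.write(doc)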
2. TEST SCHEDULE

You are not to uncompress the "input" files until you are ready to
start testing. Until then, minimize the accessibility of those files,
e.g., put them in a protected directory belonging to someone who is
not directly involved in system development.

  ONCE YOU HAVE UNCOMPRESSED EITHER OF THE TEST SETS, YOU ARE OBLIGATED
| TO COMPLETE ALL THE TESTS THAT YOU HAVE SIGNED UP FOR AND TO SUBMIT
| THE RESULTS.

Testing may be done any time during the week of 2-6 October. The
deadline for completing the test and submitting results is 5:00 p.m.
(Pacific Daylight Time) on Friday, 6 October. You are encouraged to
start your runs early enough to meet the deadline even in the event
of unanticipated hardware or network problems. If such problems
nonetheless prevent you from meeting the deadline, you must resolve
them and submit your results before noon PDT on Saturday, 7 October,
and you must also send an email message to NRaD requesting acceptance
of the late results and explaining the circumstances of the late
submission. NRaD will review appeals Saturday afternoon. (Note:
Software problems are not sufficient grounds for appeal.) Submissions
made after noon on 7 October will not be accepted.

If you intend to carry out any optional testing (see section 3.2
below), you must report the planned optional test(s) to NRaD before
starting the test procedure. This means that you should describe
concisely how you will alter the behavior of the system and what
| kind of performance differences you expect to obtain. Your
| description should also characterize the run that you are
| designating as your "basic" run.

3. TEST COMPONENTS

3.1 BASIC

All tests use Wall Street Journal articles published between January
1993 and June 1994. The Named Entity (NE) task and the Coreference
(CO) task use the same test set, which consists of 30 texts. These 30
texts are a subset of the 100 texts that form the test set for both
the Template Element (TE) and Scenario Template (ST) tasks. All
response files (the outputs of the systems under evaluation) will be
scored against the manually produced key files (answer keys) using
the evaluation software prepared for each of the tasks by SAIC in San
Diego, California.

For analysis and presentation of the results of the NE, TE, and ST
tasks, we will be using both the error-based metrics (Error per
Response Fill (ERR), Undergeneration (UND), Overgeneration (OVG), and
Substitution (SUB)) and the recall/precision-based metrics (Recall
(REC), Precision (PRE), and F-Measure (F)). Statistical significance
testing will be conducted using the ERR metric, the F metric, or
both; we have found the rankings based on those two metrics to be
very consistent.

For analysis and presentation of the results of the CO task, we will
be using the only two metrics currently defined for that task, Recall
and Precision. No statistical significance tests will be conducted.
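For sites that want to sanity-check their own scoring arithmetic
before the official results arrive, the sketch below shows one common
formulation of these metrics in terms of raw counts. It is an
approximation only -- it ignores partial credit -- and the
authoritative definitions are those implemented in the SAIC scoring
software.

    def muc_metrics(cor, inc, mis, spu):
        # cor: correct fills; inc: incorrect fills;
        # mis: missing fills;  spu: spurious fills.
        pos = cor + inc + mis        # possible (key) fills
        act = cor + inc + spu        # actual (response) fills
        rec = cor / pos                                    # REC
        pre = cor / act                                    # PRE
        f   = 2 * pre * rec / (pre + rec)                  # F (equal weighting)
        und = mis / pos                                    # UND
        ovg = spu / act                                    # OVG
        sub = inc / (cor + inc)                            # SUB
        err = (inc + mis + spu) / (cor + inc + mis + spu)  # ERR
        return dict(REC=rec, PRE=pre, F=f, UND=und,
                    OVG=ovg, SUB=sub, ERR=err)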
3.2 OPTIONAL

The objective of the optional testing is to learn more about the
controlled tradeoffs that some systems may be designed to make among
the various metrics. You are encouraged to design your own
experiments in which you hypothesize significant performance
differences that can be obtained by such means as removing a module,
inserting one that is not part of the basic system, or altering the
control structure of the system so that it produces output more
aggressively or more conservatively. An experiment may result in a
single new data point or in a continuous performance "curve".

If your system meets one of the following criteria, it is a candidate
for optional testing:

  a) the system can control performance in order to produce a set of
     data points sufficient to plot the outline of a performance
     curve;

  b) the system's performance can be consciously manipulated by
     loosening or tightening analysis constraints, etc., in order to
     produce at least one data point that contrasts in an interesting
     way with the results of the basic test;

  c) the system's performance can be consciously manipulated by
     substituting one algorithm or method (or set of algorithms or
     methods) for another, e.g., to demonstrate significant
     differences in performance that result from different approaches
     to the task.

All optional runs as defined above will be treated as official
results and will therefore appear in the tables of system rankings
and scores that are included in the conference proceedings. However,
if you conduct many optional tests, or if the scores resulting from
such a test differ little from the results of the system
configuration designated as the "basic" one, you may be asked to
discuss the results in your paper without reporting them in their
entirety.

In addition to or instead of conducting optional runs, you may wish
to conduct other experimental test runs and summarize them in your
paper. As long as you do not want the scores treated as "official,"
you do not need to notify NRaD of such runs in advance.

4. TEST PROCEDURE

4.1 FREEZING THE SYSTEM

When you are ready to run the test, uncompress the "input" file. You
are on your honor not to do this until you have completely frozen
your core system and are ready to start testing. You must stop all
development of the core system once you have uncompressed a test
"input" file. However, if you are participating in more than one
evaluation task, you may continue development of the knowledge bases
of the system(s)/module(s) you do not intend to test first, as long
as you respect the following constraints:

(1) You do not look at the texts in the test set used for the first
test run. It is critical that you not be exposed to the texts used in
the first test run, since the NE/CO test set is a subset of the TE/ST
test set. To eliminate this possibility, you should consider
designating someone who is not a key member of the development team
to run the tests.

(2) You update only those knowledge bases that are completely
independent of the core system and whose contents are not shared with
the contents of the knowledge bases of the system(s)/module(s) to be
tested first. In other words, there must be no possibility that the
updated knowledge bases could affect the processing of the
system(s)/module(s) to be tested first. If, prior to running one of
the remaining tests, you discover problems caused by the updated
knowledge bases, you are not allowed to fix them via updates to
anything other than those knowledge bases.

4.2 RUNNING THE TEST

For each evaluation task that you are participating in, you are to
run the test only once -- you are not permitted to make any changes
to your system until you complete the test. If you get partway
through the test and encounter an error that requires user
intervention, you may intervene only to the extent necessary to
continue processing with the NEXT text. You are not allowed to back
up!

Notes:

1) If you run short on time and wish to break up the test sets and
run portions of them in parallel, that is fine as long as you are
truly running in parallel with a single system or can completely
simulate a parallel environment, i.e., the systems are identically
configured. You must also be sure to concatenate the outputs before
submitting them (see the sketch following these notes).

2) No debugging of linguistic capability may be done when the system
breaks. For example, if your system breaks when it encounters an
unknown word and your only option for a graceful recovery is to
define the word, then abort processing and start it up again on the
next test text.

3) If you get an error that requires rebooting the system, you may do
so, but you must pick up processing with the text FOLLOWING the one
that was being processed when the error occurred. If, in order to
pick up processing at that point, you need to create a new version of
the test set that excludes the texts already processed, or you need
to start a new output file, that is OK. Be sure to concatenate the
output files before submitting them.
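If you do split a test set and run portions in parallel (note 1) or
restart after a reboot (note 3), the required concatenation can be as
simple as the following sketch; all file names in it are
hypothetical.

    # Reassemble partial output files in the original test-set order
    # before submission; the names below are placeholders.
    parts = ["response-part1", "response-part2", "response-part3"]
    with open("mysite-ne-response", "w") as out:
        for part in parts:
            with open(part) as fh:
                out.write(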
4.3 SPECIAL INSTRUCTIONS FOR OPTIONAL TESTING

For each optional run, modify the system as you described in advance
to NRaD. NO SYSTEM DEVELOPMENT IS PERMITTED between official testing
and optional testing -- only modification of system control
parameters and/or reinsertion or deletion of existing code that
affects the system's behavior with respect to the performance
tradeoffs.

5. SCORING THE SYSTEM RESPONSE FILES

After 10 October, when the test package "output" files are available
for ftp, you are invited to make the scoring runs and to report any
scores you feel need to be adjudicated. However, you are NOT REQUIRED
to do the scoring; the templates will be scored for you by the
evaluators. Edit the configuration files to supply the proper
pathnames and file names. Make no further edits to the configuration
files.

6. SUBMITTING FILES TO NRAD (DEADLINE: 5:00 P.M. PDT ON FRI., 6 OCT.)

6.1 WHAT FILES TO SUBMIT

For each evaluation task that you participate in, you are expected to
submit the following:

1. A system response file -- the output produced by your system for
each text, concatenated into a single file.

2. A system trace file -- you may submit whatever you think is
appropriate, i.e., whatever would serve to help validate the results
of testing. Please do not submit files larger than one megabyte if
you can avoid it.

If you conduct optional tests, submit a new response file and trace
file for each run.

6.2 HOW TO NAME AND PACKAGE THE FILES

To enable us to process your files accurately, please follow these
instructions for naming and packaging them:

1. Include your site name, the identifier for the task (NE, CO, TE,
or ST), and the type of file (response or trace) in EACH file name.
If you conduct optional tests, include an indication of the system
configuration used for the run (e.g., "threshold20" or "withwordlist"
or "vanilla").

2. Tar the files for each task together, and compress the tar file.
Include your site name in the tar file name. (Note that this means a
separate tar file will be submitted for each task. This is to enable
those who are participating in more than one evaluation to submit the
results of each one on separate days, minimizing the risks associated
with organizing and submitting all of them at the end of the week.)

6.3 HOW TO SUBMIT THE FILES

Sites will submit their output files via anonymous ftp to a host at
NRaD that is set up with a blind directory called "incoming". Here
are the instructions for using that host (a scripted sketch of the
packaging and deposit steps appears after section 7):

  1. Connect to pojke.nosc.mil (128.49.29.16).
| 2. Log in as userid "anonymous" and enter your email address when
|    you are asked for a password.
  3. Change directories to "incoming" and deposit your files.
  4. You will not be able to list the contents of the directory;
     however, you should be able to verify that you successfully
     deposited a file by using the "ls" command with the complete
     file name as its argument.

If for some reason you are not able to connect to pojke.nosc.mil, you
may deposit your files in the password-protected "muc6" account on
the usual NRaD MUC-6 host, c2wcm.nosc.mil. If you do this, please
send an email message to NRaD identifying the location of your files.

7. ADJUDICATION

The scores for all systems evaluated will be made available to all
participants. If you perceive errors or other problems in the answer
keys that cause scoring penalties against your system, and you wish
to have any of the scores adjudicated, please create a file with your
requests for adjudication and deposit it by anonymous ftp in the
"incoming" directory. Create a separate file for each evaluation
task, and label each one in a way that will allow NRaD to identify it
as an adjudication request for a particular task, e.g.,
"MySite-NE-adjudication.13oct". PLEASE REQUEST ADJUDICATION ONLY IN
CASES WHERE YOU FEEL THE ANSWER KEY IS CLEARLY INCORRECT.

The evaluators will check the incoming directory periodically and
will respond via email to the muc6@cs.nyu.edu list. Requests for
adjudication will be handled as expeditiously as possible.
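As noted in 6.3 above, the packaging and deposit steps can also be
scripted. The sketch below uses Python's tarfile and ftplib modules;
every file name in it is hypothetical, and it assumes your platform
provides a separate "compress" step for the tar file.

    import tarfile
    from ftplib import FTP

    # Package the response and trace files for one task (section
    # 6.2); the file names are placeholders.
    with tarfile.open("mysite-ne.tar", "w") as tar:
        tar.add("mysite-ne-response")
        tar.add("mysite-ne-trace")
    # Compress mysite-ne.tar (e.g., with the Unix "compress"
    # utility, producing mysite-ne.tar.Z) before uploading.

    # Deposit the compressed tar file in the blind "incoming"
    # directory (section 6.3).
    ftp = FTP("pojke.nosc.mil")
    ftp.login("anonymous", "you@your.site")  # email address as password
    ftp.cwd("incoming")
    with open("mysite-ne.tar.Z", "rb") as fh:
        ftp.storbinary("STOR mysite-ne.tar.Z", fh)
    # The directory cannot be listed, but naming the file explicitly
    # should confirm the deposit (the "ls <filename>" check in 6.3).
    print(ftp.nlst("mysite-ne.tar.Z"))
    ftp.quit()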