File name: README-muc6-dryrun-test-procs
Date of last update: 24 Apr 95


An important update has been made in section 2.1 since the 20 Apr 95
version.  The update is indicated by a vertical bar ("|") in the left
margin.

______________________________________________________________________

                PROCEDURE FOR MUC-6 DRY RUN TESTING


1. TEST PACKAGE AND TEST SCHEDULE

The "input" files (test texts) for the three evaluation tasks will be
available in compressed form for ftp from the MUC-6 host on 24 April.
Participants will be notified by email on 24 April about the exact
name and location of these files.  READ THE TEST PROCEDURE BEFORE YOU
UNCOMPRESS THE "INPUTS."

The texts within each file of test articles have been concatenated
(not tar'ed).  The Named Entity task and the Coreference task use the
same test set, which consists of 30 texts.  There are 100 texts in the
Information Extraction test set.

The "output" files (answer-key templates and scorer configuration
files) will be available on 1 May for ftp from the MUC-6 host.  

You are not to uncompress the "input" files until you are ready to
start testing.  Until then, minimize the accessibility of those files,
e.g., put them in a protected directory of someone who is not directly
involved in system development.
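
One way to keep the compressed files out of reach until testing starts
is sketched below in Python.  It is only an illustration, not part of
the official procedure: the function name "quarantine", the directory
name, and the file name used in the usage note are all hypothetical
(the real file names will be announced by email), and the sketch
assumes a Unix-style file system with owner/group/other permissions.

```python
import os
import shutil
import stat
from pathlib import Path


def quarantine(compressed_file: str, protected_dir: str) -> Path:
    """Move a still-compressed test file into an owner-only directory.

    Hypothetical helper: nothing is uncompressed or read here.
    """
    target_dir = Path(protected_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Set owner-only access (rwx------) explicitly, since the default
    # mode of a newly created directory is usually broader.
    os.chmod(target_dir, stat.S_IRWXU)
    destination = target_dir / Path(compressed_file).name
    moved = Path(shutil.move(compressed_file, str(destination)))
    # Leave the file itself readable by its owner only (r--------).
    os.chmod(moved, stat.S_IRUSR)
    return moved
```

For example, quarantine("ne-co-texts.Z", "/home/custodian/muc6-hold")
would be run by the person holding the files, not by a developer.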

ONCE YOU HAVE UNCOMPRESSED THE TEST SET, YOU ARE OBLIGATED TO COMPLETE
THE TEST AND SUBMIT THE RESULTS.  We have tried to make the dry run as
non-threatening as possible by adopting a policy of anonymity.  Please
see section 4 of this test procedure.

Testing may be done any time during the week of 24 April.  The nominal
deadline for completing the test and submitting results is Friday, 28
April, and you are encouraged to meet that deadline.  However, there
is also an absolute deadline, which is 0900 EDT on Monday morning, 1
May.


2. TEST PROCEDURE

2.1 FREEZING THE SYSTEM

When you are ready to run the test, uncompress the "input" file. You
are on your honor not to do this until you have completely frozen your
"core" system and are ready to start testing.  You must stop all
development of the core system once you have uncompressed a test
"input" file.

However, if you are participating in more than one evaluation task,
you may continue development of the knowledge bases of the
|system(s)/module(s) you do not intend to test first, as long as you
|respect the following constraints:
|
|  (1) You do not look at the texts in the test set used for the first
|test run.  It is critical that you not be exposed to the texts used in
|the first test run, since the Named Entity/Coreference test set is a
|subset of the Information Extraction test set.  To obviate this
|possibility, you should consider designating someone to run the tests
|who is not a key member of the development team.
|
|  (2) You update only those knowledge bases that are completely
|independent of the core system and whose contents are not shared with
the contents of the knowledge bases of the system(s)/module(s) to be
tested first.  In other words, there must be no possibility that the
updated knowledge bases could affect the processing of the
system(s)/module(s) to be tested first.  If, prior to running one of
the remaining tests, you discover problems caused by the updated
knowledge bases, you are not allowed to fix them via updates to
anything other than those knowledge bases.

2.2 RUNNING THE TEST

For each evaluation task that you are participating in, you are to run
the test only once -- you are not permitted to make any changes to
your system until you complete the test.  If you get part way through
the test and get an error that requires user intervention, you may
intervene only to the extent that you are able to continue processing
with the NEXT text.  You are not allowed to back up!

   Notes:  1) If you run short on time and wish to break up the test sets
              and run portions of them in parallel, that's fine as long
              as you are truly running in parallel with a single 
              system or can completely simulate a parallel environment,
              i.e., the systems are identically configured.  You must
              also be sure to concatenate the outputs before submitting
              them.

           2) No debugging of linguistic capability can be done when
              the system breaks.  For example, if your system breaks
              when it encounters an unknown word and your only option
              for a graceful recovery is to define the word, then
              abort processing and start it up again on the next test 
              text.

           3) If you get an error that requires that you reboot the
              system, you may do so, but you must pick up processing
              with the text FOLLOWING the one that was being
              processed when the error occurred.  If, in order to pick
              up processing at that point, you need to create a new
              version of the test set that excludes the texts already
              processed or you need to start a new output file, that's
              ok.  Be sure to concatenate the output files before
              submitting them.
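
The run-once rule and the concatenation step in the notes above can be
sketched in Python as follows.  This is a minimal illustration under
stated assumptions, not a required driver: "run_once", "concatenate",
and the "process" callback (standing in for your frozen system) are
hypothetical names, and the sketch assumes that each text can be
handed to the system as a string and that each response is text.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def run_once(texts: Iterable[str],
             process: Callable[[str], str],
             out_path: str) -> List[int]:
    """Process each text exactly once, in order.

    If the system fails on a text, nothing is retried or fixed:
    processing continues with the NEXT text.  Returns the positions
    of the texts that failed.
    """
    failed = []
    with open(out_path, "a", encoding="utf-8") as out:
        for index, text in enumerate(texts):
            try:
                response = process(text)
            except Exception:
                # No debugging and no backing up: skip this text.
                failed.append(index)
                continue
            out.write(response)
    return failed


def concatenate(parts: List[str], combined_path: str) -> None:
    """Join partial output files, in the order given, into one file."""
    with open(combined_path, "w", encoding="utf-8") as combined:
        for part in parts:
            combined.write(Path(part).read_text(encoding="utf-8"))
```

If a reboot or a parallel run leaves you with several output files,
concatenate(["part1.out", "part2.out"], "all.out") produces the single
file to submit; the part names here are placeholders.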


3. SCORING THE SYSTEM RESPONSE FILES

After 1 May, when the test package "output" files are available for
ftp, you are invited to make the scoring runs described below and to
report any scores you feel need to be adjudicated.  However, you are
NOT REQUIRED to do the scoring; the templates will be scored for you
by the evaluators.

Edit the configuration files to supply the proper pathnames and file
names.  Make no further edits to the configuration files.


4. SUBMITTING FILES TO NRAD (DEADLINE: 0900 EDT ON MONDAY, 1 MAY)

For each evaluation task that you participate in, you are expected to
submit the following:

  1. A system response file -- The output produced by your system for
each text, concatenated into a single file.

  2. A system trace file -- You may submit whatever you think is
appropriate, i.e., whatever would serve to help validate the results
of testing, if this were the formal run of the evaluation.  For the
dry run, this is merely an exercise to get you thinking about what
would constitute an appropriate trace.

The results of the dry run are intended to be completely anonymous,
since the dry run is intended as a check on the evaluation tasks and
procedures, not as a check on the quality of the participants'
systems.  To guarantee anonymity to the fullest extent possible, we
will make extensive use of anonymous ftp, as follows:

  1. Sites will submit their output files via anonymous ftp to a host
at NRaD that is set up with a blind directory called "incoming".  

     The host is pojke.nosc.mil (128.49.29.16).  Log in as userid
"anonymous" and hit <Return> when you are asked for a password.
Change directories to "incoming" and deposit your files.  You will not
be able to list the contents of the directory; however, you should be
able to verify that you successfully deposited a file by using the
"ls" command and including the complete file name as argument to it.

  2. Each site is asked to come up with a fictitious site ID, such as
"WhizKidz", "Adonis", "Timbuktu", or whatever name you want to hear
mentioned in any materials that come out of the dry run.  MAKE UP A
DIFFERENT SITE ID FOR EACH EVALUATION TASK THAT YOU PARTICIPATE IN.
DO NOT TELL ANYONE OUTSIDE YOUR DEVELOPMENT GROUP WHAT YOUR SITE IDS
ARE.  Note: If your system generates comments within the output files
at runtime that would give clues as to the identity of your site,
please delete them from the output files.

  3. The output file names should contain all three of the following
elements:
     - the fictitious site ID for a given evaluation task,
     - an indication of the evaluation task (NE for Named Entity, CO
for Coreference, TE for the Template Element subtask of Information
Extraction, or ST for the Scenario Template subtask of Information
Extraction),
     - an indication of the nature of the output (such as "response"
for response files, "trace" for trace files).

  4. For example, if you participate in both Named Entity and
Information Extraction and you choose to identify your site as
"WhizKidz" for Named Entity and "Adonis" for Information Extraction,
you would submit files named something like these:

     WhizKidz.NE.response
     Adonis.TE.response
     Adonis.ST.response

     WhizKidz.NE.trace
     Adonis.TE.trace
     Adonis.ST.trace
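
The naming scheme in items 3 and 4 can be sketched as a small Python
helper.  The helper name "output_file_name" is hypothetical, and the
use of "." as the separator is an assumption taken from the example
names above, which are only given as "something like these"; any name
that contains all three elements satisfies item 3.

```python
# Task codes listed in item 3 of the naming rules.
TASK_CODES = {"NE", "CO", "TE", "ST"}


def output_file_name(site_id: str, task: str, kind: str) -> str:
    """Build a file name from site ID, task code, and kind of output.

    "kind" is free-form, e.g. "response" or "trace".
    """
    if task not in TASK_CODES:
        raise ValueError(f"unknown task code: {task!r}")
    if not site_id or "." in site_id:
        # A "." in the site ID would make the three parts ambiguous.
        raise ValueError("site ID must be non-empty and contain no '.'")
    return f"{site_id}.{task}.{kind}"


print(output_file_name("WhizKidz", "NE", "response"))
print(output_file_name("Adonis", "ST", "trace"))
```

Remember that the site ID passed in should differ for each evaluation
task you participate in.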


5. ADJUDICATION AND REPORTING

If you perceive errors or other problems in the answer keys that cause
scoring penalties against your system, and you wish to have any of the
scores adjudicated, please create a file with your requests for
adjudication, and deposit the file by anonymous ftp in the incoming
directory.  Create different files for each evaluation task, and label
each one in a way that will allow NRaD to identify it as an
adjudication request for a particular task, e.g.,
"WhizKidz-NE-adjudication".  PLEASE REQUEST ADJUDICATION ONLY IN CASES
WHERE YOU FEEL THE ANSWER KEY IS CLEARLY INCORRECT.
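
Depositing a file in the blind "incoming" directory, and confirming
the deposit by asking for the file by its complete name, might be
scripted as in the Python sketch below.  It uses the standard ftplib
module; the helper name "deposit" is hypothetical.  One known
difference from the manual steps in section 4: ftplib's default
anonymous login sends "anonymous@" as the password rather than an
empty one, and whether the host accepts that is an assumption.

```python
from ftplib import FTP


def deposit(ftp: FTP, local_path: str, remote_name: str) -> bool:
    """Upload one file into "incoming" and check for it by name.

    "ftp" is an already logged-in connection.  Returns True if the
    server reports the file when asked for it by its full name.
    """
    ftp.cwd("incoming")
    with open(local_path, "rb") as handle:
        ftp.storbinary(f"STOR {remote_name}", handle)
    # The directory cannot be listed as a whole, so request only
    # this one file name.
    listing = ftp.nlst(remote_name)
    return any(entry.endswith(remote_name) for entry in listing)


# Usage sketch (not executed here; needs network access):
# with FTP("pojke.nosc.mil") as ftp:   # host named in section 4
#     ftp.login()                      # anonymous login
#     ok = deposit(ftp, "WhizKidz-NE-adjudication",
#                  "WhizKidz-NE-adjudication")
```

The same helper would serve for the response and trace files
described in section 4.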

The evaluators will check the incoming directory periodically, and
will respond via email to the muc6-annotators list.  Requests for
adjudication will be handled as expeditiously as possible.  The
evaluators reserve the right to present preliminary results at the
upcoming Tipster meeting in May if there isn't time to do adjudication
before then.

The scores for all systems evaluated in the dry run will be made
available to all participants in the dry run.  PARTICIPANTS MAY
DISCUSS THE RESULTS OF THE DRY RUN AS LONG AS THEY DO NOT IDENTIFY
WHICH RESULTS ARE THEIR OWN AND DO NOT SPECULATE ON THE IDENTITY OF
SITES RESPONSIBLE FOR ANY OTHER RESULTS.