File name: README-muc6-dryrun-test-procs
Date of last update: 24 Apr 95

An important update has been made in section 2.1 since the 20 Apr 95
version.  The update is indicated by a vertical bar ("|") in the left
margin.
______________________________________________________________________

                 PROCEDURE FOR MUC-6 DRY RUN TESTING

1.  TEST PACKAGE AND TEST SCHEDULE

The "input" files (test texts) for the three evaluation tasks will be
available in compressed form for ftp from the MUC-6 host on 24 April.
Participants will be notified by email on 24 April about the exact
name and location of these files.  READ THE TEST PROCEDURE BEFORE YOU
UNCOMPRESS THE "INPUTS."

The texts within each file of test articles have been concatenated
(not tar'ed).  The Named Entity task and the Coreference task use the
same test set, which consists of 30 texts.  There are 100 texts in the
Information Extraction test set.

The "output" files (answer-key templates and scorer configuration
files) will be available on 1 May for ftp from the MUC-6 host.

You are not to uncompress the "inputs" files until you are ready to
start testing.  Until then, minimize the accessibility of those files,
e.g., put them in a protected directory of someone who is not directly
involved in system development.

ONCE YOU HAVE UNCOMPRESSED THE TEST SET, YOU ARE OBLIGATED TO COMPLETE
THE TEST AND SUBMIT THE RESULTS.  We have tried to make the dry run as
non-threatening as possible by adopting a policy of anonymity.  Please
see section 4 of this test procedure.

Testing may be done any time during the week of 24 April.  The nominal
deadline for completing the test and submitting results is Friday,
28 April, and you are encouraged to meet that deadline.  However,
there is also an absolute deadline, which is 0900 EDT on Monday
morning, 1 May.

2.  TEST PROCEDURE

2.1  FREEZING THE SYSTEM

When you are ready to run the test, uncompress the "input" file.
You are on your honor not to do this until you have completely frozen
your "core" system and are ready to start testing.

You must stop all development of the core system once you have
uncompressed a test "input" file.  However, if you are participating
in more than one evaluation task, you may continue development of the
knowledge bases of the
|system(s)/module(s) you do not intend to test first, as long as you
|respect the following constraints:
|
|    (1) You do not look at the texts in the test set used for the first
|test run.  It is critical that you not be exposed to the texts used in
|the first test run, since the Named Entity/Coreference test set is a
|subset of the Information Extraction test set.  To obviate this
|possibility, you should consider designating someone to run the tests
|who is not a key member of the development team.
|
|    (2) You update only those knowledge bases that are completely
|independent of the core system and whose contents are not shared with
the contents of the knowledge bases of the system(s)/module(s) to be
tested first.  In other words, there must be no possibility that the
updated knowledge bases could affect the processing of the
system(s)/module(s) to be tested first.  If, prior to running one of
the remaining tests, you discover problems caused by the updated
knowledge bases, you are not allowed to fix them via updates to
anything other than those knowledge bases.

2.2  RUNNING THE TEST

For each evaluation task that you are participating in, you are to run
the test only once -- you are not permitted to make any changes to
your system until you complete the test.  If you get part way through
the test and get an error that requires user intervention, you may
intervene only to the extent that you are able to continue processing
with the NEXT text.  You are not allowed to back up!
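The run-once rule above can be pictured as a simple driver loop.  The
following is only a sketch under stated assumptions, not a required or
official harness: run_test_once, process_text and toy_system are
hypothetical names, and the test set is assumed to have already been
split into (document ID, text) pairs.

```python
from typing import Callable, Iterable, List, Tuple


def run_test_once(
    texts: Iterable[Tuple[str, str]],
    process_text: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Process each (doc_id, text) pair exactly once, in order.

    Returns (responses, skipped_ids).  A text whose processing raises
    an error is recorded as skipped and is never retried; the loop
    resumes with the NEXT text.
    """
    responses: List[str] = []
    skipped: List[str] = []
    for doc_id, text in texts:
        try:
            responses.append(process_text(text))
        except Exception:
            # No debugging, no retry, no backing up: note the failure
            # and continue with the following text.
            skipped.append(doc_id)
    return responses, skipped


def toy_system(text: str) -> str:
    """Hypothetical stand-in for a frozen system; fails on one input."""
    if "unknown" in text:
        raise ValueError("unknown word")
    return text.upper()


responses, skipped = run_test_once(
    [("1", "alpha"), ("2", "unknown word"), ("3", "beta")],
    toy_system,
)
# Concatenate the per-text outputs into a single response string.
response_file_contents = "\n".join(responses)
```

The list of skipped document IDs is the kind of information that could
go into a trace file; what belongs there is left to each site.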
Notes:

    1) If you run short on time and wish to break up the test sets and
    run portions of them in parallel, that's fine as long as you are
    truly running in parallel with a single system or can completely
    simulate a parallel environment, i.e., the systems are identically
    configured.  You must also be sure to concatenate the outputs
    before submitting them.

    2) No debugging of linguistic capability can be done when the
    system breaks.  For example, if your system breaks when it
    encounters an unknown word and your only option for a graceful
    recovery is to define the word, then abort processing and start it
    up again on the next test text.

    3) If you get an error that requires that you reboot the system,
    you may do so, but you must pick up processing with the text
    FOLLOWING the one that was being processed when the error
    occurred.  If, in order to pick up processing at that point, you
    need to create a new version of the test set that excludes the
    texts already processed or you need to start a new output file,
    that's ok.  Be sure to concatenate the output files before
    submitting them.

3.  SCORING THE SYSTEM RESPONSE FILES

After 1 May, when the test package "output" files are available for
ftp, you are invited to make the scoring runs described below and to
report any scores you feel need to be adjudicated.  However, you are
NOT REQUIRED to do the scoring; the templates will be scored for you
by the evaluators.

Edit the configuration files to supply the proper pathnames and file
names.  Make no further edits to the configuration files.

4.  SUBMITTING FILES TO NRAD (DEADLINE: 0900 EDT ON MONDAY, 1 MAY)

For each evaluation task that you participate in, you are expected to
submit the following:

    1. A system response file -- The output produced by your system
    for each text, concatenated into a single file.

    2. A system trace file -- You may submit whatever you think is
    appropriate, i.e., whatever would serve to help validate the
    results of testing, if this were the formal run of the evaluation.
    For the dry run, this is merely an exercise to get you thinking
    about what would constitute an appropriate trace.

The results of the dry run are intended to be completely anonymous,
since the dry run is intended as a check on the evaluation tasks and
procedures, not as a check on the quality of the participants'
systems.  To guarantee anonymity to the fullest extent possible, we
will make extensive use of anonymous ftp, as follows:

    1. Sites will submit their output files via anonymous ftp to a
    host at NRaD that is set up with a blind directory called
    "incoming".  The host is pojke.nosc.mil (128.49.29.16).  Log in as
    userid "anonymous" and hit Return when you are asked for a
    password.  Change directories to "incoming" and deposit your
    files.  You will not be able to list the contents of the
    directory; however, you should be able to verify that you
    successfully deposited a file by using the "ls" command and
    including the complete file name as argument to it.

    2. Each site is asked to come up with a fictitious site ID, such
    as "WhizKidz", "Adonis", "Timbuktu", or whatever name you want to
    hear mentioned in any materials that come out of the dry run.
    MAKE UP A DIFFERENT SITE ID FOR EACH EVALUATION TASK THAT YOU
    PARTICIPATE IN.  DO NOT TELL ANYONE OUTSIDE YOUR DEVELOPMENT GROUP
    WHAT YOUR SITE ID'S ARE.

    Note: If your system generates comments within the output files
    at runtime that would give clues as to the identity of your site,
    please delete them from the output files.

    3. The output file names should contain all three of the following
    elements:

        - the fictitious site ID for a given evaluation task,
        - an indication of the evaluation task (NE for Named Entity,
          CO for Coreference, TE for the Template Element subtask of
          Information Extraction, or ST for the Scenario Template
          subtask of Information Extraction),
        - an indication of the nature of the output (such as
          "response" for response files, "trace" for trace files).

    4. For example, if you participate in both Named Entity and
    Information Extraction and you choose to identify your site as
    "WhizKidz" for Named Entity and "Adonis" for Information
    Extraction, you would submit files named something like these:

        WhizKidz.NE.response
        Adonis.TE.response
        Adonis.ST.response
        WhizKidz.NE.trace
        Adonis.TE.trace
        Adonis.ST.trace

5.  ADJUDICATION AND REPORTING

If you perceive errors or other problems in the answer keys that cause
scoring penalties against your system, and you wish to have any of the
scores adjudicated, please create a file with your requests for
adjudication, and deposit the file by anonymous ftp in the incoming
directory.  Create different files for each evaluation task, and label
each one in a way that will allow NRaD to identify it as an
adjudication request for a particular task, e.g.,
"WhizKidz-NE-adjudication".  PLEASE REQUEST ADJUDICATION ONLY IN CASES
WHERE YOU FEEL THE ANSWER KEY IS CLEARLY INCORRECT.

The evaluators will check the incoming directory periodically, and
will respond via email to the muc6-annotators list.  Requests for
adjudication will be handled as expeditiously as possible.  The
evaluators reserve the right to present preliminary results at the
upcoming Tipster meeting in May if there isn't time to do adjudication
before then.

The scores for all systems evaluated in the dry run will be made
available to all participants in the dry run.

PARTICIPANTS MAY DISCUSS THE RESULTS OF THE DRY RUN AS LONG AS THEY DO NOT IDENTIFY WHICH RESULTS ARE THEIR OWN AND DO NOT SPECULATE ON THE IDENTITY OF SITES RESPONSIBLE FOR ANY OTHER RESULTS.
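
The file-naming conventions of sections 4 and 5 can be sketched as two
small helpers.  This is an illustration only: the function names are
hypothetical, and the "." and "-" separators are assumed from the
examples given in this procedure ("WhizKidz.NE.response",
"WhizKidz-NE-adjudication"), which are themselves offered only as
"something like these".

```python
# Task codes listed in section 4, item 3.
TASK_CODES = {"NE", "CO", "TE", "ST"}


def _check(site_id: str, task: str) -> None:
    """Reject an empty site ID or an unrecognized task code."""
    if not site_id:
        raise ValueError("site ID must not be empty")
    if task not in TASK_CODES:
        raise ValueError(f"unknown task code: {task!r}")


def output_file_name(site_id: str, task: str, kind: str) -> str:
    """Build a name such as 'WhizKidz.NE.response'.

    'kind' describes the nature of the output, e.g. "response" or
    "trace"; other descriptive values are not ruled out here.
    """
    _check(site_id, task)
    if not kind:
        raise ValueError("output kind must not be empty")
    return f"{site_id}.{task}.{kind}"


def adjudication_file_name(site_id: str, task: str) -> str:
    """Build a name such as 'WhizKidz-NE-adjudication'."""
    _check(site_id, task)
    return f"{site_id}-{task}-adjudication"


names = [
    output_file_name("WhizKidz", "NE", "response"),
    output_file_name("Adonis", "TE", "trace"),
    adjudication_file_name("WhizKidz", "NE"),
]
```

Generating the names from one place makes it less likely that a
response file and its trace file end up under different site IDs.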