File name: README-muc6-test-proc
Version:   04 Oct 95
Note:      Updates are identified by a vertical bar ("|") in the left
           margin. THE LATEST UPDATE CONCERNS FTP INSTRUCTIONS FOR
           DEPOSITING FILES AT NRAD. Previous updates concern (1)
           commitment to conducting tests, (2) the designation of a
           "basic" test run, for those who are conducting optional
           tests.
______________________________________________________________________

                     PROCEDURE FOR MUC-6 TESTING

1. TEST PACKAGE

The "input" files (test texts) for the four evaluation tasks will be
available in compressed form for ftp from the Linguistic Data
Consortium host on 2 October. Participants will be notified by email
on 2 October about the exact file names. READ THE TEST PROCEDURE
BEFORE YOU UNCOMPRESS THE "INPUTS."

The texts within each file of test articles have been concatenated
(not tar'ed). Each text is identified by a DOCNO tag containing a
unique 10-digit number, e.g., " 930301-0031. ".

The answer keys and scorer configuration files will be available on
10 October for ftp from the "muc6" account on the NRaD MUC-6 host.
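If your system expects one document per file, you may want to split
each concatenated test file on document boundaries before processing.
The following sketch (in Python) is illustrative only: the file name
is hypothetical, and it assumes the test texts carry the same
SGML-style <DOC> and <DOCNO> markup as the MUC-6 training texts.

    import re

    # Hypothetical file name; substitute the actual test file name
    # announced by email on 2 October.
    with open("muc6-test-texts") as fh:
        data =

    # Each article is assumed to be bracketed by <DOC>...</DOC> tags,
    # as in the training texts.
    for doc in re.findall(r"<DOC>.*?</DOC>", data, re.DOTALL):
        # Use the unique DOCNO as the per-document file name.
        match ="<DOCNO>\s*(\S+?)\.?\s*</DOCNO>", doc)
        docno = if match else "unknown"
        with open(docno, "w") as out:
            out.write(doc)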
2. TEST SCHEDULE

You are not to uncompress the "input" files until you are ready to
start testing. Until then, minimize the accessibility of those files,
e.g., put them in a protected directory belonging to someone who is
not directly involved in system development.

  ONCE YOU HAVE UNCOMPRESSED EITHER OF THE TEST SETS, YOU ARE OBLIGATED
| TO COMPLETE ALL THE TESTS THAT YOU HAVE SIGNED UP FOR AND TO SUBMIT
| THE RESULTS.

Testing may be done any time during the week of 2-6 October. The
deadline for completing the test and submitting results is 5:00 p.m.
(Pacific Daylight Time) on Friday, 6 October. You are encouraged to
start your runs early enough to meet the deadline even in the event
of unanticipated hardware or network problems. If such problems
nonetheless prevent you from meeting the deadline, you must resolve
them and submit your results before noon PDT on Saturday, 7 October,
and you must also send an email message to NRaD requesting acceptance
of the late results and explaining the circumstances of the late
submission. NRaD will review appeals Saturday afternoon. (Note:
Software problems are not sufficient grounds for appeal.) Submissions
made after noon on 7 October will not be accepted.

If you intend to carry out any optional testing (see section 3.2
below), you must report the planned optional test(s) to NRaD before
starting the test procedure. This means that you should describe
concisely how you will alter the behavior of the system and what
| kind of performance differences you expect to obtain. Your
| description should also characterize the run that you are
| designating as your "basic" run.

3. TEST COMPONENTS

3.1 BASIC

All tests use Wall Street Journal articles published between January
1993 and June 1994. The Named Entity (NE) task and the Coreference
(CO) task use the same test set, which consists of 30 texts. These 30
texts are a subset of the 100 texts that form the test set for both
the Template Element (TE) and Scenario Template (ST) tasks. All
response files (the outputs of the systems under evaluation) will be
scored against the manually produced key files (answer keys) using
the evaluation software prepared for each of the tasks by SAIC in San
Diego, California.

For analysis and presentation of the results of the NE, TE, and ST
tasks, we will be using both the error-based metrics (Error per
Response Fill (ERR), Undergeneration (UND), Overgeneration (OVG), and
Substitution (SUB)) and the recall/precision-based metrics (Recall
(REC), Precision (PRE), and F-Measure (F)). Statistical significance
testing will be conducted using the ERR metric, the F metric, or
both; we have found the rankings based on those two metrics to be
very consistent.

For analysis and presentation of the results of the CO task, we will
be using the only two metrics currently defined for that task, Recall
and Precision. No statistical significance tests will be conducted.
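For sites that want to sanity-check their own scoring arithmetic
before the official results arrive, the sketch below shows one common
formulation of these metrics in terms of raw counts. It is an
approximation only -- it ignores partial credit -- and the
authoritative definitions are those implemented in the SAIC scoring
software.

    def muc_metrics(cor, inc, mis, spu):
        # cor: correct fills; inc: incorrect fills;
        # mis: missing fills;  spu: spurious fills.
        pos = cor + inc + mis        # possible (key) fills
        act = cor + inc + spu        # actual (response) fills
        rec = cor / pos                                    # REC
        pre = cor / act                                    # PRE
        f   = 2 * pre * rec / (pre + rec)                  # F (equal weighting)
        und = mis / pos                                    # UND
        ovg = spu / act                                    # OVG
        sub = inc / (cor + inc)                            # SUB
        err = (inc + mis + spu) / (cor + inc + mis + spu)  # ERR
        return dict(REC=rec, PRE=pre, F=f, UND=und,
                    OVG=ovg, SUB=sub, ERR=err)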
3.2 OPTIONAL

The objective of the optional testing is to learn more about the
controlled tradeoffs that some systems may be designed to make among
the various metrics. You are encouraged to design your own
experiments in which you hypothesize significant performance
differences that can be obtained by such means as removing a module,
inserting one that is not part of the basic system, or altering the
control structure of the system so that it produces output more
aggressively or more conservatively. An experiment may result in a
single new data point or in a continuous performance "curve".

If your system meets one of the following criteria, it is a candidate
for optional testing:

  a) the system can control performance in order to produce a set of
     data points sufficient to plot the outline of a performance
     curve;

  b) the system's performance can be consciously manipulated by
     loosening or tightening analysis constraints, etc., in order to
     produce at least one data point that contrasts in an interesting
     way with the results of the basic test;

  c) the system's performance can be consciously manipulated by
     substituting one algorithm or method (or set of algorithms or
     methods) for another, e.g., to demonstrate significant
     differences in performance that result from different approaches
     to the task.

All optional runs as defined above will be treated as official
results and will therefore appear in the tables of system rankings
and scores that are included in the conference proceedings. However,
if you conduct many optional tests, or if the scores resulting from
such a test differ little from the results of the system
configuration designated as the "basic" one, you may be asked to
discuss the results in your paper without reporting them in their
entirety.

In addition to or instead of conducting optional runs, you may wish
to conduct other experimental test runs and summarize them in your
paper. As long as you do not want the scores treated as "official,"
you do not need to notify NRaD of such runs in advance.

4. TEST PROCEDURE

4.1 FREEZING THE SYSTEM

When you are ready to run the test, uncompress the "input" file. You
are on your honor not to do this until you have completely frozen
your core system and are ready to start testing. You must stop all
development of the core system once you have uncompressed a test
"input" file. However, if you are participating in more than one
evaluation task, you may continue development of the knowledge bases
of the system(s)/module(s) you do not intend to test first, as long
as you respect the following constraints:

(1) You do not look at the texts in the test set used for the first
test run. It is critical that you not be exposed to the texts used in
the first test run, since the NE/CO test set is a subset of the TE/ST
test set. To eliminate this possibility, you should consider
designating someone who is not a key member of the development team
to run the tests.

(2) You update only those knowledge bases that are completely
independent of the core system and whose contents are not shared with
the contents of the knowledge bases of the system(s)/module(s) to be
tested first. In other words, there must be no possibility that the
updated knowledge bases could affect the processing of the
system(s)/module(s) to be tested first. If, prior to running one of
the remaining tests, you discover problems caused by the updated
knowledge bases, you are not allowed to fix them via updates to
anything other than those knowledge bases.

4.2 RUNNING THE TEST

For each evaluation task that you are participating in, you are to
run the test only once -- you are not permitted to make any changes
to your system until you complete the test. If you get partway
through the test and encounter an error that requires user
intervention, you may intervene only to the extent necessary to
continue processing with the NEXT text. You are not allowed to back
up!

Notes:

1) If you run short on time and wish to break up the test sets and
run portions of them in parallel, that is fine as long as you are
truly running in parallel with a single system or can completely
simulate a parallel environment, i.e., the systems are identically
configured. You must also be sure to concatenate the outputs before
submitting them (see the sketch following these notes).

2) No debugging of linguistic capability may be done when the system
breaks. For example, if your system breaks when it encounters an
unknown word and your only option for a graceful recovery is to
define the word, then abort processing and start it up again on the
next test text.

3) If you get an error that requires rebooting the system, you may do
so, but you must pick up processing with the text FOLLOWING the one
that was being processed when the error occurred. If, in order to
pick up processing at that point, you need to create a new version of
the test set that excludes the texts already processed, or you need
to start a new output file, that is OK. Be sure to concatenate the
output files before submitting them.
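If you do split a test set and run portions in parallel (note 1) or
restart after a reboot (note 3), the required concatenation can be as
simple as the following sketch; all file names in it are
hypothetical.

    # Reassemble partial output files in the original test-set order
    # before submission; the names below are placeholders.
    parts = ["response-part1", "response-part2", "response-part3"]
    with open("mysite-ne-response", "w") as out:
        for part in parts:
            with open(part) as fh:
                out.write(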
4.3 SPECIAL INSTRUCTIONS FOR OPTIONAL TESTING

For each optional run, modify the system as you described in advance
to NRaD. NO SYSTEM DEVELOPMENT IS PERMITTED between official testing
and optional testing -- only modification of system control
parameters and/or reinsertion or deletion of existing code that
affects the system's behavior with respect to the performance
tradeoffs.

5. SCORING THE SYSTEM RESPONSE FILES

After 10 October, when the test package "output" files are available
for ftp, you are invited to make the scoring runs and to report any
scores you feel need to be adjudicated. However, you are NOT REQUIRED
to do the scoring; the templates will be scored for you by the
evaluators. Edit the configuration files to supply the proper
pathnames and file names. Make no further edits to the configuration
files.

6. SUBMITTING FILES TO NRAD (DEADLINE: 5:00 P.M. PDT ON FRI., 6 OCT.)

6.1 WHAT FILES TO SUBMIT

For each evaluation task that you participate in, you are expected to
submit the following:

1. A system response file -- the output produced by your system for
each text, concatenated into a single file.

2. A system trace file -- you may submit whatever you think is
appropriate, i.e., whatever would serve to help validate the results
of testing. Please do not submit files larger than one megabyte if
you can avoid it.

If you conduct optional tests, submit a new response file and trace
file for each run.

6.2 HOW TO NAME AND PACKAGE THE FILES

To enable us to process your files accurately, please follow these
instructions for naming and packaging them:

1. Include your site name, the identifier for the task (NE, CO, TE,
or ST), and the type of file (response or trace) in EACH file name.
If you conduct optional tests, include an indication of the system
configuration used for the run (e.g., "threshold20" or "withwordlist"
or "vanilla").

2. Tar the files for each task together, and compress the tar file.
Include your site name in the tar file name. (Note that this means a
separate tar file will be submitted for each task. This is to enable
those who are participating in more than one evaluation to submit the
results of each one on separate days, minimizing the risks associated
with organizing and submitting all of them at the end of the week.)

6.3 HOW TO SUBMIT THE FILES

Sites will submit their output files via anonymous ftp to a host at
NRaD that is set up with a blind directory called "incoming". Here
are the instructions for using that host (a scripted sketch of the
packaging and deposit steps appears after section 7):

  1. Connect to pojke.nosc.mil (128.49.29.16).
| 2. Log in as userid "anonymous" and enter your email address when
|    you are asked for a password.
  3. Change directories to "incoming" and deposit your files.
  4. You will not be able to list the contents of the directory;
     however, you should be able to verify that you successfully
     deposited a file by using the "ls" command with the complete
     file name as its argument.

If for some reason you are not able to connect to pojke.nosc.mil, you
may deposit your files in the password-protected "muc6" account on
the usual NRaD MUC-6 host, c2wcm.nosc.mil. If you do this, please
send an email message to NRaD identifying the location of your files.

7. ADJUDICATION

The scores for all systems evaluated will be made available to all
participants. If you perceive errors or other problems in the answer
keys that cause scoring penalties against your system, and you wish
to have any of the scores adjudicated, please create a file with your
requests for adjudication and deposit it by anonymous ftp in the
"incoming" directory. Create a separate file for each evaluation
task, and label each one in a way that will allow NRaD to identify it
as an adjudication request for a particular task, e.g.,
"MySite-NE-adjudication.13oct". PLEASE REQUEST ADJUDICATION ONLY IN
CASES WHERE YOU FEEL THE ANSWER KEY IS CLEARLY INCORRECT.

The evaluators will check the incoming directory periodically and
will respond via email to the muc6@cs.nyu.edu list. Requests for
adjudication will be handled as expeditiously as possible.
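As noted in 6.3 above, the packaging and deposit steps can also be
scripted. The sketch below uses Python's tarfile and ftplib modules;
every file name in it is hypothetical, and it assumes your platform
provides a separate "compress" step for the tar file.

    import tarfile
    from ftplib import FTP

    # Package the response and trace files for one task (section
    # 6.2); the file names are placeholders.
    with tarfile.open("mysite-ne.tar", "w") as tar:
        tar.add("mysite-ne-response")
        tar.add("mysite-ne-trace")
    # Compress mysite-ne.tar (e.g., with the Unix "compress"
    # utility, producing mysite-ne.tar.Z) before uploading.

    # Deposit the compressed tar file in the blind "incoming"
    # directory (section 6.3).
    ftp = FTP("pojke.nosc.mil")
    ftp.login("anonymous", "you@your.site")  # email address as password
    ftp.cwd("incoming")
    with open("mysite-ne.tar.Z", "rb") as fh:
        ftp.storbinary("STOR mysite-ne.tar.Z", fh)
    # The directory cannot be listed, but naming the file explicitly
    # should confirm the deposit (the "ls <filename>" check in 6.3).
    print(ftp.nlst("mysite-ne.tar.Z"))
    ftp.quit()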