File: README.txt
Date: November 3, 2010

This directory contains the system input files for the Arabic, English
and Mandarin 2006 Spoken Term Detection Evaluation Test Set. The
Broadcast News and Conversational Telephone Speech data is licensed
through the Linguistic Data Consortium and the AMI meeting data is
licensed through the AMI project. The LDC license text can be found in
the 'licenses/STD_2006_Eval_Agreement-v2.pdf' file and the AMI license
can be found at corpus.amiproject.com.

The evaluation task specification, directory structure explanation, and
file format definitions can be found in Appendix A of the STD Evaluation
Plan 'doc/std06-evalplan-v10.pdf' and the errata
'doc/std06-evalplan-v10-errata-v2.pdf'. Appendix A describes the overall
structure of the data resources.

The following specific resources are supplied in this release.

1. System Input Experimental Control Files (ECF):

   The system input ECF files are located in the 'indices' directory.
   These files define the full extent of the excerpts to be processed by
   the system. They are the same ECF files as used by participants in
   the 2006 evaluation. Users wanting to replicate the 2006 evaluation
   should use these files as system inputs.

   There is a separate file for each language:

      Arabic:   expt_06_std_eval06_arab_all_spch_expt_2.ecf.xml
      English:  expt_06_std_eval06_eng_all_spch_expt_1.ecf.xml
      Mandarin: expt_06_std_eval06_mand_all_spch_expt_1.ecf.xml

2. Scoring Input Experimental Control Files (ECF):

   The scoring ECF files are located in the 'indices' directory. These
   ECF files define the extent of scorable material in the test data.
   Unlike the system input ECF file, the scoring ECF contains excerpts
   defining the evaluable regions of the recordings. These files should
   not be used in any way by the system, because the scoring ECF files
   were built by extracting information from the human annotations.

   There is a separate file for each language:

      Arabic:   expt_06_std_eval06_arab_all_spch_expt_2.scoring.ecf.xml
      English:  expt_06_std_eval06_eng_all_spch_expt_1.scoring.ecf.xml
      Mandarin: expt_06_std_eval06_mand_all_spch_expt_1.scoring.ecf.xml

3. System Input Term Lists:

   The system input term lists are located in the 'indices' directory.
   These files define just the terms a system must search for. No other
   information about the terms is provided to the system. These files
   were used by evaluation participants in 2006. Users wanting to
   replicate the 2006 evaluation should use these files as system
   inputs.

   The following term lists are provided:

      Arabic diacritized terms:
         expt_06_std_eval06_arab_all_spch_expt_1.dia.tlist.xml
      Arabic non-diacritized terms:
         expt_06_std_eval06_arab_all_spch_expt_2.nondia.tlist.xml

         Note: Removing diacritics creates lexical ambiguity, so there
         is no 1:1 mapping between the diacritized and non-diacritized
         terms.

      English:  expt_06_std_eval06_eng_all_spch_expt_1.tlist.xml
      Mandarin: expt_06_std_eval06_mand_all_spch_expt_1.tlist.xml

4. Scoring Input Term Lists:

   The scoring term list files are located in the 'indices' directory.
   These files define the terms for the evaluation AND additional
   annotations for scoring system outputs. Unlike the system input ECF
   file, the scoring ECF contains excerpts defining the evaluable
   regions of the recordings. These files should not be used in any way
   by the system.

      Arabic diacritized terms:
         expt_06_std_eval06_arab_all_spch_expt_1.dia.annot.tlist.xml
      Arabic non-diacritized terms:
         expt_06_std_eval06_arab_all_spch_expt_2.nondia.annot.tlist.xml

         Note: Removing diacritics creates lexical ambiguity, so there
         is no 1:1 mapping between the diacritized and non-diacritized
         terms.

      English:  expt_06_std_eval06_eng_all_spch_expt_1.annot+syll.tlist.xml
      Mandarin: expt_06_std_eval06_mand_all_spch_expt_2.annot+syll.tlist.xml

5. Scoring Software and Example Scoring Run:

   The scoring software is provided in the compressed tar file
   'software/STDEval-0.7'. Installation instructions are in the README
   files. The 'doc' directory contains a set of example data files
   showing how to run the scoring code. Execute the scoring run by
   changing directory to 'doc' and then executing the command
   'sh random.com'. This command file shows the options used during the
   2006 evaluation.
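
Example: checking the system input files before a run

   The split between system input files and scoring-only files described
   in items 1 through 4 can be checked with a small script before a
   system is run. The following is a minimal sketch, not part of the
   release. The file names are copied from this README. The helper names
   (SYSTEM_INPUTS, is_scoring_only, system_input_paths, missing_files)
   are hypothetical, and the rule used to recognize scoring-only files
   is an assumption inferred from the file names listed above
   ('.scoring.ecf.xml' and '.annot' in the name), not something defined
   by the evaluation plan.

```python
from pathlib import Path

# System input files per language, as listed in items 1 and 3 of this README.
# All of them live in the 'indices' directory of the release.
SYSTEM_INPUTS = {
    "arabic": {
        "ecf": "expt_06_std_eval06_arab_all_spch_expt_2.ecf.xml",
        "tlist_dia": "expt_06_std_eval06_arab_all_spch_expt_1.dia.tlist.xml",
        "tlist_nondia": "expt_06_std_eval06_arab_all_spch_expt_2.nondia.tlist.xml",
    },
    "english": {
        "ecf": "expt_06_std_eval06_eng_all_spch_expt_1.ecf.xml",
        "tlist": "expt_06_std_eval06_eng_all_spch_expt_1.tlist.xml",
    },
    "mandarin": {
        "ecf": "expt_06_std_eval06_mand_all_spch_expt_1.ecf.xml",
        "tlist": "expt_06_std_eval06_mand_all_spch_expt_1.tlist.xml",
    },
}


def is_scoring_only(name: str) -> bool:
    """Return True for files a system must not read.

    Assumption: scoring ECFs end in '.scoring.ecf.xml' and scoring term
    lists contain '.annot', as in the file names listed in this README.
    """
    return name.endswith(".scoring.ecf.xml") or ".annot" in name


def system_input_paths(language: str, release_root: str = ".") -> list[Path]:
    """Return the system input paths for one language under 'indices'."""
    indices = Path(release_root) / "indices"
    paths = [indices / name for name in SYSTEM_INPUTS[language].values()]
    # Guard against a scoring-only file slipping into the input table.
    if any(is_scoring_only(p.name) for p in paths):
        raise ValueError("scoring-only file listed as a system input")
    return paths


def missing_files(language: str, release_root: str = ".") -> list[Path]:
    """Return the system input files that are not present on disk."""
    return [p for p in system_input_paths(language, release_root) if not p.is_file()]
```

   Calling missing_files("english", release_root) with the path to this
   directory returns an empty list when both English system input files
   are in place. The scoring ECFs and annotated term lists are left out
   of the table on purpose, since they were built from the human
   annotations.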