DARPA TIDES Machine Translation 2004 Evaluation Sets
====================================================

This set contains the evaluation sets (source data and human reference
translations), DTDs, scoring software, and evaluation plans from the
DARPA TIDES Machine Translation 2004 Evaluation. Please refer to the
evaluation plan included in this package for details on how the
evaluation was run.

A test set consists of two files, a source file and a reference file.
A reference file contains four independent reference translations of
the data set unless noted otherwise under "Package Contents" below.
The file name reflects the evaluation year, the source language, the
test set (which, by default, is "evalset"), the version of the data,
and whether the file is a source or a reference file (the latter is
indicated by "-ref").

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test
data until 2008 and XML-formatted test data thereafter. The test sets
in this package are provided in both formats.

Please contact mt_poc@nist.gov with questions.


Package Contents
----------------

README.txt          This file.

Evaluation plan:    DARPATIDESMT04EvalPlan_v2-1.pdf

Test sets:          Arabic-to-English
                    Chinese-to-English

DTD:                mteval-v1.1.dtd

Scoring utility:    mteval-v11a.pl


Data Set Statistics
-------------------

Data genres:
    nw = newswire
    ps = prepared speech

Test set                 Genre  Source/Ref  Documents  Segments  Tokens
MT04_Arabic-to-English   nw     source            150      1075   24257
MT04_Arabic-to-English   nw     ahd               150      1075   30892
MT04_Arabic-to-English   nw     ahi               150      1075   32224
MT04_Arabic-to-English   nw     ahj               150      1075   32656
MT04_Arabic-to-English   nw     ahm               150      1075   32411
MT04_Arabic-to-English   ps     source             50       278    8237
MT04_Arabic-to-English   ps     ahd                50       278   11612
MT04_Arabic-to-English   ps     ahi                50       278   11969
MT04_Arabic-to-English   ps     ahj                50       278   11875
MT04_Arabic-to-English   ps     ahm                50       278   12230
MT04_Chinese-to-English  nw     source            150      1350   56697
MT04_Chinese-to-English  nw     cha               150      1350   39970
MT04_Chinese-to-English  nw     chc               150      1350   39069
MT04_Chinese-to-English  nw     che               150      1350   39141
MT04_Chinese-to-English  nw     chf               150      1350   40304
MT04_Chinese-to-English  ps     source             50       438   18707
MT04_Chinese-to-English  ps     cha                50       438   14764
MT04_Chinese-to-English  ps     chc                50       438   13956
MT04_Chinese-to-English  ps     che                50       438   13943
MT04_Chinese-to-English  ps     chf                50       438   14359

The token counts for Chinese data are "character" counts, which were
obtained by counting tokens matching the UNICODE-based regular
expression "\w". The token counts for all other languages included
here are "word" counts, which were obtained by counting tokens
matching the UNICODE-based regular expression "\w+". The Python "re"
module was used to obtain these counts.
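
The counting rule described above can be illustrated with a minimal
sketch. This is not the script used to produce the table. The names
count_tokens, CHAR_TOKEN, and WORD_TOKEN are hypothetical, and the
sketch assumes Python 3 (where "\w" is Unicode-aware by default for
str patterns) and plain segment text as input. How the original
counts handled SGML/XML markup is not stated in this README.

```python
import re

# Assumption: Python 3, where str patterns treat \w as Unicode-aware
# by default (equivalent to passing re.UNICODE explicitly).
CHAR_TOKEN = re.compile(r"\w")    # one token per word character
WORD_TOKEN = re.compile(r"\w+")   # one token per run of word characters


def count_tokens(text: str, per_character: bool = False) -> int:
    """Count "\\w" matches (character counts) or "\\w+" matches (word counts)."""
    pattern = CHAR_TOKEN if per_character else WORD_TOKEN
    return len(pattern.findall(text))


if __name__ == "__main__":
    # Word count, as described for the non-Chinese data.
    print(count_tokens("The quick brown fox."))               # 4
    # Character count, as described for the Chinese data.
    print(count_tokens("机器翻译评测", per_character=True))     # 6
```

Note that "\w" also matches digits and the underscore, and that
punctuation is never counted under either pattern.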