NIST Open Machine Translation 2008 Evaluation Sets
==================================================

This set contains the evaluation sets (source data and human reference
translations), DTDs, scoring software, and evaluation plans from the
Current tests (Progress tests are not included) of the NIST Open Machine
Translation 2008 Evaluation. Please refer to the evaluation plan included
in this package for details on how the evaluation was run.

A test set consists of two files, a source file and a reference file.
Each reference file contains four independent translations of the data
set unless noted otherwise under "Package Contents" below. The file name
reflects the evaluation year, the source language, the test set (which,
by default, is "evalset"), the version of the data, and whether the file
is a source or a reference file (reference files are marked with "-ref").

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data
until 2008 and XML-formatted test data thereafter. The test sets in this
package are provided in both formats.

Please contact mt_poc@nist.gov with questions.


Package Contents
----------------

README.txt
    This file.

Evaluation plan:
    NISTOpenMT08EvalPlan_v2-4.pdf

Test sets:
    Arabic-to-English, Current test
    Chinese-to-English, Current test
    English-to-Chinese, Current test
    Urdu-to-English, Current test

Scoring utility and supporting scripts:
    mteval-v11b-2008-01-23.tar.gz
    mteval-v12.pl
    splitUTF8Characters.c
    splitUTF8Characters.pl

DTD:
    mteval-v1.2.dtd
    mteval-xml-v1.2.dtd


Data Set Statistics
-------------------

Data genres:
    nw = newswire
    wb = web data

Test set                 Genre  Source/Ref   Documents  Segments  Tokens
MT08_Arabic-to-English   nw     source       74         813       19935
MT08_Arabic-to-English   nw     reference_1  74         813       26074
MT08_Arabic-to-English   nw     reference_2  74         813       27676
MT08_Arabic-to-English   nw     reference_3  74         813       26941
MT08_Arabic-to-English   nw     reference_4  74         813       25330
MT08_Arabic-to-English   wb     source       50         547       14532
MT08_Arabic-to-English   wb     reference_1  50         547       20013
MT08_Arabic-to-English   wb     reference_2  50         547       21155
MT08_Arabic-to-English   wb     reference_3  50         547       20816
MT08_Arabic-to-English   wb     reference_4  50         547       19351
MT08_Chinese-to-English  nw     source       76         691       27045
MT08_Chinese-to-English  nw     reference_1  76         691       21689
MT08_Chinese-to-English  nw     reference_2  76         691       20460
MT08_Chinese-to-English  nw     reference_3  76         691       19577
MT08_Chinese-to-English  nw     reference_4  76         691       20693
MT08_Chinese-to-English  wb     source       33         666       19846
MT08_Chinese-to-English  wb     reference_1  33         666       16645
MT08_Chinese-to-English  wb     reference_2  33         666       15870
MT08_Chinese-to-English  wb     reference_3  33         666       15728
MT08_Chinese-to-English  wb     reference_4  33         666       17418
MT08_English-to-Chinese  nw     source       129        1859      41088
MT08_English-to-Chinese  nw     reference_1  129        1859      68630
MT08_English-to-Chinese  nw     reference_2  129        1859      71697
MT08_English-to-Chinese  nw     reference_3  129        1859      73545
MT08_English-to-Chinese  nw     reference_4  129        1859      74979
MT08_Urdu-to-English     nw     source       83         708       20029
MT08_Urdu-to-English     nw     reference_1  83         708       17134
MT08_Urdu-to-English     nw     reference_2  83         708       16363
MT08_Urdu-to-English     nw     reference_3  83         708       17044
MT08_Urdu-to-English     nw     reference_4  83         708       17044
MT08_Urdu-to-English     wb     source       49         1156      20041
MT08_Urdu-to-English     wb     reference_1  49         1156      17879
MT08_Urdu-to-English     wb     reference_2  49         1156      17974
MT08_Urdu-to-English     wb     reference_3  49         1156      18265
MT08_Urdu-to-English     wb     reference_4  49         1156      18286

The token counts for Chinese data are "character" counts, which were
obtained by counting tokens matching the UNICODE-based regular expression
"\w". The token counts for all other languages included here are "word"
counts, which were obtained by counting tokens matching the UNICODE-based
regular expression "\w+". The Python "re" module was used to obtain these
counts.
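
The counting script itself is not part of this package, so the following
is only a minimal sketch of how the two counts described above could be
reproduced with the Python "re" module. The function name count_tokens and
its "chinese" parameter are hypothetical. How the text was extracted from
the SGML/XML files before counting is not stated here, so the sketch takes
plain text as input. It also assumes that "UNICODE-based" refers to the
re.UNICODE flag, which Python 3 already applies to str patterns by default.

```python
import re

# Assumption: "UNICODE-based" in the README corresponds to re.UNICODE.
# The flag is redundant for str patterns in Python 3 and is spelled out
# only to mirror the README's wording.
CHAR_PATTERN = re.compile(r"\w", re.UNICODE)   # one match per word character
WORD_PATTERN = re.compile(r"\w+", re.UNICODE)  # one match per run of word characters


def count_tokens(text: str, chinese: bool = False) -> int:
    """Count tokens in plain text.

    Hypothetical helper: counts single word characters when chinese=True
    ("character" counts) and runs of word characters otherwise ("word"
    counts).
    """
    pattern = CHAR_PATTERN if chinese else WORD_PATTERN
    return len(pattern.findall(text))


if __name__ == "__main__":
    print(count_tokens("Hello, world!"))             # word count
    print(count_tokens("你好，世界", chinese=True))    # character count
```

Punctuation is not matched by "\w", so it is excluded from both counts.
With "\w+", an apostrophe splits a word into two tokens, which may be one
reason the word counts differ from those of other tokenizers.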