NIST Open Machine Translation 2009 Evaluation Sets
==================================================

This set contains the evaluation sets (source data and human reference
translations), DTDs, scoring software, and evaluation plans from the
Current tests (Progress tests are not included) of the NIST Open Machine
Translation 2009 Evaluation. Please refer to the evaluation plan
included in this package for details on how the evaluation was run.

A test set consists of two files, a source and a reference file. Each
reference file contains four independent translations of the data set.
The evaluation year, source language, test set (which, by default, is
"evalset"), version of the data, and source vs. reference file (with the
latter being indicated by "-ref") are reflected in the file name. A
reference file contains four independent reference translations unless
noted otherwise under "Package Contents" below.

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data
until 2008 and XML-formatted test data thereafter. The test sets in this
package are provided in both formats.

Please contact mt_poc@nist.gov with questions.


Package Contents
----------------

README.txt        This file.
Evaluation plan:  NISTOpenMT09EvalPlan_v2d.pdf
Test sets:        Arabic-to-English, Current test
                  Urdu-to-English, Current test
Scoring utility:  mteval-v13a-20091001.tar.gz
DTD:              mteval-xml-v1.3.dtd


Data Set Statistics
-------------------

Data genres:
  nw = newswire
  wb = web data

Test set                Genre  Source/Ref   Documents  Segments  Tokens
MT09_Arabic-to-English  nw     source       68         586       15845
MT09_Arabic-to-English  nw     reference01  68         586       21470
MT09_Arabic-to-English  nw     reference02  68         586       21389
MT09_Arabic-to-English  nw     reference03  68         586       20337
MT09_Arabic-to-English  nw     reference04  68         586       20534
MT09_Arabic-to-English  wb     source       67         727       15102
MT09_Arabic-to-English  wb     reference01  67         727       21574
MT09_Arabic-to-English  wb     reference02  67         727       20933
MT09_Arabic-to-English  wb     reference03  67         727       22242
MT09_Arabic-to-English  wb     reference04  67         727       21301
MT09_Urdu-to-English    nw     source       72         764       20167
MT09_Urdu-to-English    nw     reference01  72         764       17560
MT09_Urdu-to-English    nw     reference02  72         764       18229
MT09_Urdu-to-English    nw     reference03  72         764       16744
MT09_Urdu-to-English    nw     reference04  72         764       16078
MT09_Urdu-to-English    wb     source       166        1028      20016
MT09_Urdu-to-English    wb     reference01  166        1028      19022
MT09_Urdu-to-English    wb     reference02  166        1028      19441
MT09_Urdu-to-English    wb     reference03  166        1028      18957
MT09_Urdu-to-English    wb     reference04  166        1028      18115

The token counts for Chinese data are "character" counts, which were
obtained by counting tokens matching the UNICODE-based regular
expression "\w". The token counts for all other languages included here
are "word" counts, which were obtained by counting tokens matching the
UNICODE-based regular expression "\w+". The Python "re" module was used
to obtain these counts.
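
The counting method described above can be sketched roughly as follows.
This is an illustration only, not the script used to produce the
statistics: it assumes Python 3 (where "str" patterns are Unicode-aware
by default), the function names are hypothetical, and this README does
not say how the text was extracted from the SGML/XML files before
counting, so results on the raw files may differ from the table.

```python
import re

# Unicode-aware patterns; in Python 3, str patterns match Unicode
# word characters by default (no re.UNICODE flag needed).
WORD_RE = re.compile(r"\w+")  # "word" tokens: runs of word characters
CHAR_RE = re.compile(r"\w")   # "character" tokens: single word characters


def count_word_tokens(text: str) -> int:
    """Count maximal runs of word characters (hypothetical helper)."""
    return len(WORD_RE.findall(text))


def count_char_tokens(text: str) -> int:
    """Count individual word characters (hypothetical helper)."""
    return len(CHAR_RE.findall(text))


if __name__ == "__main__":
    # Punctuation and whitespace are not word characters, so they
    # separate tokens and are not counted themselves.
    print(count_word_tokens("The quick, brown fox."))
    print(count_char_tokens("中文 测试"))
```

Note that "\w" also matches digits and the underscore, and that an
apostrophe splits a word into two tokens under this scheme.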

NOTE: At the time this package was prepared, the following documents
from the Arabic-to-English test set were sequestered for purposes of the
DARPA GALE program, to remain sequestered until the end of the program:

  AAW_ARB_20070601.0045-S1
  AAW_ARB_20070625.0007-S1
  AAW_ARB_20070626.0033-S1
  AFP_ARB_20070602.0055-S1
  AFP_ARB_20070610.0044-S1
  AFP_ARB_20070621.0095-S1
  AFP_ARB_20070622.0033-S1
  AHR_ARB_20070606.0027-S1
  AHR_ARB_20070613.0053-S2
  AHR_ARB_20070626.0039-S1
  HYT_ARB_20070603.0063-S1
  HYT_ARB_20070616.0026-S1
  HYT_ARB_20070624.0076-S1
  XIN_ARB_20070627.0192-S1
  arb-NG-2-76513-7428240-S1
  arb-NG-3-113171-8736074-S1
  arb-NG-31-125296-7108159-S1
  arb-WL-1-152155-8412214-S1
  arb-WL-1-152395-8439233-S1
  arb-WL-1-152503-8412899-S1
  arb-WL-1-152503-8412907-S1
  arb-WL-1-152602-7790697-S1
  arb-WL-1-152768-8431180-S1
  arb-WL-1-152788-8472519-S1
  arb-WL-1-152938-7795359-S1
  arb-WL-1-153200-7246734-S1
  arb-WL-1-154596-7271776-S1
  arb-WL-1-154914-7769518-S1