NIST Open Machine Translation 2008 to 2012 Progress Test Sets
=============================================================

This set contains the evaluation sets (source data and human reference
translations), DTD, scoring software, and evaluation plans for the
Arabic-to-English and Chinese-to-English Progress tests of the NIST Open
Machine Translation 2008, 2009, and 2012 Evaluations. The test data
remained unseen between evaluations and was reused unchanged each time.

The test sets consist of newswire and web data from July 2007. Please
refer to the evaluation plans included in this package for more details.

A test set consists of two files: a source file and a reference file.
Each reference file contains four independent human reference
translations of the source data. The test sets in this package are in
XML format compliant with the included DTD.

Please contact mt_poc@nist.gov with questions. Please visit the NIST
OpenMT website, http://www.nist.gov/itl/iad/mig/openmt.cfm, for general
information on the NIST OpenMT evaluations.

Package Contents
----------------

README.txt - this file

Evaluation plans:
  OpenMT08_EvalPlan.pdf
  OpenMT09_EvalPlan.pdf
  OpenMT12_EvalPlan.pdf

Scoring utility:
  mteval-v13a-20091001.tar.gz

DTD:
  mteval-xml-v1.6.dtd

Test sets (src = source, ref = human reference translations):
  Arabic-to-English:  OpenMT08-12_Progress_ara2eng-[src|ref].xml
  Chinese-to-English: OpenMT08-12_Progress_chi2eng-[src|ref].xml

Data Set Statistics
-------------------

Data genres:
  nw = newswire
  wb = web data

  Source    Genre    Documents    Segments    Source tokens
  Arabic    nw       84           784         20039
  Arabic    wb       51           594         14793
  Chinese   nw       82           688         26923
  Chinese   wb       40           682         19112

The token counts for the Chinese data are "character" counts, which were
obtained by counting tokens matching the Unicode-based regular
expression "\w". The token counts for all other languages included here
are "word" counts, which were obtained by counting tokens matching the
Unicode-based regular expression "\w+".
The Python "re" module was used to obtain these counts.
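The counting procedure above can be sketched as follows. This is a minimal illustration of the described method, not the actual NIST counting script; the function name and the sample strings are hypothetical. In Python 3, "\w" is Unicode-aware by default for str patterns, so it matches CJK characters as well as Latin word characters.

```python
import re

def count_tokens(text: str, lang: str) -> int:
    """Count source tokens as described above: per-character matches of
    \\w for Chinese (character counts), whole-token matches of \\w+
    otherwise (word counts). Hypothetical helper, for illustration."""
    pattern = r"\w" if lang == "Chinese" else r"\w+"
    # In Python 3, str patterns are Unicode-aware by default,
    # so \w matches CJK characters as well as [A-Za-z0-9_].
    return len(re.findall(pattern, text))

# Hypothetical sample segments, not drawn from the actual test sets:
print(count_tokens("The quick brown fox", "English"))  # 4 word tokens
print(count_tokens("新闻报道", "Chinese"))              # 4 character tokens
```

Punctuation is excluded in both modes, since "\w" does not match punctuation characters.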