NIST Open Machine Translation 2012 Progress Tests
=================================================
New Source Arabic, Chinese, Dari, Farsi, Korean
===============================================

This set contains the evaluation sets (source data and human reference
translations), DTD, scoring software, and evaluation plan for the OpenMT12
test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel
data set. This set is based on a subset of the Arabic-to-English and
Chinese-to-English Progress tests from the NIST Open Machine Translation
2008, 2009, and 2012 evaluations with new source data created based on the
English human reference translation reference_1. The original data consists
of Newswire and Web data from July 2007.

The new source based on the reference translation data were created in five
languages (Arabic, Chinese, Dari, Farsi, Korean) and in two styles:

  "English-true": A more English-oriented translation; requires that the
                  text reads well and does not use any idiomatic expressions
                  in the foreign language to convey meaning, unless
                  absolutely necessary

  "Foreign-true": A translation as close as possible to the foreign
                  language, as if the text had originated in that language

The new source data were created by the Defense Language Institute. Please
refer to the evaluation plan included in this package for more details.

A test set consists of two files, a source and a reference file. Each
reference file contains four independent human reference translations of the
source data. The test sets in this package are in XML format compliant with
the included DTD.

Please contact mt_poc@nist.gov with questions. Please visit the NIST OpenMT
website, http://www.nist.gov/itl/iad/mig/openmt.cfm, for general information
on the NIST OpenMT evaluations.

Package Contents
----------------

README.txt - this file

Evaluation plan: OpenMT12_EvalPlan.pdf
Scoring utility: mteval-v13a-20091001.tar.gz
DTD:             mteval-xml-v1.6.dtd

Test sets (src = source, ref = human reference translations):

  Arabic-to-English:
    OpenMT12_Current_ara2eng-[src-englishtrue|src-foreigntrue|ref-englishtrue|ref-foreigntrue].xml
  Chinese-to-English:
    OpenMT12_Current_chi2eng-[src-englishtrue|src-foreigntrue|ref-englishtrue|ref-foreigntrue].xml
  Dari-to-English:
    OpenMT12_Current_dar2eng-[src-englishtrue|src-foreigntrue|ref-englishtrue|ref-foreigntrue].xml
  Farsi-to-English:
    OpenMT12_Current_far2eng-[src-englishtrue|src-foreigntrue|ref-englishtrue|ref-foreigntrue].xml
  Korean-to-English:
    OpenMT12_Current_kor2eng-[src-englishtrue|src-foreigntrue|ref-englishtrue|ref-foreigntrue].xml

Data Set Statistics
-------------------

Data genres:
  nw = newswire
  wb = web data

Source                  Genre   Documents   Segments   Source tokens
Arabic English-true     nw             84        763           19902
Arabic English-true     wb             59        774           17179
Arabic Foreign-true     nw             84        763           19854
Arabic Foreign-true     wb             59        774           17124
Chinese English-true    nw             84        763           35939
Chinese English-true    wb             59        774           32694
Chinese Foreign-true    nw             84        763           36400
Chinese Foreign-true    wb             59        774           32632
Dari English-true       nw             84        763           25372
Dari English-true       wb             59        774           22453
Dari Foreign-true       nw             84        763           25025
Dari Foreign-true       wb             59        774           23244
Farsi English-true      nw             84        763           25266
Farsi English-true      wb             59        774           22701
Farsi Foreign-true      nw             84        763           25118
Farsi Foreign-true      wb             59        774           22062
Korean English-true     nw             84        763           16206
Korean English-true     wb             59        774           15017
Korean Foreign-true     nw             84        763           16236
Korean Foreign-true     wb             59        774           15277

The token counts for Chinese data are "character" counts, which were
obtained by counting tokens matching the UNICODE-based regular expression
"\w". The token counts for all other languages included here are "word"
counts, which were obtained by counting tokens matching the UNICODE-based
regular expression "\w+".
The Python "re" module was used to obtain these counts.
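
The counting rule described above can be sketched in a few lines of Python.
This is an illustration only, not the script used to produce the table: the
function name count_tokens and the "language" argument are hypothetical, and
the sketch assumes the segment text has already been extracted from the XML
source file as a plain Unicode string. How the original counts treated
markup or whitespace is not stated in this README.

```python
import re

# "\w" counts each word character separately ("character" counts, Chinese).
# "\w+" counts each maximal run of word characters ("word" counts, others).
# re.UNICODE is the default for str patterns in Python 3; it is spelled out
# here to mirror the "UNICODE-based" wording of the README.
CHAR_PATTERN = re.compile(r"\w", re.UNICODE)
WORD_PATTERN = re.compile(r"\w+", re.UNICODE)


def count_tokens(text, language):
    """Return the token count for one piece of source text.

    Hypothetical helper: 'language' is a plain label such as "Chinese" or
    "Korean"; only "Chinese" selects the per-character pattern.
    """
    pattern = CHAR_PATTERN if language == "Chinese" else WORD_PATTERN
    return len(pattern.findall(text))


if __name__ == "__main__":
    # Punctuation and spaces never match "\w", so they are not counted.
    print(count_tokens("你好,世界。", "Chinese"))
    print(count_tokens("안녕하세요 세계", "Korean"))
```

Summing count_tokens over all segments of a source file would give a
per-file total comparable in kind to the "Source tokens" column, under the
assumptions stated above.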