NIST Open Machine Translation 2006 Evaluation Sets
==================================================

This set contains the evaluation sets (source data and human reference
translations), DTDs, scoring software, and evaluation plans from the NIST
Open Machine Translation 2006 Evaluation. Please refer to the evaluation
plan included in this package for details on how the evaluation was run.

A test set consists of two files, a source file and a reference file. The
file name reflects the evaluation year, source language, test set (which,
by default, is "evalset"), version of the data, and whether the file is
the source or the reference (the latter indicated by "-ref"). A reference
file contains four independent reference translations unless noted
otherwise under "Package Contents" below.

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data
until 2008 and XML-formatted test data thereafter. The test sets in this
package are provided in both formats.

Please contact mt_poc@nist.gov with questions.

Package Contents
----------------

README.txt          This file.
Evaluation plan:    NISTOpenMT06EvalPlan_v4.pdf

Test sets:          Arabic-to-English, NIST part
                    Arabic-to-English, GALE part (2 reference translations)
                    Chinese-to-English, NIST part
                    Chinese-to-English, GALE part (2 reference translations)

Scoring utility:    mteval-v11b-2008-01-23.tar.gz

DTD:                mteval-v1.2.dtd

Data Set Statistics
-------------------

Data genres:  bc = broadcast conversation
              bn = broadcast news
              nw = newswire
              wb = web data

Test set                      Genre  Source/Ref  Documents  Segments  Tokens
MT06_Arabic-to-English_GALE   bc     source             13       529   10743
MT06_Arabic-to-English_GALE   bc     nvtc               13       529   15166
MT06_Arabic-to-English_GALE   bn     source             24       956   11386
MT06_Arabic-to-English_GALE   bn     nvtc               24       956   16272
MT06_Arabic-to-English_GALE   nw     source             51       474   10431
MT06_Arabic-to-English_GALE   nw     nvtc               51       474   13935
MT06_Arabic-to-English_GALE   wb     source             29       527    8705
MT06_Arabic-to-English_GALE   wb     nvtc               29       527   12904
MT06_Arabic-to-English_NIST   bn     source             13       268    5886
MT06_Arabic-to-English_NIST   bn     ahn                13       268    8056
MT06_Arabic-to-English_NIST   bn     ahp                13       268    8477
MT06_Arabic-to-English_NIST   bn     ahq                13       268    7926
MT06_Arabic-to-English_NIST   bn     ahr                13       268    8058
MT06_Arabic-to-English_NIST   nw     source             73       765   20649
MT06_Arabic-to-English_NIST   nw     ahn                73       765   26143
MT06_Arabic-to-English_NIST   nw     ahp                73       765   28139
MT06_Arabic-to-English_NIST   nw     ahq                73       765   27145
MT06_Arabic-to-English_NIST   nw     ahr                73       765   26896
MT06_Arabic-to-English_NIST   wb     source             18       764    9960
MT06_Arabic-to-English_NIST   wb     ahn                18       764   13877
MT06_Arabic-to-English_NIST   wb     ahp                18       764   14840
MT06_Arabic-to-English_NIST   wb     ahq                18       764   14108
MT06_Arabic-to-English_NIST   wb     ahr                18       764   13519
MT06_Chinese-to-English_GALE  bc     source             11       979   18501
MT06_Chinese-to-English_GALE  bc     nvtc               11       979   13083
MT06_Chinese-to-English_GALE  bn     source             18       518   19060
MT06_Chinese-to-English_GALE  bn     nvtc               18       518   14155
MT06_Chinese-to-English_GALE  nw     source             36       364   14112
MT06_Chinese-to-English_GALE  nw     nvtc               36       364   10469
MT06_Chinese-to-English_GALE  wb     source             19       415   13049
MT06_Chinese-to-English_GALE  wb     nvtc               19       415   10571
MT06_Chinese-to-English_NIST  bn     source             15       565   17906
MT06_Chinese-to-English_NIST  bn     chf                15       565   11845
MT06_Chinese-to-English_NIST  bn     chg                15       565   11639
MT06_Chinese-to-English_NIST  bn     chh                15       565   12152
MT06_Chinese-to-English_NIST  bn     chi                15       565   12128
MT06_Chinese-to-English_NIST  nw     source             52       616   27705
MT06_Chinese-to-English_NIST  nw     chf                52       616   20192
MT06_Chinese-to-English_NIST  nw     chg                52       616   21467
MT06_Chinese-to-English_NIST  nw     chh                52       616   20055
MT06_Chinese-to-English_NIST  nw     chi                52       616   21181
MT06_Chinese-to-English_NIST  wb     source             12       483   13452
MT06_Chinese-to-English_NIST  wb     chf                12       483    9833
MT06_Chinese-to-English_NIST  wb     chg                12       483   10099
MT06_Chinese-to-English_NIST  wb     chh                12       483   10213
MT06_Chinese-to-English_NIST  wb     chi                12       483   10122

The token counts for Chinese data are "character" counts, which were
obtained by counting tokens matching the Unicode-aware regular expression
"\w". The token counts for all other languages included here are "word"
counts, which were obtained by counting tokens matching the Unicode-aware
regular expression "\w+". The Python "re" module was used to obtain these
counts.
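The token-counting convention above can be sketched as follows. This is an illustrative reconstruction, not the script used to produce the table; the helper name `count_tokens` and its `by_character` flag are assumptions introduced here.

```python
import re

def count_tokens(text, by_character=False):
    # Illustrative helper (not from the original scoring tools).
    # by_character=True counts each "\w" match individually, as done
    # for the Chinese data; by_character=False counts whole "\w+"
    # matches, as done for all other languages. Python 3's re module
    # treats "\w" as Unicode-aware by default.
    pattern = r"\w" if by_character else r"\w+"
    return len(re.findall(pattern, text))

# Word count, as used for non-Chinese data:
print(count_tokens("The quick brown fox"))            # 4
# Character count, as used for the Chinese data:
print(count_tokens("机器翻译评测", by_character=True))  # 6
```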