NIST Open Machine Translation 2006 Evaluation Sets
==================================================

This package contains the evaluation sets (source data and human reference 
translations), DTDs, scoring software, and evaluation plans from the 
NIST Open Machine Translation 2006 Evaluation.

Please refer to the evaluation plan included in this package for details 
on how the evaluation was run.

A test set consists of two files, a source file and a reference file.  The 
evaluation year, source language, test set (which, by default, is "evalset"), 
version of the data, and source vs. reference file (with the latter being 
indicated by "-ref") are reflected in the file name.  A reference file
contains four independent reference translations unless noted otherwise under
"Package Contents" below.

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data until 
2008 and XML-formatted test data thereafter.  The test sets in this package 
are provided in both formats.

Please contact mt_poc@nist.gov with questions.


Package Contents
----------------

README.txt
This file.

Evaluation plan:
NISTOpenMT06EvalPlan_v4.pdf

Test sets:
Arabic-to-English, NIST part
Arabic-to-English, GALE part (2 reference translations)
Chinese-to-English, NIST part
Chinese-to-English, GALE part (2 reference translations)

Scoring utility:
mteval-v11b-2008-01-23.tar.gz

DTD:
mteval-v1.2.dtd


Data Set Statistics
-------------------

Data genres:
bc = broadcast conversation
bn = broadcast news
nw = newswire
wb = web data

Test set			Genre	Source/Ref		Documents	Segments	Tokens
MT06_Arabic-to-English_GALE	bc	source			13		529		10743
MT06_Arabic-to-English_GALE	bc	nvtc			13		529		15166
MT06_Arabic-to-English_GALE	bn	source			24		956		11386
MT06_Arabic-to-English_GALE	bn	nvtc			24		956		16272
MT06_Arabic-to-English_GALE	nw	source			51		474		10431
MT06_Arabic-to-English_GALE	nw	nvtc			51		474		13935
MT06_Arabic-to-English_GALE	wb	source			29		527		8705
MT06_Arabic-to-English_GALE	wb	nvtc			29		527		12904
MT06_Arabic-to-English_NIST	bn	source			13		268		5886
MT06_Arabic-to-English_NIST	bn	ahn			13		268		8056
MT06_Arabic-to-English_NIST	bn	ahp			13		268		8477
MT06_Arabic-to-English_NIST	bn	ahq			13		268		7926
MT06_Arabic-to-English_NIST	bn	ahr			13		268		8058
MT06_Arabic-to-English_NIST	nw	source			73		765		20649
MT06_Arabic-to-English_NIST	nw	ahn			73		765		26143
MT06_Arabic-to-English_NIST	nw	ahp			73		765		28139
MT06_Arabic-to-English_NIST	nw	ahq			73		765		27145
MT06_Arabic-to-English_NIST	nw	ahr			73		765		26896
MT06_Arabic-to-English_NIST	wb	source			18		764		9960
MT06_Arabic-to-English_NIST	wb	ahn			18		764		13877
MT06_Arabic-to-English_NIST	wb	ahp			18		764		14840
MT06_Arabic-to-English_NIST	wb	ahq			18		764		14108
MT06_Arabic-to-English_NIST	wb	ahr			18		764		13519
MT06_Chinese-to-English_GALE	bc	source			11		979		18501
MT06_Chinese-to-English_GALE	bc	nvtc			11		979		13083
MT06_Chinese-to-English_GALE	bn	source			18		518		19060
MT06_Chinese-to-English_GALE	bn	nvtc			18		518		14155
MT06_Chinese-to-English_GALE	nw	source			36		364		14112
MT06_Chinese-to-English_GALE	nw	nvtc			36		364		10469
MT06_Chinese-to-English_GALE	wb	source			19		415		13049
MT06_Chinese-to-English_GALE	wb	nvtc			19		415		10571
MT06_Chinese-to-English_NIST	bn	source			15		565		17906
MT06_Chinese-to-English_NIST	bn	chf			15		565		11845
MT06_Chinese-to-English_NIST	bn	chg			15		565		11639
MT06_Chinese-to-English_NIST	bn	chh			15		565		12152
MT06_Chinese-to-English_NIST	bn	chi			15		565		12128
MT06_Chinese-to-English_NIST	nw	source			52		616		27705
MT06_Chinese-to-English_NIST	nw	chf			52		616		20192
MT06_Chinese-to-English_NIST	nw	chg			52		616		21467
MT06_Chinese-to-English_NIST	nw	chh			52		616		20055
MT06_Chinese-to-English_NIST	nw	chi			52		616		21181
MT06_Chinese-to-English_NIST	wb	source			12		483		13452
MT06_Chinese-to-English_NIST	wb	chf			12		483		9833
MT06_Chinese-to-English_NIST	wb	chg			12		483		10099
MT06_Chinese-to-English_NIST	wb	chh			12		483		10213
MT06_Chinese-to-English_NIST	wb	chi			12		483		10122

The token counts for Chinese data are character counts, obtained by counting 
matches of the Unicode-aware regular expression "\w".  The token counts for 
all other languages included here are word counts, obtained by counting 
matches of the Unicode-aware regular expression "\w+".  The Python "re" 
module was used to obtain these counts.
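
The counting rule described above can be sketched as follows.  This is a
minimal illustration, not the script that produced the table.  The function
name "count_tokens" and its "chinese" flag are hypothetical, and the sketch
assumes the input is plain text whose markup has already been removed; how
the original counts handled SGML/XML tags is not stated in this file.

```python
import re

# Hypothetical helper: counts tokens the way the README describes.
# Assumption: "text" is plain text with any SGML/XML markup already stripped.
def count_tokens(text, chinese=False):
    # Chinese data: every word character counts as one token ("\w").
    # All other languages: every run of word characters counts as one
    # token ("\w+").
    pattern = r"\w" if chinese else r"\w+"
    # re.UNICODE is already the default for str patterns in Python 3;
    # it is passed explicitly to mirror the description in the text.
    return len(re.findall(pattern, text, flags=re.UNICODE))


if __name__ == "__main__":
    print(count_tokens("Hello, world!"))            # word count
    print(count_tokens("你好，世界", chinese=True))  # character count
```

Punctuation is not matched by "\w", so it is excluded from both kinds of
count.  The underscore is matched by "\w" and would therefore be counted.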