NIST Open Machine Translation 2008 to 2012 Progress Test Sets
=============================================================

This set contains the evaluation sets (source data and human reference
translations), DTD, scoring software, and evaluation plans for the
Arabic-to-English and Chinese-to-English Progress tests of the NIST Open
Machine Translation 2008, 2009, and 2012 Evaluations. The test data
remained unseen between evaluations and was reused unchanged each time.

The test sets consist of newswire and web data from July 2007. Please
refer to the evaluation plans included in this package for more details.

A test set consists of two files: a source file and a reference file.
Each reference file contains four independent human reference
translations of the source data. The test sets in this package are in
XML format compliant with the included DTD.

Please contact mt_poc@nist.gov with questions. Please visit the NIST
OpenMT website, http://www.nist.gov/itl/iad/mig/openmt.cfm, for general
information on the NIST OpenMT evaluations.

Package Contents
----------------

README.txt - this file

Evaluation plans:
  OpenMT08_EvalPlan.pdf
  OpenMT09_EvalPlan.pdf
  OpenMT12_EvalPlan.pdf

Scoring utility:
  mteval-v13a-20091001.tar.gz

DTD:
  mteval-xml-v1.6.dtd

Test sets (src = source, ref = human reference translations):
  Arabic-to-English:  OpenMT08-12_Progress_ara2eng-[src|ref].xml
  Chinese-to-English: OpenMT08-12_Progress_chi2eng-[src|ref].xml

Data Set Statistics
-------------------

Data genres:
  nw = newswire
  wb = web data

  Source    Genre    Documents    Segments    Source tokens
  Arabic    nw       84           784         20039
  Arabic    wb       51           594         14793
  Chinese   nw       82           688         26923
  Chinese   wb       40           682         19112

The token counts for the Chinese data are "character" counts, which were
obtained by counting tokens matching the Unicode-based regular
expression "\w". The token counts for all other languages included here
are "word" counts, which were obtained by counting tokens matching the
Unicode-based regular expression "\w+".
The Python "re" module was used to obtain these counts.
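The counting procedure above can be sketched as follows. This is a minimal illustration of the described method, not the actual NIST counting script; the function name and the sample strings are hypothetical. In Python 3, "\w" is Unicode-aware by default for str patterns, so it matches CJK characters as well as Latin word characters.

```python
import re

def count_tokens(text: str, lang: str) -> int:
    """Count source tokens as described above: per-character matches of
    \\w for Chinese (character counts), whole-token matches of \\w+
    otherwise (word counts). Hypothetical helper, for illustration."""
    pattern = r"\w" if lang == "Chinese" else r"\w+"
    # In Python 3, str patterns are Unicode-aware by default,
    # so \w matches CJK characters as well as [A-Za-z0-9_].
    return len(re.findall(pattern, text))

# Hypothetical sample segments, not drawn from the actual test sets:
print(count_tokens("The quick brown fox", "English"))  # 4 word tokens
print(count_tokens("新闻报道", "Chinese"))              # 4 character tokens
```

Punctuation is excluded in both modes, since "\w" does not match punctuation characters.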