NIST Open Machine Translation 2009 Evaluation Sets
==================================================

This package contains the evaluation sets (source data and human reference 
translations), DTDs, scoring software, and evaluation plans from the Current 
tests of the NIST Open Machine Translation 2009 Evaluation.  Progress tests 
are not included.

Please refer to the evaluation plan included in this package for details 
on how the evaluation was run.

A test set consists of two files, a source file and a reference file.  The 
evaluation year, source language, test set (which, by default, is "evalset"), 
version of the data, and source vs. reference file (with the latter being 
indicated by "-ref") are reflected in the file name.  A reference file 
contains four independent reference translations unless noted otherwise under 
"Package Contents" below.

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data until 
2008 and XML-formatted test data thereafter.  The test sets in this package 
are provided in both formats.

Please contact mt_poc@nist.gov with questions.


Package Contents
----------------

README.txt
This file.

Evaluation plan:
NISTOpenMT09EvalPlan_v2d.pdf

Test sets:
Arabic-to-English, Current test
Urdu-to-English, Current test

Scoring utility:
mteval-v13a-20091001.tar.gz

DTD:
mteval-xml-v1.3.dtd


Data Set Statistics
-------------------

Data genres:
nw = newswire
wb = web data

Test set		Genre	Source/Ref	Documents	Segments	Tokens
MT09_Arabic-to-English	nw	source		68		586		15845
MT09_Arabic-to-English	nw	reference01	68		586		21470
MT09_Arabic-to-English	nw	reference02	68		586		21389
MT09_Arabic-to-English	nw	reference03	68		586		20337
MT09_Arabic-to-English	nw	reference04	68		586		20534
MT09_Arabic-to-English	wb	source		67		727		15102
MT09_Arabic-to-English	wb	reference01	67		727		21574
MT09_Arabic-to-English	wb	reference02	67		727		20933
MT09_Arabic-to-English	wb	reference03	67		727		22242
MT09_Arabic-to-English	wb	reference04	67		727		21301
MT09_Urdu-to-English	nw	source		72		764		20167
MT09_Urdu-to-English	nw	reference01	72		764		17560
MT09_Urdu-to-English	nw	reference02	72		764		18229
MT09_Urdu-to-English	nw	reference03	72		764		16744
MT09_Urdu-to-English	nw	reference04	72		764		16078
MT09_Urdu-to-English	wb	source		166		1028		20016
MT09_Urdu-to-English	wb	reference01	166		1028		19022
MT09_Urdu-to-English	wb	reference02	166		1028		19441
MT09_Urdu-to-English	wb	reference03	166		1028		18957
MT09_Urdu-to-English	wb	reference04	166		1028		18115

The token counts for Chinese data are "character" counts, which were obtained 
by counting tokens matching the UNICODE-based regular expression "\w".
The token counts for all other languages included here are "word" counts, 
which were obtained by counting tokens matching the UNICODE-based regular 
expression "\w+".
The Python "re" module was used to obtain these counts.
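
The counting rules above can be sketched as follows.  This is a minimal 
illustration only, not the script used to produce the table: the function 
names and the sample string are assumptions, and the sketch is applied to 
plain text rather than to the SGML/XML test files.

```python
import re

# Python 3 str patterns are Unicode-aware by default; re.UNICODE is passed
# explicitly only to mirror the "UNICODE-based" wording above.
WORD_RE = re.compile(r"\w+", re.UNICODE)  # maximal runs of word characters
CHAR_RE = re.compile(r"\w", re.UNICODE)   # single word characters


def count_word_tokens(text):
    """Return the "word" count: number of matches of \\w+ in text."""
    return len(WORD_RE.findall(text))


def count_char_tokens(text):
    """Return the "character" count: number of matches of \\w in text."""
    return len(CHAR_RE.findall(text))


if __name__ == "__main__":
    sample = "Hello, world! It's 2009."  # hypothetical sample text
    print(count_word_tokens(sample))
    print(count_char_tokens(sample))
```

Under these rules punctuation is never counted, and an apostrophe splits a 
word: "It's" contributes two "word" tokens ("It" and "s").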

NOTE:
At the time this package was prepared, the following documents from the 
Arabic-to-English test set were sequestered for purposes of the DARPA GALE 
program, to remain sequestered until the end of the program:
AAW_ARB_20070601.0045-S1
AAW_ARB_20070625.0007-S1
AAW_ARB_20070626.0033-S1
AFP_ARB_20070602.0055-S1
AFP_ARB_20070610.0044-S1
AFP_ARB_20070621.0095-S1
AFP_ARB_20070622.0033-S1
AHR_ARB_20070606.0027-S1
AHR_ARB_20070613.0053-S2
AHR_ARB_20070626.0039-S1
HYT_ARB_20070603.0063-S1
HYT_ARB_20070616.0026-S1
HYT_ARB_20070624.0076-S1
XIN_ARB_20070627.0192-S1
arb-NG-2-76513-7428240-S1
arb-NG-3-113171-8736074-S1
arb-NG-31-125296-7108159-S1
arb-WL-1-152155-8412214-S1
arb-WL-1-152395-8439233-S1
arb-WL-1-152503-8412899-S1
arb-WL-1-152503-8412907-S1
arb-WL-1-152602-7790697-S1
arb-WL-1-152768-8431180-S1
arb-WL-1-152788-8472519-S1
arb-WL-1-152938-7795359-S1
arb-WL-1-153200-7246734-S1
arb-WL-1-154596-7271776-S1
arb-WL-1-154914-7769518-S1