DARPA TIDES Machine Translation 2005 Evaluation Sets
====================================================

This set contains the evaluation sets (source data and human reference
translations), DTDs, scoring software, and evaluation plans from the
DARPA TIDES Machine Translation 2005 Evaluation. Please refer to the
evaluation plan included in this package for details on how the
evaluation was run.

A test set consists of two files: a source file and a reference file.
The file name reflects the evaluation year, source language, test set
(by default, "evalset"), version of the data, and whether the file is
the source or the reference file (the latter indicated by "-ref"). A
reference file contains four independent reference translations unless
noted otherwise under "Package Contents" below.

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test
data until 2008 and XML-formatted test data thereafter. The test sets
in this package are provided in both formats.

Please contact mt_poc@nist.gov with questions.

Package Contents
----------------

README.txt
    This file.

Evaluation plan:
    DARPATIDESMT05EvalPlan_v1-1.pdf

Test sets:
    Arabic-to-English
    Chinese-to-English

Scoring utility:
    mteval-v11b-2008-01-23.tar.gz

DTD:
    mteval-v1.1.dtd

Data Set Statistics
-------------------

Data genres: nw = newswire

Test set                  Genre  Source/Ref  Documents  Segments  Tokens
MT05_Arabic-to-English    nw     source            100      1056   25700
MT05_Arabic-to-English    nw     ahd               100      1056   31973
MT05_Arabic-to-English    nw     ahn               100      1056   31499
MT05_Arabic-to-English    nw     ahp               100      1056   34138
MT05_Arabic-to-English    nw     ahq               100      1056   32430
MT05_Chinese-to-English   nw     source            100      1082   47444
MT05_Chinese-to-English   nw     chc               100      1082   31742
MT05_Chinese-to-English   nw     chf               100      1082   31537
MT05_Chinese-to-English   nw     chg               100      1082   32132
MT05_Chinese-to-English   nw     chh               100      1082   31421

The token counts for the Chinese data are "character" counts, obtained
by counting tokens matching the Unicode-based regular expression "\w".
The token counts for all other languages included here are "word"
counts, obtained by counting tokens matching the Unicode-based regular
expression "\w+". The Python "re" module was used to obtain these
counts.
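
For illustration, the counting method described above can be reproduced
with a short Python script along the following lines. This is a minimal
sketch: the function names and sample strings are illustrative and are
not part of this package.

    import re

    # "Word" counts (all languages except Chinese): one token per
    # match of the Unicode-aware regular expression \w+.
    def count_words(text):
        return len(re.findall(r"\w+", text, re.UNICODE))

    # "Character" counts (Chinese): one token per match of the
    # single-character class \w.
    def count_characters(text):
        return len(re.findall(r"\w", text, re.UNICODE))

    print(count_words("The DARPA TIDES MT 2005 evaluation"))  # 6
    print(count_characters("机器翻译评测"))                    # 6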