DARPA TIDES Machine Translation 2004 Evaluation Sets
====================================================

This set contains the evaluation sets (source data and human reference
translations), DTDs, scoring software, and evaluation plans from the
DARPA TIDES Machine Translation 2004 Evaluation.

Please refer to the evaluation plan included in this package for details 
on how the evaluation was run.

A test set consists of two files: a source file and a reference file.  The
file name reflects the evaluation year, source language, test set (by
default, "evalset"), version of the data, and whether the file is a source
or a reference file (reference files are marked with "-ref").  Each reference
file contains four independent reference translations of the data set unless
noted otherwise under "Package Contents" below.

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data until 
2008 and XML-formatted test data thereafter.  The test sets in this package 
are provided in both formats.

Please contact mt_poc@nist.gov with questions.


Package Contents
----------------

README.txt
This file.

Evaluation plan:
DARPATIDESMT04EvalPlan_v2-1.pdf

Test sets:
Arabic-to-English
Chinese-to-English

DTD:
mteval-v1.1.dtd

Scoring utility:
mteval-v11a.pl


Data Set Statistics
-------------------

Data genres:
nw = newswire
ps = prepared speech

Test set                 Genre  Source/Ref  Documents  Segments  Tokens
MT04_Arabic-to-English   nw     source            150      1075   24257
MT04_Arabic-to-English   nw     ahd               150      1075   30892
MT04_Arabic-to-English   nw     ahi               150      1075   32224
MT04_Arabic-to-English   nw     ahj               150      1075   32656
MT04_Arabic-to-English   nw     ahm               150      1075   32411
MT04_Arabic-to-English   ps     source             50       278    8237
MT04_Arabic-to-English   ps     ahd                50       278   11612
MT04_Arabic-to-English   ps     ahi                50       278   11969
MT04_Arabic-to-English   ps     ahj                50       278   11875
MT04_Arabic-to-English   ps     ahm                50       278   12230
MT04_Chinese-to-English  nw     source            150      1350   56697
MT04_Chinese-to-English  nw     cha               150      1350   39970
MT04_Chinese-to-English  nw     chc               150      1350   39069
MT04_Chinese-to-English  nw     che               150      1350   39141
MT04_Chinese-to-English  nw     chf               150      1350   40304
MT04_Chinese-to-English  ps     source             50       438   18707
MT04_Chinese-to-English  ps     cha                50       438   14764
MT04_Chinese-to-English  ps     chc                50       438   13956
MT04_Chinese-to-English  ps     che                50       438   13943
MT04_Chinese-to-English  ps     chf                50       438   14359

The token counts for Chinese data (the Chinese source files) are "character"
counts, obtained by counting matches of the Unicode-aware regular expression
"\w".
The token counts for all other data included here (the Arabic source files
and the English reference translations) are "word" counts, obtained by
counting matches of the Unicode-aware regular expression "\w+".
The Python "re" module was used to obtain these counts.
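
The counting rule above can be sketched roughly as follows.  This is a
minimal illustration, not the script that produced the table: the function
name "count_tokens" and its "chinese" flag are hypothetical, and the sketch
assumes the input is plain segment text with any SGML/XML markup already
removed (this README does not say how markup was handled).

```python
import re

# Unicode-aware patterns. re.UNICODE is the default for str patterns in
# Python 3; it is passed explicitly here only to make the intent visible.
CHAR_PATTERN = re.compile(r"\w", re.UNICODE)   # one match per word character
WORD_PATTERN = re.compile(r"\w+", re.UNICODE)  # one match per run of word characters


def count_tokens(text, chinese=False):
    """Count tokens in plain text (hypothetical helper, for illustration).

    chinese=True  -> "character" count: matches of \\w
    chinese=False -> "word" count: matches of \\w+
    """
    pattern = CHAR_PATTERN if chinese else WORD_PATTERN
    return len(pattern.findall(text))


if __name__ == "__main__":
    print(count_tokens("Hello, world!"))            # word count
    print(count_tokens("你好，世界", chinese=True))  # character count
```

Note that "\w" also matches digits and the underscore, and that punctuation
splits words, so a hyphenated form such as "state-of-the-art" counts as four
words under this rule.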