DARPA TIDES Machine Translation 2002 Evaluation Sets
====================================================

This set contains the evaluation sets (source data and human reference
translations), DTDs, scoring software, and evaluation plans from the DARPA
TIDES Machine Translation 2002 Evaluation. Please refer to the evaluation
plan included in this package for details on how the evaluation was run.

A test set consists of two files, a source file and a reference file. The
evaluation year, source language, test set (which, by default, is "evalset"),
version of the data, and source vs. reference file (with the latter indicated
by "-ref") are reflected in the file name. A reference file contains four
independent reference translations of the data set unless noted otherwise
under "Package Contents" below.

DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data
until 2008 and XML-formatted test data thereafter. The test sets in this
package are provided in both formats.

Please contact mt_poc@nist.gov with questions.

Package Contents
----------------

README.txt            This file.
Evaluation plan:
    DARPATIDESMT02EvalPlan_v1-3.pdf

Test sets:
    Arabic-to-English
    Chinese-to-English

Scoring utility:
    mteval-kit-v09.tar.gz

DTD:
    mteval-v1.1.dtd

Data Set Statistics
-------------------

Data genres: nw = newswire

Test set                 Genre  Source/Ref  Documents  Segments  Tokens
MT02_Arabic-to-English   nw     source      100        728       16428
MT02_Arabic-to-English   nw     ahd         100        728       20118
MT02_Arabic-to-English   nw     ahg         100        728       20130
MT02_Arabic-to-English   nw     ahh         100        728       19267
MT02_Arabic-to-English   nw     ahi         100        728       21431
MT02_Chinese-to-English  nw     source      100        878       36185
MT02_Chinese-to-English  nw     E01         100        878       25919
MT02_Chinese-to-English  nw     E02         100        878       24865
MT02_Chinese-to-English  nw     E03         100        878       23512
MT02_Chinese-to-English  nw     E04         100        878       22472

The token counts for the Chinese data are "character" counts, obtained by
counting tokens matching the Unicode-based regular expression "\w". The
token counts for all other languages included here are "word" counts,
obtained by counting tokens matching the Unicode-based regular expression
"\w+". The Python "re" module was used to obtain these counts.
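The counting method described above can be sketched as follows. This is a
hypothetical helper (the actual NIST counting script is not included in this
package); it only reproduces the stated rule: count matches of "\w" per
character for Chinese, and matches of "\w+" per word for other languages.

```python
import re

def count_tokens(text, per_character=False):
    # Per the README: Chinese token counts are character counts ("\w"),
    # all other languages use word counts ("\w+"). re.UNICODE is the
    # default for str patterns in Python 3; it is spelled out here for clarity.
    pattern = r"\w" if per_character else r"\w+"
    return len(re.findall(pattern, text, re.UNICODE))

# Word count for an English reference segment
print(count_tokens("The quick brown fox"))          # 4 tokens
# Character count, as used for the Chinese source data
print(count_tokens("机器翻译", per_character=True))  # 4 tokens
```

Note that punctuation is excluded in both modes, since "\w" matches only
word characters.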