GALE MetricsMaTr 2008 and 2010 Evaluation Data
==============================================

This package contains the GALE portions of the evaluation data used in
NIST's Metrics for Machine Translation (MetricsMaTr) 2008 and 2010
evaluations. It consists of GALE phase 2 and phase 2.5 Arabic-to-English
and Chinese-to-English source data, reference translations, machine
translations, and associated human assessments in the following genres:

  BC (broadcast conversation)
  BN (broadcast news)
  NW (newswire)
  WB (web data)

Included human assessments
--------------------------

Adequacy7:
For a given segment, judges selected a point on a 7-point scale (with 7
being the best score) to answer the question of how much of the meaning
expressed in the reference translation was also expressed in the system
translation. Segments were presented in the order in which they appeared
in the document. Each segment was assessed by two judges. The Adequacy7
scores represent the average of all judges' scores for a given segment.
For document- and system-level values, an average of the segment-level
scores weighted by the number of reference tokens was used.

Adequacy Yes/No:
As a follow-up question to the Adequacy7 question, judges made a binary
yes/no decision about whether the system translation for the given
segment meant essentially the same as the reference translation. Each
segment was assessed by two judges. The AdequacyYN scores represent the
ratio of the number of "Yes" judgments for a given segment to the total
number of judgments, across all judges. Counts were aggregated to obtain
document- and system-level ratios.

Preference:
Judges selected their preference between two candidate translations when
compared to a human reference translation. Judges were presented with all
possible pairwise comparisons for a given segment, with segments presented
in the order in which they appeared in the document and with the order of
system comparisons randomized. Judges could select a preference for one of
the two system translations, or they could choose no preference when the
translations were equally good or equally bad. Each possible pairwise
comparison of each segment was assessed by at least one judge. The target
Preference scores represent the number of times a given system segment was
preferred, divided by the total number of comparisons involving the system.
Counts were aggregated to obtain document- and system-level ratios.

HTER (Human-Targeted Edit Rate):
For the HTER annotations, a human assessor compared a system translation to
a reference translation and edited the system translation so that it had
the same meaning as the reference translation, making as few edits as
possible. The number of needed edits (insertions, deletions, substitutions,
and shifts) was then measured automatically using Snover et al.'s (2006)
TER (Translation Edit Rate) measure, version tercom.7.25, for the
segment-level scores. Document- and system-level scores are unweighted
averages of segment-level scores (NOTE: this differs from the calculation
used on the original MetricsMaTr set).
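For illustration only, the following is a minimal sketch of the aggregation
arithmetic described above. It is not the official scoring code, and all
function names and arguments are hypothetical.

    # Sketch of the score aggregation described above (not the official
    # scoring code). All names and example inputs are hypothetical.

    def adequacy7_segment(judge_scores):
        # Segment-level Adequacy7: average of all judges' scores.
        return sum(judge_scores) / len(judge_scores)

    def adequacy7_document(seg_scores, ref_token_counts):
        # Document- and system-level Adequacy7: average of segment-level
        # scores weighted by the number of reference tokens per segment.
        total_tokens = sum(ref_token_counts)
        return sum(s * n for s, n in zip(seg_scores, ref_token_counts)) / total_tokens

    def adequacy_yn_ratio(yes_count, total_judgments):
        # AdequacyYN: "Yes" judgments divided by all judgments; counts are
        # summed before dividing at the document and system level.
        return yes_count / total_judgments

    def preference_ratio(times_preferred, total_comparisons):
        # Preference: times a system segment was preferred, divided by the
        # total number of pairwise comparisons involving that system.
        return times_preferred / total_comparisons

    def hter_document(seg_hter_scores):
        # Document- and system-level HTER: unweighted average of the
        # segment-level TER scores (differs from the original MetricsMaTr set).
        return sum(seg_hter_scores) / len(seg_hter_scores)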
Package contents
----------------

./README.txt: this file
./mteval-xml-v1.5.dtd: the NIST MT XML DTD that the source, system, and
    reference translations comply with
./NISTMetricsMaTr10EvalPlan.pdf: the MetricsMaTr10 evaluation plan

The directories containing the source data, system translations, reference
translations, and human assessments
(GALEPHASE/SourceLanguage-TargetLanguage/GENRE):

./GALEP2/Arabic-English/NW/   (22 documents, 207 segments)
./GALEP2/Arabic-English/WB/   (23 documents, 262 segments)
./GALEP2/Chinese-English/NW/  (25 documents, 183 segments)
./GALEP2/Chinese-English/WB/  (22 documents, 209 segments)
./GALEP25/Arabic-English/BN/  (15 documents, 159 segments)
./GALEP25/Chinese-English/BC/ (21 documents, 267 segments)
./GALEP25/Chinese-English/BN/ (21 documents, 221 segments)

Each bottom-level directory contains:

  source.xml: the source data
  system.xml: the system translations (3 systems)
  reference.xml: the reference translation (1 reference)

  The raw (non-consolidated, in case of multiple judgments per segment and
  translation) human assessment scores:
    Adequacy.csv (containing both Adequacy7 and AdequacyYN judgments)
    Preferences.csv (a score of "3" indicates a "no preference" judgment)

  The target_*-[seg|doc|sys].scr files, which contain consolidated human
  judgment scores (in case of multiple judgments) at the segment, document,
  and system level:
    target_Adequacy7Average-doc.scr
    target_Adequacy7Average-seg.scr
    target_Adequacy7Average-sys.scr
    target_AdequacyYNRatio-doc.scr
    target_AdequacyYNRatio-seg.scr
    target_AdequacyYNRatio-sys.scr
    target_HTER-doc.scr
    target_HTER-seg.scr
    target_HTER-sys.scr
    target_Preferences-doc.scr
    target_Preferences-seg.scr
    target_Preferences-sys.scr

The target human assessment files contain tab-separated information in the
following fields:

  segment-level scores:  testSetId  systemId  documentId  segmentId  score
  document-level scores: testSetId  systemId  documentId  score
  system-level scores:   testSetId  systemId  score
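As an illustration of how a segment-level .scr file might be read, here is a
minimal sketch assuming the tab-separated layout listed above; the file path
in the usage example is hypothetical.

    # Sketch for reading a segment-level target_*-seg.scr file, assuming the
    # tab-separated fields described above (no header row is assumed).
    import csv

    def read_segment_scores(path):
        # Returns {(testSetId, systemId, documentId, segmentId): score}
        scores = {}
        with open(path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                test_set, system, document, segment, score = row
                scores[(test_set, system, document, segment)] = float(score)
        return scores

    # Usage (hypothetical path):
    # seg = read_segment_scores("GALEP2/Arabic-English/NW/target_HTER-seg.scr")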