GALE MetricsMaTr 2008 and 2010 Evaluation Data
==============================================

This package contains the GALE portions of the evaluation data used in
NIST's Metrics for Machine Translation (MetricsMaTr) 2008 and 2010
evaluations. It consists of GALE phase 2 and phase 2.5 Arabic-to-English
and Chinese-to-English source data, reference translations, machine
translations, and associated human assessments in the following genres:

  BC (broadcast conversation)
  BN (broadcast news)
  NW (newswire)
  WB (web data)

Included human assessments
--------------------------

Adequacy7:
For a given segment, judges selected a point on a 7-point scale (with 7
being the best score) to answer the question of how much of the meaning
expressed in the reference translation was also expressed in the system
translation. Segments were presented in the order in which they appeared
in the document. Each segment was assessed by two judges. The Adequacy7
scores represent the average of all judges' scores for a given segment.
For document- and system-level values, an average of the segment-level
scores weighted by the number of reference tokens was used.

Adequacy Yes/No:
As a follow-up question to the Adequacy7 question, judges made a binary
yes/no decision about whether the system translation for the given
segment meant essentially the same as the reference translation. Each
segment was assessed by two judges. The AdequacyYN scores represent the
ratio of the number of "Yes" judgments for a given segment to the total
number of judgments, across all judges. Counts were aggregated to obtain
document- and system-level ratios.

Preference:
Judges selected their preference between two candidate translations when
compared to a human reference translation. Judges were presented with all
possible pairwise comparisons for a given segment, with segments presented
in the order in which they appeared in the document and with the order of
system comparisons randomized. Judges could select a preference for one of
the two system translations, or they could choose no preference when the
translations were equally good or equally bad. Each possible pairwise
comparison of each segment was assessed by at least one judge. The target
Preference scores represent the number of times a given system segment was
preferred, divided by the total number of comparisons involving the system.
Counts were aggregated to obtain document- and system-level ratios.

HTER (Human-Targeted Edit Rate):
For the HTER annotations, a human assessor compared a system translation to
a reference translation and edited the system translation so that it had
the same meaning as the reference translation, making as few edits as
possible. The number of needed edits (insertions, deletions, substitutions,
and shifts) was then measured automatically using Snover et al.'s (2006)
TER (Translation Edit Rate) measure, version tercom.7.25, for the
segment-level scores. Document- and system-level scores are unweighted
averages of segment-level scores (NOTE: this differs from the calculation
used on the original MetricsMaTr set).
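For illustration only, the following is a minimal sketch of the aggregation
arithmetic described above. It is not the official scoring code, and all
function names and arguments are hypothetical.

    # Sketch of the score aggregation described above (not the official
    # scoring code). All names and example inputs are hypothetical.

    def adequacy7_segment(judge_scores):
        # Segment-level Adequacy7: average of all judges' scores.
        return sum(judge_scores) / len(judge_scores)

    def adequacy7_document(seg_scores, ref_token_counts):
        # Document- and system-level Adequacy7: average of segment-level
        # scores weighted by the number of reference tokens per segment.
        total_tokens = sum(ref_token_counts)
        return sum(s * n for s, n in zip(seg_scores, ref_token_counts)) / total_tokens

    def adequacy_yn_ratio(yes_count, total_judgments):
        # AdequacyYN: "Yes" judgments divided by all judgments; counts are
        # summed before dividing at the document and system level.
        return yes_count / total_judgments

    def preference_ratio(times_preferred, total_comparisons):
        # Preference: times a system segment was preferred, divided by the
        # total number of pairwise comparisons involving that system.
        return times_preferred / total_comparisons

    def hter_document(seg_hter_scores):
        # Document- and system-level HTER: unweighted average of the
        # segment-level TER scores (differs from the original MetricsMaTr set).
        return sum(seg_hter_scores) / len(seg_hter_scores)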
Package contents
----------------

./README.txt: this file
./mteval-xml-v1.5.dtd: the NIST MT XML DTD that the source, system, and
    reference translations comply with
./NISTMetricsMaTr10EvalPlan.pdf: the MetricsMaTr10 evaluation plan

The directories containing the source data, system translations, reference
translations, and human assessments
(GALEPHASE/SourceLanguage-TargetLanguage/GENRE):

./GALEP2/Arabic-English/NW/   (22 documents, 207 segments)
./GALEP2/Arabic-English/WB/   (23 documents, 262 segments)
./GALEP2/Chinese-English/NW/  (25 documents, 183 segments)
./GALEP2/Chinese-English/WB/  (22 documents, 209 segments)
./GALEP25/Arabic-English/BN/  (15 documents, 159 segments)
./GALEP25/Chinese-English/BC/ (21 documents, 267 segments)
./GALEP25/Chinese-English/BN/ (21 documents, 221 segments)

Each bottom-level directory contains:

  source.xml: the source data
  system.xml: the system translations (3 systems)
  reference.xml: the reference translation (1 reference)

  The raw (non-consolidated, in case of multiple judgments per segment and
  translation) human assessment scores:
    Adequacy.csv (containing both Adequacy7 and AdequacyYN judgments)
    Preferences.csv (a score of "3" indicates a "no preference" judgment)

  The target_*-[seg|doc|sys].scr files, which contain consolidated human
  judgment scores (in case of multiple judgments) at the segment, document,
  and system level:
    target_Adequacy7Average-doc.scr
    target_Adequacy7Average-seg.scr
    target_Adequacy7Average-sys.scr
    target_AdequacyYNRatio-doc.scr
    target_AdequacyYNRatio-seg.scr
    target_AdequacyYNRatio-sys.scr
    target_HTER-doc.scr
    target_HTER-seg.scr
    target_HTER-sys.scr
    target_Preferences-doc.scr
    target_Preferences-seg.scr
    target_Preferences-sys.scr

The target human assessment files contain tab-separated information in the
following fields:

  segment-level scores:  testSetId  systemId  documentId  segmentId  score
  document-level scores: testSetId  systemId  documentId  score
  system-level scores:   testSetId  systemId  score
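As an illustration of how a segment-level .scr file might be read, here is a
minimal sketch assuming the tab-separated layout listed above; the file path
in the usage example is hypothetical.

    # Sketch for reading a segment-level target_*-seg.scr file, assuming the
    # tab-separated fields described above (no header row is assumed).
    import csv

    def read_segment_scores(path):
        # Returns {(testSetId, systemId, documentId, segmentId): score}
        scores = {}
        with open(path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                test_set, system, document, segment, score = row
                scores[(test_set, system, document, segment)] = float(score)
        return scores

    # Usage (hypothetical path):
    # seg = read_segment_scores("GALEP2/Arabic-English/NW/target_HTER-seg.scr")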