2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set, Linguistic Data Consortium (LDC) catalog number LDC2011T05 and ISBN 1-58563-575-8, is a package containing source data, reference translations, machine translations and associated human judgments used in the NIST 2008 and 2010 MetricsMaTr evaluations. The package was compiled by researchers at NIST, making use of Arabic and Chinese broadcast, newswire and web data and reference translations collected and developed by LDC for Phase 2 and Phase 2.5 of the DARPA GALE program.
NIST MetricsMaTr is a series of research challenge events for machine translation (MT) metrology, promoting the development of innovative MT metrics that correlate highly with human assessments of MT quality. Participants submit their metrics to NIST (National Institute of Standards and Technology). NIST runs those metrics on held-back test data for which it has human assessments of translation quality and then calculates correlations between the automatic metric scores and the human assessments. Specifically, the goals of MetricsMaTr are: to inform other MT technology evaluation campaigns and conferences with regard to improved metrology; to establish an infrastructure that encourages the development of innovative metrics; to build a diverse community that will bring new perspectives to MT metrology research; and to provide a forum for MT metrology discussion and for establishing future directions of MT metrology.
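The correlation step described above can be sketched as follows. This is an illustrative example only, not the evaluation's actual scoring code: the function name, the sample scores, and the use of Pearson correlation at the segment level are assumptions for the sake of the sketch.

```python
# Hypothetical sketch: correlating automatic metric scores with human
# assessments. All names and numbers are illustrative, not taken from
# the actual MetricsMaTr evaluation pipeline.

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# Example: a metric's segment scores vs. 7-point adequacy judgments
metric_scores = [0.41, 0.55, 0.62, 0.30, 0.78]
adequacy_judgments = [3, 4, 5, 2, 6]
r = pearson_correlation(metric_scores, adequacy_judgments)
```

A metric whose scores rank segments the same way the human judges do yields a correlation near 1; MetricsMaTr ranks submitted metrics by how strongly they correlate with such assessments.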
The first MetricsMaTr challenge was held in 2008; the development data from the 2008 program is available from LDC as 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data (LDC2009T05). The MetricsMaTr10 evaluation plan is included in this release.
This release contains 149 documents with corresponding reference translations (Arabic-to-English and Chinese-to-English), system translations and human assessments. The human assessments include the following: Adequacy7 (a 7-point scale for judging the meaning of a system translation with respect to the reference translation); Adequacy Yes/No (whether the given system segment means essentially the same as the reference translation); Preference (the judge's preference between two candidate translations when compared to a human reference translation); and HTER (Human-targeted Translation Edit Rate, the number of human edits needed to make a system translation convey the same meaning as a reference translation).
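An HTER-style score can be sketched as the number of word-level edits needed to turn the system output into the human-targeted reference, divided by the reference length. The sketch below is a simplification under stated assumptions: real HTER (like TER) also allows block shifts of word sequences, which this word-level Levenshtein distance omits, and the function names are illustrative.

```python
# Hypothetical sketch of an HTER-style score: word edits needed to turn
# a system translation into a human-targeted reference, normalized by
# reference length. Simplification: real HTER also permits block shifts;
# this uses plain word-level Levenshtein distance.

def word_edit_distance(hyp, ref):
    """Minimum insertions/deletions/substitutions between two token lists."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all remaining hypothesis words
    for j in range(n + 1):
        d[0][j] = j  # insert all remaining reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

def hter_sketch(system_output, targeted_reference):
    hyp = system_output.split()
    ref = targeted_reference.split()
    return word_edit_distance(hyp, ref) / len(ref)

# Example: one substitution against a 6-word reference -> 1/6
score = hter_sketch("the cat sits on the mat", "the cat sat on the mat")
```

Lower scores indicate less post-editing effort, so a translation that already matches the targeted reference scores 0.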
Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2011T05.
This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
Portions © 2006 Abu Dhabi TV, Agence France Presse, Al-Ahram, Al Alam News Channel, Al Arabiya, Al Hayat, Al Iraqiyah, Al Quds-Al Arabi, An Nahar, Asharq al-Awsat, China Central TV, China Military Online, Chinanews.com, Guangming Daily, Kuwait TV, New Tang Dynasty TV, Nile TV, PAC, Ltd, People's Daily Online, Phoenix TV, Syria TV, Xinhua News Agency; © 2006, 2011 Trustees of the University of Pennsylvania. Contact: firstname.lastname@example.org. © 2011 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.