2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
|Mark Przybocki, Kay Peterson, Sébastien Bronsart
|LDC Catalog No.: LDC2009T05
|March 17, 2009
|NIST MT, GALE
|natural language processing, machine translation, machine learning
|English, Standard Arabic, Arabic
|eng, arb, ara
LDC User Agreement for Non-Members
|Subscription & Standard Members, and Non-Members
|Przybocki, Mark, Kay Peterson, and Sébastien Bronsart. 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data LDC2009T05. Web Download. Philadelphia: Linguistic Data Consortium, 2009.
NIST MetricsMATR is a series of research challenge events for machine translation (MT) metrology, promoting the development of innovative, even revolutionary, MT metrics that correlate highly with human assessments of MT quality. In this program, participants submit their metrics to the National Institute of Standards and Technology (NIST). NIST runs those metrics on certain held-back test data for which it has human assessments measuring quality and then calculates correlations between the automatic metric scores and the human assessments.
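The correlation step described above can be illustrated with a minimal sketch. It assumes one automatic metric score and one human assessment score per segment, and it uses Pearson and Spearman coefficients as common choices; the text here does not state which coefficients NIST computes. The function names and the sample values are hypothetical and are not taken from the release.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-segment scores (illustration only, not from the corpus).
human_scores = [4, 2, 5, 1, 3]
metric_scores = [0.61, 0.35, 0.58, 0.20, 0.44]
r = pearson(metric_scores, human_scores)
rho = spearman(metric_scores, human_scores)
```

A metric that tracks the human assessments closely yields coefficients near 1; the rank-based variant is less sensitive to the scale of the metric's scores.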
This release contains the development data received by participants in NIST Metrics for Machine Translation 2008 Evaluation (MetricsMATR08). Specifically, this corpus comprises a subset of the materials used in the NIST Open MT06 evaluation and includes human reference translations, system translations, and human assessments of adequacy and preference. The source data consists of twenty-five Arabic language newswire documents with a total of 249 segments. The data in each segment includes four human reference translations in English and system translations from eight different MT06 machine translation systems. In addition to the data and reference translations, this release includes software tools for evaluation and reporting and documentation describing how the human assessments were obtained and how they are represented in the data. The evaluation plan contains further information and rules on the use of this data.
The MetricsMATR program seeks to overcome several drawbacks of the methods employed for the evaluation of MT technology. Automatic metrics have not yet proved able to predict the usefulness and reliability of MT technologies with confidence. Nor have automatic metrics demonstrated that they are meaningful in target languages other than English. Human assessments, however, are expensive, slow, subjective and difficult to standardize. These problems, and the need to overcome them through the development of improved automatic (or even semi-automatic) metrics, have been a constant point of discussion at past NIST MT evaluation events. MetricsMATR aims to provide a platform to address these shortcomings. Specifically, the goals of MetricsMATR are:
- To inform other MT technology evaluation campaigns and conferences with regard to improved metrology.
- To establish an infrastructure that encourages the development of innovative metrics.
- To build a diverse community that will bring new perspectives to MT metrology research.
- To provide a forum for MT metrology discussion and for establishing future directions of MT metrology.
The MetricsMATR08 development data set released here reflects the test data set only to a degree; the evaluation data set contains more varied data (from more genres, more source languages, more systems and different evaluations) than this development data set. There are also more types of human assessments for the test data. The MetricsMATR08 test data remains unseen to allow for repeated use as test data.
The software used for obtaining the human judgments included in this data set is the same software used for the NIST Open MT08 human assessments. It includes a description of the adequacy and preference assessment tasks and the instructions given to the judges. All segments assessed were judged by two independent judges. Adequacy judgments were performed for all segments of each document. Preference judgments were performed for the first four segments of each document such that full pairwise comparisons between all eight MT systems were obtained. All judgments were performed against only one reference translation. Each reported score is an adjudicated score over the two individual judgments.
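To give a sense of what a full pairwise comparison between eight systems involves, the short sketch below enumerates the unordered system pairs. The system labels are hypothetical placeholders, not the identifiers used in the release, and the sketch assumes each pair is counted once regardless of order.

```python
from itertools import combinations


def system_pairs(systems):
    """Return every unordered pair of distinct systems."""
    return list(combinations(systems, 2))


# Hypothetical labels for the eight MT06 systems (illustration only).
systems = [f"sys{i}" for i in range(1, 9)]
pairs = system_pairs(systems)
pair_count = len(pairs)  # 8 * 7 / 2 unordered pairs
```

With eight systems this gives 28 distinct pairs to compare.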
The official MetricsMATR08 results on the test data for the submitted metrics are publicly available. NIST performed the same analyses on the MetricsMATR08 development data after the evaluation. These results are not publicly available, but will likely be available on request in the future by contacting firstname.lastname@example.org.
For an example of the data in this release, please examine these sample scores and judgments.