NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets

Item Name:	NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
Author(s):	NIST Multimodal Information Group
LDC Catalog No.:	LDC2013T07
ISBN:	1-58563-640-1
ISLRN:	112-444-010-598-0
DOI:	https://doi.org/10.35111/xh7k-8m27
Release Date:	April 15, 2013
Member Year(s):	2013
DCMI Type(s):	Text
Data Source(s):	web collection, newswire
Project(s):	NIST MT
Application(s):	machine translation
Language(s):	English, Mandarin Chinese, Arabic, Chinese
Language ID(s):	eng, cmn, ara, zho
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2013T07 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	NIST Multimodal Information Group. NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets LDC2013T07. Web Download. Philadelphia: Linguistic Data Consortium, 2013.
Related Works:
hasAnnotation
LDC2013T18 Semantic Textual Similarity (STS) 2013 Machine Translation
LDC2014T09 HyTER Networks of Selected OpenMT08/09 Sentences
LDC2014T12 Abstract Meaning Representation (AMR) Annotation Release 1.0
LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
LDC2020T02 Abstract Meaning Representation (AMR) Annotation Release 3.0
isSimilarWith
LDC2010T11 NIST 2003 Open Machine Translation (OpenMT) Evaluation
LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
LDC2013T03 NIST 2012 Open Machine Translation (OpenMT) Evaluation
LDC2014T02 NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source

Introduction

NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets was developed by the NIST Multimodal Information Group. This release contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English progress test sets for the NIST OpenMT 2008, 2009, and 2012 evaluations. The test data remained unseen between evaluations and was reused unchanged each time. NIST compiled the package and developed the scoring software, using Chinese and Arabic newswire and web data and reference translations collected and developed by the Linguistic Data Consortium (LDC).

The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues, and to be fully supported.

For more general information about the NIST OpenMT evaluations, please refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
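The word-sequence matching described above can be illustrated with a minimal Python sketch. This is not the code of mteval-v13a.pl: the function names, the whitespace tokenization, and the clipping rule (each word sequence is credited at most as often as it occurs in the reference where it is most frequent) are assumptions chosen for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hypothesis, references, n):
    """Count system-output n-grams that also occur in a reference translation.

    Tokenization is plain whitespace splitting (an assumption; the real
    scoring script applies its own tokenization). Each n-gram is credited
    at most as many times as it occurs in the reference where it is most
    frequent. Returns (matched, total) for the system output.
    """
    hyp_counts = ngrams(hypothesis.split(), n)

    # For every n-gram, remember its highest count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())
```

For example, `clipped_matches("the the cat", ["the cat"], 1)` credits "the" only once because the reference contains it once, so two of the three output words count as matched.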

LDC has released the following associated corpora:

Data

This release contains 2,748 segments in 257 documents, with corresponding source and reference files; the reference files contain four independent human reference translations of the source data. The source data consists of Arabic and Chinese newswire and web data collected by LDC in 2007. The table below gives the number of documents, segments, and source tokens by source language and genre.

Source	Genre	Documents	Segments	Source Tokens
Arabic	Newswire	84	784	20039
Arabic	Web Data	51	594	14793
Chinese	Newswire	82	688	26923
Chinese	Web Data	40	682	19112

The token counts for Chinese data are character counts, which were obtained by counting tokens matching the Unicode-based regular expression \w. The Python re module was used to obtain those counts.
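A character count of this kind can be reproduced roughly as follows. This is a sketch, assuming the expression is `\w` and that matching is Unicode-aware (the default for `str` patterns in Python 3); the function name is hypothetical, and the exact preprocessing used for the published counts is not documented here.

```python
import re

# \w matches Unicode word characters: letters (including CJK characters),
# digits, and the underscore. Punctuation and whitespace are not matched.
WORD_CHAR = re.compile(r"\w")


def count_word_chars(text):
    """Return the number of characters in text that match \\w."""
    return len(WORD_CHAR.findall(text))
```

Note that `\w` also matches Latin letters and digits, so a string such as "中国，新闻 2007。" counts four Chinese characters plus four digits, while the punctuation and the space are ignored.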

The data in this package are in XML format compliant with the included DTD.
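The sketch below shows one way to count documents and segments in such a file with Python's standard library. The element and attribute names used here (`mteval`, `srcset`, `doc`, `seg`, `docid`, `genre`) and the sample content are assumptions for illustration; the included DTD is the authoritative description of the format and should be checked before relying on any of these names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real element and attribute names are defined
# by the DTD shipped with the corpus.
SAMPLE = """<mteval>
  <srcset setid="example_set" srclang="Chinese">
    <doc docid="doc_001" genre="nw">
      <seg id="1">first segment</seg>
      <seg id="2">second segment</seg>
    </doc>
    <doc docid="doc_002" genre="wb">
      <seg id="1">third segment</seg>
    </doc>
  </srcset>
</mteval>"""


def count_docs_and_segments(xml_text):
    """Return (number of doc elements, number of seg elements inside them)."""
    root = ET.fromstring(xml_text)
    docs = root.findall(".//doc")
    segments = sum(len(doc.findall(".//seg")) for doc in docs)
    return len(docs), segments


print(count_docs_and_segments(SAMPLE))
```

Running the same function over each source file and summing the results is one way to cross-check the document and segment counts in the table above.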

Samples

Please consult the following source sample and translation sample.

Updates

None at this time.

Copyright

Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, Al-Quds Al-Arabi, Asharq Al-Awsat, An Nahar, Assabah, China Military Online, Chinanews.com, Guangming Daily, Xinhua News Agency, © 2007, 2013 Trustees of the University of Pennsylvania