HyTER Networks of Selected OpenMT08/09 Sentences

Item Name:	HyTER Networks of Selected OpenMT08/09 Sentences
Author(s):	Markus Dreyer, Daniel Marcu
LDC Catalog No.:	LDC2014T09
ISBN:	1-58563-678-9
ISLRN:	811-846-772-709-6
DOI:	https://doi.org/10.35111/ed7d-z579
Release Date:	May 15, 2014
Member Year(s):	2014
DCMI Type(s):	Text
Data Source(s):	weblogs, newswire
Project(s):	NIST MT
Application(s):	machine translation
Language(s):	English, Mandarin Chinese, Arabic, Chinese
Language ID(s):	eng, cmn, ara, zho
License(s):	LDC User Agreement for Non-Members
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Dreyer, Markus, and Daniel Marcu. HyTER Networks of Selected OpenMT08/09 Sentences LDC2014T09. Web Download. Philadelphia: Linguistic Data Consortium, 2014.
Related Works:	isAnnotationOf LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets

Introduction

HyTER Networks of Selected OpenMT08/09 Progress Set Sentences was developed by SDL and contains HyTER (Hybrid Translation Edit Rate) networks for 102 selected Arabic and Chinese source sentences from the OpenMT08 and OpenMT09 Progress Set data. HyTER is an evaluation metric based on large reference networks. These networks are created with an annotation tool that lets users specify an exponential number of correct translations for a given sentence. Reference networks can serve as a foundation for developing improved machine translation evaluation metrics and for automating the evaluation of human translation efficiency.
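
To illustrate how a small network can stand for many translations, and how an edit rate against such a network might be computed, here is a minimal Python sketch. It is not the HyTER implementation: the toy network, its phrases, and the function names are all hypothetical, and the brute-force enumeration stands in for the finite-state methods a real system would need for large networks.

```python
from itertools import product
from math import prod

# Hypothetical toy network: each slot lists interchangeable phrases.
# Choosing one phrase per slot yields one reference translation.
TOY_NETWORK = [
    ["the talks", "the negotiations", "the discussions"],
    ["resumed", "restarted", "began again"],
    ["on monday", "monday"],
]

def count_paths(network):
    """Number of encoded translations: the product of the slot sizes."""
    return prod(len(slot) for slot in network)

def expand(network):
    """Yield every encoded translation as a list of words."""
    for choice in product(*network):
        yield " ".join(choice).split()

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def closest_reference(hypothesis, network):
    """Return (distance, reference) for the nearest encoded translation."""
    hyp = hypothesis.split()
    best = min(expand(network), key=lambda ref: edit_distance(hyp, ref))
    return edit_distance(hyp, best), " ".join(best)
```

With three slots of sizes 3, 3, and 2, the toy network encodes 18 translations, and each added slot multiplies that count. Calling `closest_reference("the talks started on monday", TOY_NETWORK)` finds a reference one word substitution away.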

Data

The source material consists of Arabic and Chinese newswire and web data collected by LDC in 2007. Annotators created meaning-equivalent annotations under three annotation protocols. In the first protocol, native speakers of the foreign language built English networks starting from the foreign language sentences. In the second, native English speakers built English networks starting from the best translation of a foreign language sentence, as identified by NIST (National Institute of Standards and Technology). In the third, native English speakers again started from the best translation but also had access to three additional, independently produced human translations. The networks created by different annotators for each sentence were combined and evaluated.
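
The text above does not say how the annotators' networks were combined. One plausible reading, assumed here purely for illustration, is a union: a translation is accepted if any annotator's network encodes it. The sketch below shows that idea in Python with two hypothetical toy networks; the data and function names are invented for the example.

```python
from itertools import product

# Hypothetical networks from two annotators for the same sentence.
# Each slot lists interchangeable phrases.
ANNOTATOR_A = [["the talks", "the negotiations"], ["resumed", "restarted"]]
ANNOTATOR_B = [["the talks", "negotiations"], ["resumed", "started again"]]

def expand(network):
    """Return the set of translations encoded by one network."""
    return {" ".join(choice) for choice in product(*network)}

def combine(*networks):
    """Union of the translations encoded by several networks (assumed rule)."""
    combined = set()
    for network in networks:
        combined |= expand(network)
    return combined
```

In this toy case each annotator contributes four translations, one of which ("the talks resumed") is shared, so `combine(ANNOTATOR_A, ANNOTATOR_B)` holds seven distinct translations.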

This release includes the source sentences and four human reference translations produced by LDC, in XML format. It also includes the outputs of five machine translation systems representing a variety of architectures and performance levels, along with human post-edited versions of those outputs, also presented in XML.

Samples

Please view this FST sample and Reference XML sample.

Updates

None at this time.

Copyright

Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, Al-Quds Al-Arabi, Asharq Al-Awsat, An Nahar, Assabah, China Military Online, Chinanews.com, Guangming Daily, Xinhua News Agency, © 2007, 2013, 2014 Trustees of the University of Pennsylvania