MultiTACRED

Item Name: MultiTACRED
Author(s): Leonhard Hennig, Philippe Thomas, Sebastian Möller
LDC Catalog No.: LDC2024T09
ISLRN: 754-937-284-790-9
DOI: https://doi.org/10.35111/hcnt-7g66
Release Date: October 15, 2024
Member Year(s): 2024
DCMI Type(s): Text
Data Source(s): newswire, web collection
Project(s): TAC
Application(s): machine translation, relation extraction
Language(s): English, Arabic, German, Spanish, Finnish, French, Hindi, Hungarian, Japanese, Polish, Russian, Turkish, Mandarin Chinese
Language ID(s): eng, ara, deu, spa, fin, fra, hin, hun, jpn, pol, rus, tur, cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2024T09 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Hennig, Leonhard, Philippe Thomas, and Sebastian Möller. MultiTACRED LDC2024T09. Web Download. Philadelphia: Linguistic Data Consortium, 2024.

Introduction

MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab. It is a machine translation of the TAC Relation Extraction Dataset (TACRED, LDC2018T24) into twelve languages, with entity annotations projected onto the translated text. TACRED is a large-scale relation extraction dataset containing 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations from 2009 through 2014. The training and evaluation data for the TAC KBP slot filling tasks were developed by the Linguistic Data Consortium.

Data

The TACRED training, development and test splits were translated into Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish using DeepL or Google Translate. The test split was also back-translated into English to generate machine-translated English test data.

TACRED annotations are specified by token offsets. For translation, tokens were concatenated with white space, and the entity offsets were converted into XML-style markers denoting the relation arguments.
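
The marker conversion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name mark_arguments, the tag names H and T, and the use of inclusive end offsets are assumptions made for illustration; consult the corpus documentation for the markers and offset convention actually used.

```python
def mark_arguments(tokens, subj_start, subj_end, obj_start, obj_end,
                   subj_tag="H", obj_tag="T"):
    """Join tokens with single spaces and wrap the two argument spans in tags.

    Assumptions (not confirmed by the catalog entry): offsets are token
    indices with inclusive ends, the two spans do not overlap, and the tag
    names "H" and "T" are placeholders.
    """
    pieces = []
    for i, tok in enumerate(tokens):
        prefix = ""
        suffix = ""
        # Open a marker on the first token of each argument span.
        if i == subj_start:
            prefix += f"<{subj_tag}>"
        if i == obj_start:
            prefix += f"<{obj_tag}>"
        # Close the marker on the last token of each argument span.
        if i == subj_end:
            suffix += f"</{subj_tag}>"
        if i == obj_end:
            suffix += f"</{obj_tag}>"
        pieces.append(prefix + tok + suffix)
    # Tokens are concatenated with white space, as the description states.
    return " ".join(pieces)


tokens = ["Bill", "Gates", "founded", "Microsoft", "."]
marked = mark_arguments(tokens, 0, 1, 3, 3)
print(marked)  # <H>Bill Gates</H> founded <T>Microsoft</T> .
```

Because the markers travel with the text through translation, the argument spans can be recovered in the target language by locating the tags in the translated string.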

Data is presented in JSON format encoded in UTF-8.
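
A file in this format can be read with Python's standard json module, as in the hedged sketch below. It assumes that each file holds a single JSON array of example objects and that each example carries a "relation" field, as in the original TACRED release. Both points, along with the helper names load_examples and relation_counts, are assumptions to be checked against the corpus documentation.

```python
import json
from collections import Counter


def load_examples(path):
    """Read one split file, decoding it explicitly as UTF-8.

    Assumes the file contains a single JSON array of example objects.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def relation_counts(examples, key="relation"):
    """Count relation labels; the field name "relation" is an assumption."""
    return Counter(ex[key] for ex in examples if key in ex)
```

Passing encoding="utf-8" explicitly avoids depending on the platform default, which matters for non-Latin scripts such as Arabic, Hindi, Japanese, and Chinese.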

Updates

None at this time.
