Prague Czech-English Dependency Treebank 1.0

Item Name: Prague Czech-English Dependency Treebank 1.0
Author(s): Jan Curin, Martin Cmejrek, Jiří Havelka, Jan Hajič, Vladislav Kubon, Zdeněk Žabokrtský
LDC Catalog No.: LDC2004T25
ISBN: 1-58563-321-6
ISLRN: 557-838-231-104-8
DOI: https://doi.org/10.35111/yn25-st18
Release Date: November 19, 2004
Member Year(s): 2004
DCMI Type(s): Text
Data Source(s): dictionaries, newswire
Application(s): information extraction, information retrieval, language modeling, language teaching, machine translation, parsing, tagging
Language(s): Czech, English
Language ID(s): ces, eng
License(s): Prague Czech-English Dependency Treebank 1.0
Online Documentation: LDC2004T25 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Curin, Jan, et al. Prague Czech-English Dependency Treebank 1.0 LDC2004T25. Web Download. Philadelphia: Linguistic Data Consortium, 2004.

Introduction

Prague Czech-English Dependency Treebank (PCEDT) 1.0 was produced by the Linguistic Data Consortium (LDC) and contains 74,600 parallel sentences in Czech and English, 21,600 of which are morphologically annotated and parsed into dependency structures. It also includes a large monolingual corpus of Czech with 2.4 million sentences and three dictionaries for translation between Czech and English. This corpus was developed at the Center for Computational Linguistics in cooperation with the Institute of Formal and Applied Linguistics.

PCEDT 1.0 is a corpus of Czech-English parallel resources suitable for experiments in machine translation, with special emphasis on dependency-based (structural) translation. Evaluation data are provided for Czech-to-English systems.

Data

The core part of PCEDT 1.0 is a Czech translation of 21,600 English sentences from the Wall Street Journal, which are part of Treebank-3 (LDC99T42). The sentences of the Czech translation were automatically morphologically annotated and parsed into two levels of dependency structure, analytical and tectogrammatical. Both levels were introduced in the theory of Functional Generative Description and are closely related to Prague Dependency Treebank 1.0 (LDC2001T10). The original English sentences were transformed from the Penn Treebank phrase-structure trees into dependency representations. A held-out (development and evaluation) set of 515 sentence pairs was selected and manually annotated at the tectogrammatical level in both Czech and English; for the purposes of quantitative evaluation, this set has been retranslated from Czech into English by four different translation companies.
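
To illustrate what transforming a phrase-structure tree into a dependency representation can involve, the following is a minimal sketch of generic head-percolation conversion. It is not the procedure used to build PCEDT 1.0, which this description does not specify. The head rules, the tree encoding, and the function names are all assumptions made for the example.

```python
# Hypothetical toy head rules: for each phrase label, the child labels
# to prefer as head, in priority order. Not taken from PCEDT.
HEAD_RULES = {
    "S": ["VP", "NP"],
    "VP": ["VBD", "VBZ", "VB", "VP"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
}


def head_child(label, children):
    """Return the index of the head child, falling back to the last child."""
    for wanted in HEAD_RULES.get(label, []):
        for i, child in enumerate(children):
            if child[0] == wanted:
                return i
    return len(children) - 1


def to_dependencies(tree, edges):
    """Return the lexical head of `tree`, appending (dependent, head) pairs.

    A tree is (label, word) for a leaf or (label, [subtrees]) for a phrase.
    Words double as node identifiers here, so the sketch assumes that no
    word occurs twice in the sentence.
    """
    label, body = tree
    if isinstance(body, str):
        return body
    heads = [to_dependencies(child, edges) for child in body]
    h = head_child(label, body)
    for i, word in enumerate(heads):
        if i != h:
            # Every non-head child attaches to the head of its phrase.
            edges.append((word, heads[h]))
    return heads[h]


tree = ("S", [
    ("NP", [("DT", "the"), ("NN", "company")]),
    ("VP", [("VBD", "reported"), ("NP", [("NNS", "profits")])]),
])
edges = []
root = to_dependencies(tree, edges)
print(root, edges)
```

In this toy sentence the verb becomes the root and each remaining word attaches to the lexical head of the phrase that contains it.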

PCEDT 1.0 also contains a parallel Czech-English corpus of plain text from Reader's Digest 1993-1996 consisting of 53,000 parallel sentences, and a large monolingual corpus of Czech (2.4 million sentences). The included Czech-English translation dictionary consists of 46,150 translation pairs in its lemmatized version and 496,673 pairs of word forms, where for each entry-translation pair all corresponding word form pairs have been generated. Also included is an English-Czech dictionary provided by Milan Svoboda under the GNU/FDL license; this dictionary contains multi-word translations in 115,929 translation pairs.
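
As a rough illustration of how a lemmatized dictionary can be expanded into word-form pairs, the sketch below takes the simplest reading of the description above: each lemma pair yields the full cross product of its Czech and English forms. The mini-lexicons and the function name are hypothetical, and the actual generation may have restricted pairs to morphologically corresponding forms.

```python
from itertools import product

# Hypothetical mini-lexicons mapping a lemma to some of its word forms.
# These entries are illustrative and are not taken from PCEDT.
CZECH_FORMS = {"pes": ["pes", "psa", "psu"]}
ENGLISH_FORMS = {"dog": ["dog", "dogs"]}


def expand_pairs(lemma_pairs, src_forms, tgt_forms):
    """Generate every (source form, target form) pair for each lemma pair."""
    pairs = []
    for src, tgt in lemma_pairs:
        # Fall back to the lemma itself when no forms are listed for it.
        pairs.extend(product(src_forms.get(src, [src]),
                             tgt_forms.get(tgt, [tgt])))
    return pairs


form_pairs = expand_pairs([("pes", "dog")], CZECH_FORMS, ENGLISH_FORMS)
print(len(form_pairs), form_pairs)
```

Because one lemma pair fans out into several form pairs, an expansion of this kind is consistent with the word-form dictionary being roughly ten times larger than the lemmatized one.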

Prague Czech-English Dependency Treebank 2.0 (LDC2012T08) translates the whole Wall Street Journal part of the Penn Treebank. Please consult the PCEDT 2.0 website for more information and documentation.

Samples

For an example of the data in this corpus, please view this sample (TXT).

Sponsorship

PCEDT 1.0 has been supported by the following grants and projects:

Updates

None at this time.
