Prague Czech-English Dependency Treebank 2.0

Item Name:	Prague Czech-English Dependency Treebank 2.0
Author(s):	Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, Zdeněk Žabokrtský
LDC Catalog No.:	LDC2012T08
ISBN:	1-58563-616-9
ISLRN:	443-974-834-414-7
DOI:	https://doi.org/10.35111/mv82-j246
Release Date:	June 15, 2012
Member Year(s):	2012
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	tagging, parsing, machine translation, language teaching, language modeling, information retrieval, information extraction
Language(s):	English, Czech
Language ID(s):	eng, ces
License(s):	LDC User Agreement for Non-Members
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Hajič, Jan, et al. Prague Czech-English Dependency Treebank 2.0 LDC2012T08. Web Download. Philadelphia: Linguistic Data Consortium, 2012.
Related Works:	isVersionOf LDC2004T25 Prague Czech-English Dependency Treebank 1.0; isAnnotationOf LDC99T42 Treebank-3; isSimilarWith LDC2004T23 Prague Arabic Dependency Treebank 1.0, LDC2006T01 Prague Dependency Treebank 2.0

Introduction

Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the Institute of Formal and Applied Linguistics at Charles University in Prague, Czech Republic. It is a corpus of Czech-English parallel resources translated, aligned and manually annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution. This release updates Prague Czech-English Dependency Treebank 1.0 (LDC2004T25) by adding English newswire texts so that it now contains over two million words in close to 100,000 sentences.

Data

The principal new material in PCEDT 2.0 is the complete Wall Street Journal data from Treebank-3 (LDC99T42). The Reader's Digest material, the Czech monolingual corpus, and the English-Czech dictionary from PCEDT 1.0 are not included.

Each section is enhanced with a comprehensive manual linguistic annotation in the Prague Dependency Treebank style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:

dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
semantic labeling of content words and types of coordinating structures
argument structure, including an argument structure (valency) lexicon for both languages
ellipsis and anaphora resolution

This annotation style is called tectogrammatical annotation, and it constitutes the tectogrammatical layer in the corpus.
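
To make the listed features more concrete, the following is a minimal Python sketch of how a tectogrammatical tree could be modeled in memory. The class `TNode`, its field names, and the helper `functors` are hypothetical and chosen only for illustration; they are not the PCEDT file format or any official API. The labels `PRED`, `ACT`, and `PAT` are functors from the Prague annotation tradition, and the analysis of the example sentence is an assumption made for this sketch, not taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TNode:
    """Hypothetical in-memory node of a tectogrammatical tree (illustration only)."""
    lemma: str                                  # content word; function words get no node
    functor: str                                # semantic label, e.g. PRED, ACT, PAT
    attributes: Dict[str, str] = field(default_factory=dict)  # function words as attribute values
    children: List["TNode"] = field(default_factory=list)     # dependents of this node
    generated: bool = False                     # True for a node restored from ellipsis
    antecedent: Optional["TNode"] = None        # coreference link, if resolved

    def add(self, child: "TNode") -> "TNode":
        """Attach a dependent and return it, so trees can be built step by step."""
        self.children.append(child)
        return child


def functors(node: TNode) -> List[str]:
    """Return the functors of the subtree in depth-first order."""
    result = [node.functor]
    for child in node.children:
        result.extend(functors(child))
    return result


# Assumed analysis of "The company wants to expand."
want = TNode("want", "PRED")
company = want.add(TNode("company", "ACT", {"determiner": "the"}))
expand = want.add(TNode("expand", "PAT", {"infinitive_marker": "to"}))
# The unexpressed subject of "expand" is restored as a generated node
# that points back to "company".
cor = expand.add(TNode("#Cor", "ACT", generated=True, antecedent=company))
```

In this sketch the function words "the" and "to" appear only as attribute values, the functors carry the semantic labeling, and the generated node with its `antecedent` link stands in for ellipsis and anaphora resolution.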

Please consult the PCEDT website for more information and documentation.

Samples

Please follow this link for a sample of the data included.

Updates

None at this time.
