
Prague Dependency Treebank 2.0

Item Name:	Prague Dependency Treebank 2.0
Author(s):	Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová, Zdeněk Žabokrtský, Magda Ševčíková-Razímová, Zdeňka Urešová
LDC Catalog No.:	LDC2006T01
ISBN:	1-58563-370-4
ISLRN:	942-053-729-014-3
DOI:	https://doi.org/10.35111/e6p0-9s32
Release Date:	July 21, 2006
Member Year(s):	2006
DCMI Type(s):	Text
Data Source(s):	journal articles, news magazine, newswire
Application(s):	information extraction, information retrieval, language modeling, language teaching, parsing, tagging
Language(s):	Czech
Language ID(s):	ces
License(s):	Prague Dependency Treebank 2.0
Online Documentation:	LDC2006T01 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Hajič, Jan, et al. Prague Dependency Treebank 2.0 LDC2006T01. Web Download. Philadelphia: Linguistic Data Consortium, 2006.
Related Works:
isVersionOf:
LDC2001T10 Prague Dependency Treebank 1.0
hasAnnotation:
LDC2012T03 2009 CoNLL Shared Task Part 1
LDC2016T10 SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
LDC2018T06 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish
isSimilarWith:
LDC2004T23 Prague Arabic Dependency Treebank 1.0
LDC2004T25 Prague Czech-English Dependency Treebank 1.0
LDC2008T22 Czech Academic Corpus 2.0
LDC2012T08 Prague Czech-English Dependency Treebank 2.0

Introduction

The Prague Dependency Treebank 2.0 (PDT 2.0) was developed by Charles University and contains approximately 2 million words of Czech text with interlinked morphological, syntactic, and complex semantic annotation. In addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.

PDT 2.0 follows Prague Dependency Treebank 1.0 (LDC2001T10) and is based on the long-standing Praguian linguistic tradition, adapted to the needs of current computational linguistics research. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation, and language analysis are included, along with extensive documentation in English.

Data

The data in this corpus comes from four sources:

Lidové Noviny (daily newspapers), 1991, 1994, 1995
Mladá Fronta Dnes (daily newspapers), 1992
Českomoravský Profit (business weekly), 1994
Vesmír (scientific journal), 1992, 1993

The texts in electronic form have been provided by the Institute of the Czech National Corpus.

The data in PDT 2.0 are annotated on three layers: the morphological layer (m-layer), the analytical layer (a-layer), and the tectogrammatical layer (t-layer). The following table shows the amount of data, in K-words (thousands of words), by annotation layer and source. The layers are cumulative: everything annotated at the a-layer is also annotated at the m-layer, and everything annotated at the t-layer is also annotated at the other two layers.

Layer	Lidové Noviny	Mladá Fronta Dnes	Českomoravský Profit	Vesmír	Total
m-layer	1,235	373	171	178	1,957
a-layer	920	234	171	178	1,504
t-layer	640	119	74	0	833
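
The cumulative relationship between the layers can be checked directly against the table. The following is a minimal Python sketch, not part of the PDT 2.0 distribution: the names `SOURCES`, `KWORDS`, `is_nested`, and `share_of_m_layer` are illustrative, and the figures are copied from the table above. Note that the per-source a-layer figures sum to 1,503 rather than the printed total of 1,504, presumably because each cell is rounded to the nearest thousand words.

```python
# K-word counts copied from the table above, in column order.
SOURCES = ["Lidové Noviny", "Mladá Fronta Dnes", "Českomoravský Profit", "Vesmír"]
KWORDS = {
    "m-layer": [1235, 373, 171, 178],
    "a-layer": [920, 234, 171, 178],
    "t-layer": [640, 119, 74, 0],
}


def is_nested(counts):
    """Return True if, for every source, t-layer <= a-layer <= m-layer."""
    return all(
        t <= a <= m
        for m, a, t in zip(counts["m-layer"], counts["a-layer"], counts["t-layer"])
    )


def share_of_m_layer(counts, layer):
    """Fraction of the m-layer data that also carries `layer` annotation."""
    return sum(counts[layer]) / sum(counts["m-layer"])


if __name__ == "__main__":
    print("layers nested:", is_nested(KWORDS))
    for layer in ("a-layer", "t-layer"):
        print(f"{layer}: {share_of_m_layer(KWORDS, layer):.1%} of m-layer data")
```

By these figures, roughly three quarters of the morphologically annotated text also has analytical annotation, and somewhat under half has tectogrammatical annotation.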

The primary data format for PDT 2.0 is an XML-based format called PML. An SGML-based format called CSTS was the primary format of PDT 1.0; it is now used only as an intermediate format in older NLP tools (such as taggers and parsers).

As usual, the data are divided into three groups: the training data, the development test data, and the evaluation test data. The training data cover approximately 80% of the whole data set, the development test data 10%, and the evaluation test data 10% (these proportions hold for all three layers of annotation).
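
To give a rough sense of scale, the stated proportions can be applied to the layer totals from the table above. This is only an approximation sketched in Python: the proportions are given as approximate, the actual partition boundaries are defined in the corpus itself, and the names `TOTAL_KWORDS`, `SPLIT`, and `approximate_split` are illustrative.

```python
# Layer totals in K-words, copied from the "Total" column of the table.
TOTAL_KWORDS = {"m-layer": 1957, "a-layer": 1504, "t-layer": 833}

# Approximate proportions stated in the text; the real split may differ slightly.
SPLIT = {"training": 0.8, "development test": 0.1, "evaluation test": 0.1}


def approximate_split(total_kwords, split=SPLIT):
    """Estimate the size of each partition in K-words, rounded to an integer."""
    return {name: round(total_kwords * frac) for name, frac in split.items()}


if __name__ == "__main__":
    for layer, total in TOTAL_KWORDS.items():
        print(layer, approximate_split(total))
```

Because each estimate is rounded independently, the three partition sizes for a layer need not sum exactly to the layer total.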

Samples

For an example of the data in this corpus, please view these samples.

Updates

None at this time.

Copyright

Portions © 1991, 1994, 1995 Lidové noviny daily newspapers, © 1992 Mladá fronta Dnes daily newspapers, © 1994 Českomoravský Profit business weekly, © 1992-1993 Vesmír scientific magazine, Academia Publishers, © 1996-2005 Institute of Formal and Applied Linguistics and Center for Computational Linguistics, Faculty of Mathematics and Physics, Charles University, © 2006 Trustees of the University of Pennsylvania