Prague Czech-English Dependency Treebank

version 1.0

Overview

The Prague Czech-English Dependency Treebank (PCEDT) is a corpus of Czech-English parallel resources suitable for experiments in structural machine translation. PCEDT was developed at the Center for Computational Linguistics and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, with the support of the project MSMT LN00A063 and NSF Grant no. IIS 0121285.

The core part of the PCEDT is a Czech translation of 21,600 English sentences from the Wall Street Journal part of the Penn Treebank 3 corpus (PTB, released by LDC in 1999). Sentences of the Czech translation were automatically morphologically annotated and parsed into two levels (analytical and tectogrammatical) of dependency structures introduced in the theory of Functional Generative Description and closely related to the Prague Dependency Treebank (PDT) project. The original English sentences were transformed from the Penn Treebank phrase-structure trees into dependency representations. A held-out (development and evaluation) set of 515 sentence pairs was selected and manually annotated on the tectogrammatical level in both Czech and English; for the purposes of quantitative evaluation, this set was retranslated from Czech into English by 4 different translation companies.

PCEDT also comprises a parallel Czech-English corpus of plain text from Reader's Digest 1993-1996 consisting of 53,000 parallel sentences, and a large monolingual corpus of Czech (2.4 M sentences).

Also included is a probabilistic Czech-English translation dictionary, which consists of 46,150 word-translation pairs of base forms.

Motivation

In the past, the efforts of Czech computational linguists concentrated on creating large-scale monolingual corpora, such as the Czech National Corpus (100 million words annotated on the morphological level) and the Prague Dependency Treebank (PDT). The PDT is annotated on three levels: the morphological layer (lowest), the analytical layer (middle), which provides surface syntactic annotation, and the tectogrammatical layer (highest), which captures linguistic meaning. Dependency trees, representing the sentence structure as concentrated around the verb and its valency, are used for the analytical and tectogrammatical layers of PDT.

When starting the PCEDT project, we considered two possible strategies: either the parallel annotation of already existing parallel texts, or the translation and annotation of an existing syntactically annotated corpus.

Until now, the main parallel Czech-English resource has been the Reader's Digest corpus. It contains extremely free translations, which have proved "difficult" in several machine-learning experiments (Al-Onaizan et al., 1999). Therefore, we decided on the human translation of an existing monolingual syntactically annotated corpus and its subsequent syntactic annotation. This allows us to better control the translation quality and reliability, and also reduces the necessary annotation effort.

The Wall Street Journal section of the Penn Treebank and the Prague Dependency Treebank are corpora comparable in size (about 1 million words), and they both contain syntactically annotated newspaper texts. The choice of the Penn Treebank as the source corpus was pragmatically motivated: firstly it is a widely recognized linguistic resource, and secondly the translators were native speakers of Czech, capable of high quality translation into their native language. The translators were asked to translate each English sentence as a single Czech sentence and to avoid unnecessary stylistic changes of translated sentences.

Since Czech is a language with a relatively high degree of word-order freedom, its sentences contain certain syntactic phenomena, such as discontinuous constituents, which cannot be handled straightforwardly by the annotation scheme of the Penn Treebank, based on phrase-structure trees. Therefore, we decided to adopt the dependency-based annotation scheme of PDT for the PCEDT.

While the morphological annotation of the English part is simply taken over from the Penn Treebank, the analytical and tectogrammatical markups of the English part of the corpus are obtained by two independent procedures transforming the phrase-structure trees into dependency ones.


Data

Czech-English Penn Treebank Corpus

Original Penn Treebank data

The CD contains a copy of the original (English) data from Penn Treebank that were selected for translation. A unique ID was assigned to each sentence.

Czech-English Raw Text Data

The original English data from the Penn Treebank were transformed into raw text, with each sentence starting with a unique id, and given to human translators as the source. The resulting Czech translations are in raw text format (character set ISO-8859-2); each sentence starts with a unique id that can be derived from the original id of the English sentence.

Development and Evaluation Test Set

We selected a test set of 515 sentences for development and evaluation. For the purposes of quantitative evaluation of machine translation, we also had them retranslated from Czech into English by 4 different translation companies.

Automatic conversions of Penn Treebank annotation into dependency structure

Except for the test part of the data, which is tectogrammatically annotated by human annotators, the Czech part of the corpus is annotated by automatic means starting from the raw data, while the tools for automatic markup of the English part make use of the existing annotation of the Penn Treebank corpus.

Preprocessing of Penn Treebank

The phrase-tree topology is transformed into a dependency one by a general recursive algorithm. The concept of the head of a phrase is central to this transformation; we used Jason Eisner's scripts for marking the head constituent in each phrase.

Lemmatization (assigning base forms) is necessary in almost all experiments with morphologically rich languages such as Czech. For English, the task is less important; moreover, it is substantially simplified by the fact that the Penn Treebank data contain manually assigned POS tags. The lemmatization procedure simply searches a list of triples of word form, POS tag and lemma. The list of 910,216 triples was obtained by running the MXPOST tagger and the morpha lemmatization tool on a large corpus of English (365M words, 13M sentences). The procedure makes two attempts to find a lemma: first, it looks for a triple matching the word form and its (manually assigned) POS tag; if that fails, it makes a second attempt with the word form converted to lowercase. If both attempts fail, the word form itself is chosen as the lemma.
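The two-attempt lookup described above can be sketched in Python as follows. This is only an illustration, not the script used for the corpus: the function name, the dictionary keyed by (word form, POS tag), and the toy entries are assumptions, and the sketch assumes that each (word form, POS tag) pair maps to a single lemma.

```python
from typing import Dict, Tuple

# Assumed representation: (word form, POS tag) -> lemma.
LemmaTable = Dict[Tuple[str, str], str]


def lemmatize(form: str, pos: str, table: LemmaTable) -> str:
    """Return a lemma using two lookups, falling back to the word form."""
    # First attempt: the word form as written, with its POS tag.
    lemma = table.get((form, pos))
    if lemma is not None:
        return lemma
    # Second attempt: the word form converted to lowercase.
    lemma = table.get((form.lower(), pos))
    if lemma is not None:
        return lemma
    # Both attempts failed: the word form itself becomes the lemma.
    return form


# Toy table with invented entries, for illustration only.
toy_table: LemmaTable = {
    ("shares", "NNS"): "share",
    ("was", "VBD"): "be",
}

print(lemmatize("shares", "NNS", toy_table))  # found on the first attempt
print(lemmatize("Shares", "NNS", toy_table))  # found after lowercasing
print(lemmatize("IBM", "NNP", toy_table))     # not found, form is kept
```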

For technical reasons, a unique identifier is assigned to each token.

Annotation of English - Analytical Representation

The structural transformation works as described above. Because the handling of coordination in PDT differs from both the Penn Treebank annotation style and the output of Jason Eisner's head-assigning scripts, in a phrase containing a coordinating conjunction (CC) we consider the rightmost CC to be the head. The treatment of apposition is more difficult, since the Penn Treebank has no explicit annotation of this phenomenon; constituents of a noun phrase separated by commas (and not containing a CC) are considered to be in apposition, and the rightmost comma is the head. Both the information from the phrase tree and the structure of the dependency tree are used for analytical function assignment. This step also handles the differences between the PDT and Penn Treebank annotation schemes, mainly the markup of coordinations, appositions, and prepositional phrases.
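The head choice for coordination and apposition can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the conversion code itself: a phrase is reduced to its label and the list of its children's tags, the comma is assumed to carry the Penn Treebank tag ",", and the function name is hypothetical. When neither rule applies, the sketch returns None, meaning the head marked by the head-assigning scripts would be used.

```python
from typing import List, Optional


def special_head_index(phrase_label: str, child_tags: List[str]) -> Optional[int]:
    """Return the index of the child chosen as head by the two special rules.

    Returns None when neither rule applies (assumption: the caller then
    falls back to the head marked by the head-assigning scripts).
    """
    # Coordination: the rightmost coordinating conjunction is the head.
    cc_positions = [i for i, tag in enumerate(child_tags) if tag == "CC"]
    if cc_positions:
        return cc_positions[-1]
    # Apposition: a noun phrase whose constituents are separated by commas
    # (and which contains no CC) is headed by its rightmost comma.
    # Function tags such as "-SBJ" are stripped from the label first.
    if phrase_label.split("-")[0] == "NP":
        comma_positions = [i for i, tag in enumerate(child_tags) if tag == ","]
        if comma_positions:
            return comma_positions[-1]
    return None


print(special_head_index("NP", ["NNP", "CC", "NNP"]))    # coordination
print(special_head_index("NP-SBJ", ["NP", ",", "NP", ","]))  # apposition
print(special_head_index("VP", ["VBD", "NP"]))           # no special rule
```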

Annotation of English - Tectogrammatical Representation

The transformation of Penn Treebank phrase trees into tectogrammatical representation consists of a structural transformation and the assignment of a tectogrammatical functor and a set of grammatemes to each node of the resulting tree.

At the beginning of the structural transformation, the initial dependency tree is created by the general transformation procedure described above. However, the notion of head differs between phrasal grammar and the guidelines for tectogrammatical annotation; for example, the head of a prepositional phrase is not the preposition. In the next step, nodes corresponding to functional (synsemantic) words, such as prepositions, punctuation marks, determiners, subordinating conjunctions, certain particles, and auxiliary and modal verbs, are marked as "hidden", and information about them is stored in their governing nodes. The well-formedness of a tectogrammatical tree structure requires the valency frames to be complete: apart from nodes that are realized on the surface, there are several types of "restored" nodes representing the non-realized members of valency frames (cf. the pro-drop property of Czech and verbal condensations using gerunds and infinitives in both Czech and English). For a partial reconstruction of such nodes, we can use traces, which allow us to establish coreferential links or general participants of the valency frames.

For the assignment of tectogrammatical functors, we can use rules taking into consideration POS tags (e.g. PRP->APP), function tags (JJ->RSTR, JJR->CPR, etc.) and lemma ("not"->RHEM, "both"->RSTR).
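As a rough illustration of such rules, the Python sketch below encodes only the example mappings quoted above. The table names, the function name, and the order in which the rules are tried (lemma rules before tag rules) are assumptions made for this sketch; the actual rule set and its precedence are not specified here.

```python
from typing import Optional

# Example mappings taken from the text above; the real rule set is larger.
TAG_RULES = {"PRP": "APP", "JJ": "RSTR", "JJR": "CPR"}
LEMMA_RULES = {"not": "RHEM", "both": "RSTR"}


def assign_functor(lemma: str, tag: str) -> Optional[str]:
    """Return a functor for a node, or None if no example rule applies."""
    # Assumption: lemma-specific rules are tried before tag-based rules.
    if lemma in LEMMA_RULES:
        return LEMMA_RULES[lemma]
    return TAG_RULES.get(tag)


print(assign_functor("not", "RB"))     # lemma rule
print(assign_functor("higher", "JJR"))  # tag rule
print(assign_functor("company", "NN"))  # no rule among the examples
```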

Grammateme Assignment - morphological (e.g. Tense, Degree of Comparison) and syntactic grammatemes (e.g. TWHEN_AFT(er)) are assigned to each node of the tectogrammatical tree. The assignment of the morphological attributes is based on Penn Treebank tags and reflects basic morphological properties of the language. The syntactic grammatemes capture more specific information about deep syntactic structure. At the moment, there are no automatic tools for the assignment of the latter ones.

The whole procedure is described in detail in Kučerová and Žabokrtský (2002).

Measured against manually annotated trees, this transformation yields about 6% wrongly attached dependencies and 18% wrongly assigned functors.

Automatic annotation of Czech

Morphological tagging of Czech

The Czech translations of the Penn Treebank were automatically tokenized and morphologically tagged, and each word form was assigned a base form (lemma), using the tagging tools of Hajič and Hladká (1998).

Analytical parsing of Czech

Czech analytical parsing consists of a statistical dependency parser for Czech, either the Collins parser (Hajič et al., 1998) or the Charniak parser (Charniak, 1999), both adapted to dependency grammar, and a module for automatic analytical functor assignment (Žabokrtský et al., 2002). The output of both the Collins and the Charniak parser is included. For efficiency reasons, sentences longer than 60 words were excluded from the corpus parsed by the Collins parser.

Transition to tectogrammatical representation of Czech

When building the tectogrammatical structure, the analytical tree structure is converted into the tectogrammatical one. These transformations are described by linguistic rules (Böhmová, 2001). Then, tectogrammatical functors are assigned by a C4.5 classifier (Žabokrtský et al., 2002).


Manual Tectogrammatical Annotation of Czech and English

Since there are no guidelines for tectogrammatical annotation of English yet, and in order to acquire some initial experience before the work on the guidelines begins, a "gold standard" tectogrammatical annotation of more than 1,000 sentences has been produced. These data are assigned morphological grammatemes (the full set of values) and syntactic grammatemes, and the nodes are reordered according to topic-focus articulation (information structure). The manually annotated sentences include the whole development and evaluation test set. The Czech counterpart of the test set has also been manually annotated according to the guidelines for tectogrammatical annotation of Czech.


Reader's Digest Corpus

This corpus contains parallel raw text of 450 articles from the Reader's Digest, years 1993-1996. The Czech part is a translation of the English one. Sentence pairs were aligned automatically by Dan Melamed's SIMR/GMA tool. Since the translations in this corpus are relatively free, only 43,969 of 54,091 aligned segments (about 81%) contain 1-to-1 sentence alignments.

Czech Monolingual Corpus

The electronic text sources were provided by the Institute of the Czech National Corpus. All data come from news articles published in the daily newspaper Lidové Noviny in 1994-1995. The internal format of the data corresponds to the CSTS format. The corpus totals more than 39M tokens (words proper + punctuation) in about 2,385K sentences.

Dictionaries

Czech-English Probabilistic Dictionary

This dictionary was compiled from translations of lists of words extracted from Czech and English monolingual frequency dictionaries of base forms. For the translation of the word lists we used three different Czech-English manual dictionaries: two of them were available on the Web (WinGED and GNU/FDL) and one was extracted from the Czech and English EuroWordNets. Word-translation pairs were filtered and weighted, taking into account the reliability of the source dictionary, the frequencies of the translations in the English monolingual corpus, and the correspondence of the Czech and English POS tags. Furthermore, by training a GIZA++ translation model on the training part of the PCEDT extended by the manual dictionaries, we obtained a probabilistic Czech-English dictionary that is more sensitive to the specific domain of financial news typical of the Wall Street Journal part of PTB.

The resulting Czech-English probabilistic dictionary contains 46,150 word-translation pairs.

The Czech monolingual frequency dictionary of base forms was compiled from 455,689,875 running words from the Czech National Corpus.
The English monolingual frequency dictionary of base forms was compiled from 310,308,540 running words from the North American News Text Collection.

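The following sources were used for the translation of the word lists: the WinGED and GNU/FDL dictionaries, both available on the Web, and a dictionary extracted from the Czech and English EuroWordNets.

The dictionary contains one pair per line, consisting of a Czech word and its English translation.

Each line consists of six strings separated by space characters, in the order given by the header of the sample below: the Czech word, the English word, the translation probabilities P(En|Cz) and P(Cz|En), and the counts cnt(Cz) and cnt(En).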

Here is a sample of the dictionary:

Cz En P(En|Cz) P(Cz|En) cnt(Cz) cnt(En)

obvykle#R typically#R 0.16 0.4 40074 6428
typicky#R typically#R 0.8 0.6 3445 6428
typicky#R characteristically#R 0.2 0.24 3445 174
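Lines in this six-field format can be read with a few lines of Python. The sketch below is a minimal example, not a tool shipped with the PCEDT: the class and function names are hypothetical, the header row of the sample is assumed not to be part of the data, and the "#R" suffixes are kept as part of the word strings without being interpreted.

```python
from typing import List, NamedTuple


class DictEntry(NamedTuple):
    cz: str                # Czech word, suffix such as "#R" kept as-is
    en: str                # English word, suffix kept as-is
    p_en_given_cz: float   # P(En|Cz)
    p_cz_given_en: float   # P(Cz|En)
    cnt_cz: int            # cnt(Cz)
    cnt_en: int            # cnt(En)


def parse_entry(line: str) -> DictEntry:
    """Parse one dictionary line of six space-separated strings."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    cz, en, p_en_cz, p_cz_en, cnt_cz, cnt_en = fields
    return DictEntry(cz, en, float(p_en_cz), float(p_cz_en),
                     int(cnt_cz), int(cnt_en))


# The three sample lines shown above.
sample = """obvykle#R typically#R 0.16 0.4 40074 6428
typicky#R typically#R 0.8 0.6 3445 6428
typicky#R characteristically#R 0.2 0.24 3445 174"""

entries: List[DictEntry] = [parse_entry(line) for line in sample.splitlines()]

# Pick the most probable English translation of one Czech entry.
candidates = [e for e in entries if e.cz == "typicky#R"]
best = max(candidates, key=lambda e: e.p_en_given_cz)
print(best.en, best.p_en_given_cz)
```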




Czech-English Dictionary of Word Forms

The PCEDT also comprises a Czech-English translation dictionary of word forms. This dictionary was generated from the Czech-English Probabilistic Dictionary and from lists of word forms which occur more than 100 times in the Czech and English monolingual corpora mentioned in the previous section. It contains 496,673 word-translation pairs. For example, for the word-translation pair "bankéř" - "banker" in the Czech-English Probabilistic Dictionary, the Czech-English Dictionary of Word Forms contains the following lines (the comments in the right-hand column are not part of the data):

                                 
     bankéř     banker           nominative sg.
     bankéře    banker           genitive + accusative sg.
     bankéře    bankers          accusative pl.
     bankéřem   banker           instrumental sg.
     bankéři    banker           dative + vocative + locative sg.
     bankéři    bankers          nominative + vocative pl.
     bankéřů    bankers          genitive pl.
     bankéřům   bankers          dative pl.

Note that since the Czech form "bankéři" is ambiguous between singular dative, vocative, or locative and plural nominative or vocative, it can be translated into English as either singular or plural.


GNU/FDL English-Czech Dictionary

This is an English-Czech dictionary provided by Milan Svoboda at http://slovnik.zcu.cz under the GNU FDL (Free Documentation License). The version included in the PCEDT was downloaded on 12th February 2004 and contains 115,929 word-translation pairs (and about 81,500 untranslated English entries). Daily updates (of the file slovnik_data.txt) are available at http://slovnik.zcu.cz/download.php.

The format of this dictionary is plain text. Except for comments marked by '#', each line begins with an English word, its Czech translation is in the second field (fields are separated by tabs). The line may continue with additional information, such as POS tag, domain of use, or author of the translation.
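A line in this tab-separated format might be parsed as in the Python sketch below. The function name and the example lines are invented for illustration. The sketch assumes that comments occupy whole lines beginning with '#' and that an untranslated entry has an empty or missing second field; neither detail is confirmed by the description above.

```python
from typing import List, Optional, Tuple


def parse_gnu_fdl_line(line: str) -> Optional[Tuple[str, str, List[str]]]:
    """Split one line into (English word, Czech translation, extra fields).

    Returns None for empty lines and for comment lines (assumed to start
    with '#').
    """
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    english = fields[0]
    # Assumption: untranslated entries have no or an empty second field.
    czech = fields[1] if len(fields) > 1 else ""
    extra = fields[2:]  # e.g. POS tag, domain of use, author
    return english, czech, extra


# Invented example lines, not taken from the actual dictionary file.
print(parse_gnu_fdl_line("# this is a comment"))
print(parse_gnu_fdl_line("banker\tbankéř\tn:"))
print(parse_gnu_fdl_line("abacus\t"))
```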

Data Sizes

Description of Data                                                     #sentences      #words
PTB Corpus: English part
 - manually annotated on tectogrammatical level                              1,257      33,980
 - automatically transformed into analytical & tectogrammatical levels      49,208   1,173,766
 - retranslated by 4 different human translators                               515      13,143
PTB Corpus: Czech part
 - manually annotated on tectogrammatical level                                472      11,077
 - automatically parsed into analytical & tectogrammatical levels           21,656     487,920
Reader's Digest Corpus                                                      43,969     659,059
Czech Monolingual Corpus - Lidové Noviny                                 2,385,000  39,000,000

Translation Dictionaries                                                #entry-translation pairs
 - Czech-English probabilistic dictionary                                   46,150
 - Czech-English dictionary of word forms                                  496,673
 - English-Czech dictionary under GNU/FDL                                  115,929


Tools

SMT Quick Run

SMT Quick Run is a package of scripts and instructions for building a statistical machine translation system from the PCEDT or any other parallel corpus. Follow the instructions on the SMT Quick Run Package page.

Tree Editor TrEd

Tree Editor (TrEd) is a graphical editor and viewer of tree structures. Internally, TrEd works with files in the so-called FS format, which is used for analytical and tectogrammatical dependency trees. TrEd has a modular architecture that allows custom input/output modules to be created to support other data formats.

TrEd supports the following platforms:

See installation instructions and documentation at TrEd Package page or at TrEd Homepage.

TrEd handles files in both FS (*.fs) and CSTS-SGML (*.csts) formats.

NetGraph

NetGraph is a multi-platform client-server application for browsing, selecting and viewing analytical and tectogrammatical dependency trees. You can either view Czech trees from the Prague Dependency Treebank (PDT) on the remote server located at the Institute of Formal and Applied Linguistics in Prague, or install your own server for viewing trees from PCEDT.

See the NetGraph Client Manual and the NetGraph Homepage for instructions on how to install and set up the NetGraph client, and the NetGraph Server Manual for installing the server.

NetGraph reads files in FS-format (*.fs).

References


About Prague Czech-English Dependency Treebank

Prague Dependency Treebank

Statistical Machine Translation

Structural Machine Translation

Other Related References