Prague Czech-English Dependency Treebank

version 1.0

Overview

The Prague Czech-English Dependency Treebank (PCEDT) is a corpus of Czech-English parallel resources suitable for experiments in structural machine translation. PCEDT was developed at the Center for Computational Linguistics and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, with the support of the project MSMT LN00A063 and NSF Grant no. IIS 0121285.

The core part of the PCEDT is a Czech translation of 21,600 English sentences from the Wall Street Journal part of the Penn Treebank 3 corpus (PTB, released by LDC in 1999). Sentences of the Czech translation were automatically morphologically annotated and parsed into two levels (analytical and tectogrammatical) of dependency structures introduced in the theory of Functional Generative Description and closely related to the Prague Dependency Treebank (PDT) project. The original English sentences were transformed from the Penn Treebank phrase-structure trees into dependency representations. A held-out (development and evaluation) set of 515 sentence pairs was selected and manually annotated on the tectogrammatical level in both Czech and English; for the purposes of quantitative evaluation, this set was retranslated from Czech into English by 4 different translation companies.

PCEDT also comprises a parallel Czech-English corpus of plain text from Reader's Digest 1993-1996 consisting of 53,000 parallel sentences, and a large monolingual corpus of Czech (2.4 M sentences).

Also included is a probabilistic Czech-English translation dictionary, which consists of 46,150 word-translation pairs of base forms.

Motivation

The efforts of Czech computational linguists concentrated in the past on creating large-scale monolingual corpora, such as the Czech National Corpus (100 million words annotated on morphological level) and Prague Dependency Treebank (PDT). The PDT is annotated on three levels: morphological layer (lowest), analytical layer (middle) - surface syntactic annotation, and tectogrammatical layer (highest) - level of linguistic meaning. Dependency trees, representing the sentence structure as concentrated around the verb and its valency, are used for the analytical and tectogrammatical layers of PDT.

When starting the PCEDT project, we considered two possible strategies: either the parallel annotation of already existing parallel texts, or the translation and annotation of an existing syntactically annotated corpus.

The main parallel Czech-English resource available so far, the Reader's Digest corpus, contains extremely free translations, which proved "difficult" in several machine-learning experiments (Al-Onaizan et al., 1999). Therefore, we opted for the human translation of an existing monolingual syntactically annotated corpus and its subsequent syntactic annotation. This allows us to better control the translation quality and reliability, and also reduces the necessary annotation effort.

The Wall Street Journal section of the Penn Treebank and the Prague Dependency Treebank are corpora comparable in size (about 1 million words), and they both contain syntactically annotated newspaper texts. The choice of the Penn Treebank as the source corpus was pragmatically motivated: firstly it is a widely recognized linguistic resource, and secondly the translators were native speakers of Czech, capable of high quality translation into their native language. The translators were asked to translate each English sentence as a single Czech sentence and to avoid unnecessary stylistic changes of translated sentences.

Since Czech is a language with a relatively high degree of word-order freedom, its sentences contain certain syntactic phenomena, such as discontinuous constituents, which cannot be straightforwardly handled using the annotation scheme of the Penn Treebank, which is based on phrase-structure trees. Therefore, we decided to adopt the dependency-based annotation scheme of PDT for the PCEDT.

While the morphological annotation of the English part is simply taken over from the Penn Treebank, the analytical and tectogrammatical markups of the English part of the corpus are obtained by two independent procedures transforming the phrase-structure trees into dependency ones.

Data

Czech-English Penn Treebank Corpus

Original Penn Treebank data

The CD contains a copy of the original (English) data from Penn Treebank that were selected for translation. A unique ID was assigned to each sentence.

Czech-English Raw Text Data

The original English data from Penn Treebank were transformed into raw text, with each sentence starting with a unique id, and given to human translators as the source. The resulting Czech translations are in raw text format (character set ISO-8859-2); each sentence starts with a unique id that can be derived from the original id of the English sentence.

Development and Evaluation Test Set

We selected a test set of 515 sentences for development and evaluation. To support quantitative evaluation methods for machine translation, we also had these sentences retranslated from Czech into English by 4 different translation companies.

Automatic conversions of Penn Treebank annotation into dependency structure

Except for the test part of the data, which is tectogrammatically annotated by human annotators, the Czech part of the corpus is annotated by automatic means starting from the raw data, while the tools for automatic markup of the English part make use of the existing annotation of the Penn Treebank corpus.

Preprocessing of Penn Treebank

The general recursive algorithm for transforming a phrase tree topology into a dependency tree topology works as follows:

Terminal nodes of the phrase are converted to nodes of the dependency tree.
Constituents of a non-terminal node are converted into separate dependency trees. The root node of the dependency tree transformed from the head constituent becomes the main root. Dependency trees transformed from the left and right siblings of the head constituent are attached to the main root as the left and right children, respectively.
Nodes representing traces are removed and their children are reattached to the parent of the trace.

The concept of head of a phrase is important when transforming the phrase tree topology into the dependency one. We used Jason Eisner's scripts for marking head constituents in each phrase.
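The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not the conversion code shipped with the PCEDT: the classes Phrase and DepNode, the field head_index (standing in for the output of the head-marking scripts) and the flag is_trace are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Phrase:
    label: str                      # constituent label or POS tag
    word: Optional[str] = None      # set only on terminal nodes
    children: List["Phrase"] = field(default_factory=list)
    head_index: int = 0             # position of the head constituent (assumed pre-marked)
    is_trace: bool = False          # hypothetical flag for trace terminals


@dataclass
class DepNode:
    word: Optional[str]
    label: str
    is_trace: bool = False
    left: List["DepNode"] = field(default_factory=list)
    right: List["DepNode"] = field(default_factory=list)


def to_dependency(phrase: Phrase) -> DepNode:
    """Recursively convert a phrase tree into a dependency tree."""
    if not phrase.children:
        # Step 1: a terminal becomes a dependency node.
        return DepNode(phrase.word, phrase.label, phrase.is_trace)
    # Step 2: convert every constituent; the head's tree supplies the root.
    subtrees = [to_dependency(child) for child in phrase.children]
    root = subtrees[phrase.head_index]
    # Siblings at this level lie outside the children attached deeper down.
    root.left = subtrees[:phrase.head_index] + root.left
    root.right = root.right + subtrees[phrase.head_index + 1:]
    return root


def remove_traces(node: DepNode) -> DepNode:
    """Step 3: drop trace nodes and reattach their children to the trace's parent.

    Assumes that the root itself is not a trace.
    """
    def expand(children: List[DepNode]) -> List[DepNode]:
        kept: List[DepNode] = []
        for child in children:
            remove_traces(child)
            if child.is_trace:
                kept.extend(child.left + child.right)
            else:
                kept.append(child)
        return kept

    node.left = expand(node.left)
    node.right = expand(node.right)
    return node
```

For a phrase tree such as (S (NP (DT the) (NN dog)) (VP (VBZ barks))) with the heads marked as VP, NN and VBZ, calling remove_traces(to_dependency(tree)) yields a tree rooted in "barks" with "dog" as its left child and "the" as the left child of "dog".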

Lemmatization - assigning base forms - is necessary in almost all experiments with morphologically rich languages (such as Czech). For English, the task is less important; moreover, it is substantially simplified by the fact that Penn Treebank data contain manually assigned POS tags. The lemmatization procedure simply searches a list of triples of word form, POS tag and lemma. The list of 910,216 triples was obtained by running the MXPOST tagger and the morpha lemmatization tool on a large corpus of English (365M words, 13M sentences). The lemmatization procedure makes two attempts to find a lemma: first, it tries to find a triple with a matching word form and its (manually assigned) POS tag; if that fails, it makes a second attempt with the word form converted to lowercase. If both attempts fail, the given word form itself is chosen as the lemma.
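The two-attempt lookup described above can be sketched as follows. This is a minimal illustration: the function name lemmatize and the representation of the triple list as a dictionary keyed by (word form, POS tag) are assumptions made for this example, not the format of the actual tool.

```python
from typing import Dict, Tuple


def lemmatize(form: str, pos: str, triples: Dict[Tuple[str, str], str]) -> str:
    """Return the lemma for a word form with a known POS tag.

    `triples` maps (word form, POS tag) to a lemma; this layout is assumed
    here for illustration only.
    """
    # First attempt: exact word form with its manually assigned POS tag.
    lemma = triples.get((form, pos))
    if lemma is None:
        # Second attempt: the same lookup with the word form lowercased.
        lemma = triples.get((form.lower(), pos))
    # Fallback: the word form itself serves as the lemma.
    return lemma if lemma is not None else form
```

With a toy table such as {("dogs", "NNS"): "dog", ("ran", "VBD"): "run"}, the call lemmatize("Ran", "VBD", table) succeeds on the second attempt, while an unknown form such as "Xyzzy" is returned unchanged.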

For technical reasons, a unique identifier is assigned to each token.

Annotation of English - Analytical Representation

The structural transformation works as described above. Because the handling of coordination in PDT is different from the Penn Treebank annotation style and the output of Jason Eisner's head assigning scripts, in the case of a phrase containing a coordinating conjunction (CC), we consider the rightmost CC as the head. The treatment of apposition is a more difficult task, since there is no explicit annotation of this phenomenon in the Penn Treebank; constituents of a noun phrase separated by commas (and not containing CC) are considered to be in apposition and the rightmost comma is the head. The information from the phrase tree and the structure of the dependency tree are both used for analytical function assignment:

WSJ function tag to analytical function mapping: some function tags of a phrase tree correspond to analytical functions in an analytical tree and can be mapped to them: SBJ->Sb, DTV->Obj, LGS->Obj, BNF->Obj, TPC->Obj, CLR->Obj, ADV->Adv, DIR->Adv, EXT->Adv, LOC->Adv, MNR->Adv, PRP->Adv, TMP->Adv, PUT->Adv.
Assignment of analytical functions using local context: for assigning analytical functions to the remaining nodes, we use simple rules that take into account the POS and the name of the constituent headed by a node in the original phrase tree. The rules can use this information for the current node, its parent, and its grandparent. For example, the rule mPOS=DT|mAF=Atr assigns the analytical function Atr to every determiner, the rule mPOS=MD|pPOS=VB|mAF=AuxV assigns the analytical function AuxV to a modal verb headed by a verb, etc. The attribute mPOS, representing the POS of the node, is obligatory for every rule. The rules are examined first in the order of the longest prefix of the POS of the given node and second in the order in which they are listed in the rule file. The ordering of rules is important, since the first matching rule found assigns the analytical function and ends the search.
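The two assignment steps above can be illustrated with a short sketch. The rule syntax (fields separated by "|", the obligatory mPOS attribute and the mAF result) is taken from the examples in the text; everything else is an assumption made for this illustration, in particular the function names, the reading of "longest prefix" as "the rule's mPOS value is a prefix of the node's POS tag", and the use of prefix matching for the remaining attributes such as pPOS.

```python
from typing import Dict, List, Optional, Tuple

# Function tag to analytical function mapping as listed in the text.
FUNCTION_TAG_TO_AFUN = {
    "SBJ": "Sb",
    "DTV": "Obj", "LGS": "Obj", "BNF": "Obj", "TPC": "Obj", "CLR": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv", "MNR": "Adv",
    "PRP": "Adv", "TMP": "Adv", "PUT": "Adv",
}

Rule = Tuple[Dict[str, str], str]  # (conditions, analytical function)


def parse_rule(line: str) -> Rule:
    """Parse a rule such as 'mPOS=MD|pPOS=VB|mAF=AuxV'."""
    fields = dict(item.split("=", 1) for item in line.strip().split("|"))
    afun = fields.pop("mAF")
    if "mPOS" not in fields:
        raise ValueError("mPOS is obligatory in every rule: " + line)
    return fields, afun


def assign_afun(context: Dict[str, str], rules: List[Rule],
                function_tag: Optional[str] = None) -> Optional[str]:
    """Assign an analytical function to a node described by `context`.

    `context` holds attributes such as mPOS (the node's POS) and pPOS (its parent's POS).
    """
    # Step 1: a mapped function tag decides directly.
    if function_tag in FUNCTION_TAG_TO_AFUN:
        return FUNCTION_TAG_TO_AFUN[function_tag]
    # Step 2: local-context rules; longer mPOS prefixes are tried first.
    # sorted() is stable, so ties keep their order from the rule file.
    node_pos = context["mPOS"]
    candidates = [r for r in rules if node_pos.startswith(r[0]["mPOS"])]
    candidates = sorted(candidates, key=lambda r: -len(r[0]["mPOS"]))
    for conditions, afun in candidates:
        if all(context.get(k, "").startswith(v) for k, v in conditions.items()):
            return afun  # the first matching rule ends the search
    return None
```

Under these assumptions, a modal verb whose parent is tagged VBZ matches the rule mPOS=MD|pPOS=VB|mAF=AuxV, because VB is a prefix of VBZ.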

Specifics of the PDT and Penn Treebank annotation schemes, mainly the markup of coordinations, appositions, and prepositional phrases, are handled by this step:

Coordinations and appositions: the analytical function originally assigned to the head of a coordination or apposition is propagated to its child nodes with the suffix _Co or _Ap attached, and the head node gets the analytical function Coord or Apos, respectively.
Prepositional phrases: the analytical function originally assigned to a preposition node is propagated to its child and the preposition node is labeled AuxP.
Sentences in the PDT annotation style always contain a root node labeled AuxS, which, as the only one in the dependency tree, does not correspond to any terminal of the phrase tree; the root node is inserted above the original root. While in the Penn Treebank the final punctuation is a constituent of the sentence phrase, in the analytical tree it is moved under the technical sentence root node.
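The adjustments described above can be sketched as a top-down pass over the dependency tree. This is a simplified illustration, not the actual conversion tool: the Node class, its role marker ("coord", "apos" or "prep", assumed to have been set by earlier steps), the form "#root" of the technical root, and the test for final punctuation are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    form: str
    afun: str
    children: List["Node"] = field(default_factory=list)
    role: str = ""  # hypothetical marker: "coord", "apos", "prep" or ""


def propagate(node: Node) -> None:
    """Push the analytical function of special heads down to their children."""
    if node.role in ("coord", "apos"):
        suffix, label = ("_Co", "Coord") if node.role == "coord" else ("_Ap", "Apos")
        for child in node.children:
            child.afun = node.afun + suffix
        node.afun = label
    elif node.role == "prep":
        for child in node.children:
            child.afun = node.afun
        node.afun = "AuxP"
    # Process parents before children so that an inherited function can be
    # passed further down (e.g. a preposition inside a coordination).
    for child in node.children:
        propagate(child)


def add_technical_root(root: Node) -> Node:
    """Insert the AuxS root above the original root and move final punctuation under it."""
    technical_root = Node("#root", "AuxS", [root])
    if root.children and root.children[-1].form in {".", "?", "!"}:
        technical_root.children.append(root.children.pop())
    return technical_root
```

For a coordination head "and" that initially carries the function Sb, propagate relabels the head as Coord and both conjuncts as Sb_Co. The sketch applies the suffix to every child of the head, exactly as the description above states.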

Annotation of English - Tectogrammatical Representation

The transformation of Penn Treebank phrase trees into tectogrammatical representation consists of a structural transformation and the assignment of a tectogrammatical functor and a set of grammatemes to each node of the resulting tree.

At the beginning of the structural transformation, the initial dependency tree is created by a general transformation procedure as described above. However, there are differences in the notion of head between phrasal grammar and the guidelines for tectogrammatical annotation; for example, the head of a prepositional phrase is not a preposition. In the next step, nodes corresponding to functional (synsemantic) words, such as prepositions, punctuation marks, determiners, subordinating conjunctions, certain particles, auxiliary and modal verbs are marked as "hidden" and information about them is stored in their governing nodes. The well-formedness of a tectogrammatical tree structure requires the valency frames to be complete: apart from nodes that are realized on surface, there are several types of "restored" nodes representing the non-realized members of valency frames (cf. pro-drop property of Czech and verbal condensations using gerunds and infinitives both in Czech and English). For a partial reconstruction of such nodes, we can use traces, which allow us to establish coreferential links or general participants of the valency frames.

For the assignment of tectogrammatical functors, we can use rules taking into consideration POS tags (e.g. PRP->APP), function tags (JJ->RSTR, JJR->CPR, etc.) and lemma ("not"->RHEM, "both"->RSTR).

Grammateme Assignment - morphological (e.g. Tense, Degree of Comparison) and syntactic grammatemes (e.g. TWHEN_AFT(er)) are assigned to each node of the tectogrammatical tree. The assignment of the morphological attributes is based on Penn Treebank tags and reflects basic morphological properties of the language. The syntactic grammatemes capture more specific information about deep syntactic structure. At the moment, there are no automatic tools for the assignment of the latter ones.

The whole procedure is described in detail in Kučerová and Žabokrtský (2002).

Compared with manually annotated trees, the transformation yields about 6% wrongly attached dependencies and 18% wrongly assigned functors.

Automatic annotation of Czech

Morphological tagging of Czech

The Czech translations of the Penn Treebank sentences were automatically tokenized and morphologically tagged, and each word form was assigned a base form (lemma), using the tagging tools of Hajič and Hladká (1998).

Analytical parsing of Czech

Czech analytical parsing consists of a statistical dependency parser for Czech - either Collins parser (Hajič et al., 1998) or Charniak parser (Charniak, 1999), both adapted to dependency grammar - and a module for automatic analytical functor assignment (Žabokrtský et al., 2002). Both versions of output from Collins and Charniak parsers are present. For efficiency reasons, sentences longer than 60 words were excluded from the corpus parsed by Collins parser.

Transition to tectogrammatical representation of Czech

When building the tectogrammatical structure, the analytical tree structure is converted into the tectogrammatical one. These transformations are described by linguistic rules (Böhmová, 2001). Then, tectogrammatical functors are assigned by a C4.5 classifier (Žabokrtský et al., 2002).

Manual Tectogrammatical Annotation of Czech and English

Since there are no guidelines for tectogrammatical annotation of English yet, and in order to acquire some initial experience before the work on the guidelines begins, a "gold standard" tectogrammatical annotation of more than 1,000 sentences has been done. These data are assigned morphological grammatemes (the full set of values) and syntactic grammatemes, and the nodes are reordered according to topic-focus-articulation (information structure). The manually annotated sentences comprise the whole development and evaluation test set. Also the Czech counterpart of the test set has been manually annotated according to the guidelines for tectogrammatical annotation of Czech.

Reader's Digest Corpus

This corpus contains parallel raw text of 450 articles from the Reader's Digest, years 1993-1996. The Czech part is a translation of the English one. Sentence pairs were aligned automatically by Dan Melamed's SIMR/GMA tool. Since the translations in this corpus are relatively free, only 43,969 of 54,091 aligned segments contain 1-to-1 sentence alignments.

Czech Monolingual Corpus

The electronic text sources were provided by the Institute of the Czech National Corpus. All data originally come from news articles published in the daily newspaper Lidové Noviny in 1994-1995. The internal format of the data corresponds to the CSTS format. The total amount of data is more than 39M tokens (words proper + punctuation) in about 2,385K sentences.

Dictionaries

Czech-English Probabilistic Dictionary

This dictionary was compiled from translations of lists of words extracted from Czech and English monolingual frequency dictionaries of base forms. For the translation of word lists we used three different Czech-English manual dictionaries: two of them were available on the Web (WinGED and GNU/FDL) and one was extracted from Czech and English EuroWordNets. Word-translation pairs were filtered and weighted, taking into account the reliability of the source dictionary, the frequencies of the translations in the English monolingual corpus, and the correspondence of the Czech and English POS tags. Furthermore, by training a GIZA++ translation model on the training part of the PCEDT extended by the manual dictionaries, we obtained a probabilistic Czech-English dictionary that is more sensitive to the specific domain of financial news typical for the Wall Street Journal part of PTB.

The resulting Czech-English probabilistic dictionary contains 46,150 word-translation pairs.

The Czech monolingual frequency dictionary of base forms was compiled from 455,689,875 running words from the Czech National Corpus.
The English monolingual frequency dictionary of base forms was compiled from 310,308,540 running words from the North American News Text Collection.

The following sources were used for the translation of lists of words:

Czech-English and English-Czech WinGED dictionary (http://www.rewin.cz/, http://slovnik.atlas.cz/)
GNU/FDL Czech-English Dictionary (http://slovnik.zcu.cz/)
Czech and English EuroWordNets (http://www.illc.uva.nl/EuroWordNet/)

The dictionary contains one pair per line, consisting of a Czech word and its English translation.

Each line consists of six strings separated by space characters:

Czech base form and the first letter of its POS tag (joined by a hash (#) character)
English base form and the first letter of its POS tag (joined by a hash (#) character)
Conditional probability P(En|Cz) from GIZA++ training
Conditional probability P(Cz|En) from GIZA++ training
Count of the Czech base form from the Czech monolingual corpus
Count of the English base form from the English monolingual corpus

Here is a sample of the dictionary:

Cz En P(En|Cz) P(Cz|En) cnt(Cz) cnt(En)

obvykle#R typically#R 0.16 0.4 40074 6428
typicky#R typically#R 0.8 0.6 3445 6428
typicky#R characteristically#R 0.2 0.24 3445 174

NOTES:

Multi-word entries or translations such as "take_off" are joined by the underscore character, for example: "take_off#V"
The sum of the P(En|Cz) probabilities for a particular Czech word (Cz) is less than or equal to 1. It does not necessarily sum to 1 because some of the pairs from the GIZA++ table might have been omitted from the dictionary.
One can obtain the probability of a Czech word by dividing cnt(Cz) by the number of running words in the Czech monolingual corpus (i.e. 455,689,875) and the probability of an English word by dividing cnt(En) by the number of running words in the English monolingual corpus (i.e. 310,308,540).
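The line format and the last note can be illustrated with a short parsing sketch. The six-field layout and the two corpus sizes come from the description above; the names Entry, parse_entry, czech_unigram_probability and english_unigram_probability are hypothetical and introduced only for this example.

```python
from typing import NamedTuple

# Numbers of running words in the monolingual corpora, as stated above.
CZ_CORPUS_WORDS = 455_689_875
EN_CORPUS_WORDS = 310_308_540


class Entry(NamedTuple):
    cz_lemma: str
    cz_pos: str
    en_lemma: str
    en_pos: str
    p_en_given_cz: float
    p_cz_given_en: float
    cnt_cz: int
    cnt_en: int


def parse_entry(line: str) -> Entry:
    """Parse one dictionary line of six space-separated strings."""
    cz, en, p_en_cz, p_cz_en, cnt_cz, cnt_en = line.split()
    # The base form and the first letter of the POS tag are joined by '#'.
    cz_lemma, cz_pos = cz.rsplit("#", 1)
    en_lemma, en_pos = en.rsplit("#", 1)
    return Entry(cz_lemma, cz_pos, en_lemma, en_pos,
                 float(p_en_cz), float(p_cz_en), int(cnt_cz), int(cnt_en))


def czech_unigram_probability(entry: Entry) -> float:
    """Relative frequency of the Czech base form in the Czech corpus."""
    return entry.cnt_cz / CZ_CORPUS_WORDS


def english_unigram_probability(entry: Entry) -> float:
    """Relative frequency of the English base form in the English corpus."""
    return entry.cnt_en / EN_CORPUS_WORDS


if __name__ == "__main__":
    entry = parse_entry("typicky#R typically#R 0.8 0.6 3445 6428")
    print(entry.cz_lemma, entry.en_lemma, czech_unigram_probability(entry))
```

Multi-word entries keep their underscores in this sketch (for example "take_off"); whether to replace them with spaces depends on the application.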

Czech-English Dictionary of Word Forms

The PCEDT also comprises a Czech-English translation dictionary of word forms. This dictionary was generated from the Czech-English Probabilistic Dictionary and from lists of word forms which occur more than 100 times in the Czech and English monolingual corpora mentioned in the previous section. This dictionary contains 496,673 word-translation pairs. For example, for the word-translation pair "bankéř" - "banker" in the Czech-English Probabilistic Dictionary, the Czech-English Dictionary of Word Forms contains the following lines (the grammatical comments in the right-hand column are not part of the data):

                                 
     bankéř     banker           nominative sg.
     bankéře    banker           genitive + accusative sg.
     bankéře    bankers          accusative pl.
     bankéřem   banker           instrumental sg.
     bankéři    banker           dative + vocative + locative sg.
     bankéři    bankers          nominative + vocative pl.
     bankéřů    bankers          genitive pl.
     bankéřům   bankers          dative pl.

Note that since Czech "bankéři" is ambiguous between singular dative, vocative, or locative and plural nominative or vocative, it can be translated into English as either singular or plural.

GNU/FDL English-Czech Dictionary

This is an English-Czech dictionary provided by Milan Svoboda at http://slovnik.zcu.cz under the GNU FDL (Free Documentation License). The version included in PCEDT was downloaded on 12th February 2004 and contains 115,929 word-translation pairs (and about 81,500 untranslated English entries). Daily updates (of the file slovnik_data.txt) are available at http://slovnik.zcu.cz/download.php.

The format of this dictionary is plain text. Except for comments marked by '#', each line begins with an English word, followed by its Czech translation in the second field (fields are separated by tabs). The line may continue with additional information, such as a POS tag, domain of use, or author of the translation.
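Reading this format can be sketched as follows. The sketch assumes that a comment is a whole line starting with '#' and that an untranslated entry has a missing or empty second field; neither detail is specified above, and the function name parse_gnu_fdl is hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_gnu_fdl(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (English word, Czech translation, extra fields) for each entry line."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip empty lines and comment lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        english = fields[0]
        # Untranslated entries are assumed to have no Czech field.
        czech = fields[1] if len(fields) > 1 else ""
        yield english, czech, fields[2:]
```

The function accepts any iterable of lines, so it can be applied to an open file object as well as to a list of strings; the character encoding of the file has to be chosen by the caller.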

Data Sizes

Description of Data	#sentences	#words
PTB Corpus: English part
- manually annotated on tectogrammatical level	1,257	33,980
- automatically transformed into analytical & tectogrammatical levels	49,208	1,173,766
- retranslated by 4 different human translators	515	13,143
PTB Corpus: Czech part
- manually annotated on tectogrammatical level	472	11,077
- automatically parsed into analytical & tectogrammatical levels	21,656	487,920
Reader's Digest Corpus	43,969	659,059
Czech Monolingual Corpus - Lidové Noviny	2,385,000	39,000,000
Translation Dictionaries	#entry-translation pairs
- Czech-English probabilistic dictionary	46,150
- Czech-English dictionary of word forms	496,673
- English-Czech dictionary under GNU/FDL	115,929

Tools

SMT Quick Run

SMT Quick Run is a package of scripts and instructions for building a statistical machine translation system from the PCEDT or any other parallel corpus. Follow the instructions on the SMT Quick Run Package page.

Tree Editor TrEd

Tree Editor (TrEd) is a graphical editor and viewer of tree structures. Internally, TrEd works with files in the so-called FS format, which is used for analytical and tectogrammatical dependency trees. TrEd has a modular architecture that allows custom input/output modules to be created in order to support other data formats.

TrEd supports the following platforms:

Windows 95/98/ME or Windows NT/2000/XP (TM)
Linux
BSD, UNIX, Solaris (TM) and other UNIX-based systems

See installation instructions and documentation at TrEd Package page or at TrEd Homepage.

TrEd handles files in both FS (*.fs) and CSTS-SGML (*.csts) formats.

NetGraph

NetGraph is a multi-platform client-server application that allows you to browse, select and view analytical and tectogrammatical dependency trees. You can either view Czech trees from the Prague Dependency Treebank (PDT) on the remote server located at the Institute of Formal and Applied Linguistics in Prague, or install your own server for viewing trees from PCEDT.

See the NetGraph Client Manual and the NetGraph Homepage for instructions on how to install and set up the NetGraph client, and the NetGraph Server Manual for installing the server.

NetGraph reads files in FS-format (*.fs).

References

About Prague Czech-English Dependency Treebank

Martin Čmejrek, Jan Cuřín, Jiří Havelka. 2003. Treebanks in Machine Translation, In Proceedings of The Second Workshop on Treebanks and Linguistic Theories, Vaxjo, Sweden, pp. 209-212. Available in PostScript or PDF.
Jan Cuřín, Martin Čmejrek, Jiří Havelka, Vladislav Kuboň. 2004. Building Parallel Bilingual Syntactically Annotated Corpus, In Proceedings of The First International Joint Conference on Natural Language Processing, Hainan Island, China, pp. 141-146. Available in PDF.
Martin Čmejrek, Jan Cuřín, Jiří Havelka. 2004. Prague Czech-English Dependency Treebank: Any Hopes for a Common Annotation Scheme?, In HLT/NAACL 2004 Workshop: Frontiers in Corpus Annotation, Boston, Massachusetts, pp. 47-54. Available in PostScript or PDF.
Martin Čmejrek, Jan Cuřín, Jiří Havelka, Jan Hajič, Vladislav Kuboň. 2004. Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation, In 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal. Available in PostScript or PDF.

Prague Dependency Treebank

Eva Hajičová, Jarmila Panevová, and Petr Sgall. 2000. A Manual for Tectogrammatic Tagging of the Prague Dependency Treebank. Technical Report TR-2000-09, ÚFAL MFF UK, Prague, Czech Republic.
Jan Hajič et al., 2001. A Manual for Analytic Layer Tagging of the Prague Dependency Treebank, Prague, Czech Republic, English translation of the original Czech version. Available in PDF at http://quest.ms.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/aman-en/index.html

Statistical Machine Translation

Yaser Al-Onaizan, Jan Cuřín, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, David Yarowsky. 1999. Statistical Machine Translation. Final Report, JHU Summer Workshop'99. Available at http://www.clsp.jhu.edu/ws99/projects/mt/final_report/mt-final-report.ps
Ulrich Germann. 2003. Greedy Decoding for Statistical Machine Translation in Almost Linear Time. In Proceedings of HLT-NAACL-2003, Edmonton, Canada.
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models, Computational Linguistics, volume 29, number 1, pp. 19-51.

Structural Machine Translation

Martin Čmejrek, Jan Cuřín, Jiří Havelka. 2002. Czech-English Dependency-based Machine Translation: Data Preparation for the Starting up Experiments, The Prague Bulletin of Mathematical Linguistics, Volume 78, pp. 103-116. Available in PostScript or PDF.
Martin Čmejrek, Jan Cuřín, Jiří Havelka. 2003. Czech-English Dependency-based Machine Translation, In Proceedings of the 10th Conference of The European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 83-90. Available in PostScript or PDF.

Other Related References

Alena Böhmová. 2001. Automatic Procedures in Tectogrammatical Tagging. The Prague Bulletin of Mathematical Linguistics, 76.
Eugene Charniak. 1999. A maximum-entropy-inspired parser. Technical Report CS-99-12.
Jan Hajič, Barbora Vidová-Hladká. 1998. Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of the Conference COLING - ACL `98 Montreal, Canada.
Jan Hajič, Barbora Vidová-Hladká, Dan Zeman, Michael Collins, Lance Ramshaw, Christoph Tillmann, Eric Brill, D. Jones, C. Kuo, O. Schwartz. 1998. Core Natural Language Processing Technology Applicable to Multiple Languages. The Prague Bulletin of Mathematical Linguistics, 70.
Zdeněk Žabokrtský, Petr Sgall, Sašo Džeroski. 2002. A Machine Learning Approach to Automatic Functor Assignment in the Prague Dependency Treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pp. 1513-1520. ELRA.
Zdeněk Žabokrtský, Ivona Kučerová. 2002. Transforming Penn Treebank phrase trees into (Praguian) tectogrammatical dependency trees, The Prague Bulletin of Mathematical Linguistics, 78. On-line version at http://ckl.mff.cuni.cz/~zabokrtsky/wsj2tgts