Prague Czech-English Dependency Treebank
version 1.0
Overview
Prague Czech-English Dependency Treebank (PCEDT) is a
corpus of Czech-English parallel resources suitable for
experiments in structural machine translation. PCEDT was
developed at the Center for Computational Linguistics and the Institute of
Formal and Applied Linguistics, Faculty of Mathematics
and Physics, Charles University in Prague, with the support
of the project MSMT LN00A063 and NSF Grant no. IIS 0121285.
The core part of the PCEDT is a Czech translation of 21,600
English sentences from the Wall Street Journal part of the
Penn Treebank 3 corpus (PTB, released by LDC in 1999).
Sentences of the Czech translation were automatically
morphologically annotated and parsed into two levels
(analytical and tectogrammatical) of dependency structures
introduced in the theory of Functional Generative
Description and closely related to the project of Prague Dependency Treebank (PDT). The
original English sentences were transformed from the Penn
Treebank phrase-structure trees into dependency
representations. A heldout (development and evaluation) set
of 515 sentence pairs was selected and manually annotated
on tectogrammatical level in both Czech and English; for
the purposes of quantitative evaluation this set has been
retranslated from Czech to English by 4 different
translation companies.
PCEDT also comprises a parallel Czech-English corpus of
plain text from Reader's Digest 1993-1996 consisting of
53,000 parallel sentences, and a large monolingual corpus
of Czech (2.4 M sentences).
Also included is a probabilistic Czech-English translation
dictionary, which consists of 46,150 word-translation pairs of
base forms.
Motivation
The efforts of Czech computational linguists concentrated in
the past on creating large-scale monolingual corpora, such as
the Czech National Corpus (100 million words annotated on the
morphological level) and the Prague Dependency Treebank (PDT).
The PDT is annotated on three levels: morphological layer
(lowest), analytical layer (middle) - surface syntactic
annotation, and tectogrammatical layer (highest) - level of
linguistic meaning. Dependency trees, representing the
sentence structure as concentrated around the verb and its
valency, are used for the analytical and tectogrammatical
layers of PDT.
At the start of the PCEDT project, we had to decide between two
possible strategies: either the parallel annotation of
already existing parallel texts, or the translation and
annotation of an existing syntactically annotated corpus.
The main parallel Czech-English resource to date, the Reader's
Digest corpus, contains extremely free translations, which
has proved "difficult" in several machine-learning
experiments (Al-Onaizan, et al., 1999). Therefore, we opted
for the human translation of an existing monolingual
syntactically annotated corpus and its subsequent syntactic
annotation. This allows us to better control the translation
quality and reliability, and also reduces the necessary
annotation effort.
The Wall Street Journal section of the Penn Treebank and the
Prague Dependency Treebank are corpora comparable in size
(about 1 million words), and they both contain syntactically
annotated newspaper texts. The choice of the Penn Treebank as
the source corpus was pragmatically motivated: firstly it is
a widely recognized linguistic resource, and secondly the
translators were native speakers of Czech, capable of high
quality translation into their native language. The
translators were asked to translate each English sentence as
a single Czech sentence and to avoid unnecessary stylistic
changes of translated sentences.
Since Czech is a language with a relatively high degree of word-order
freedom, its sentences contain certain syntactic
phenomena, such as discontinuous constituents, which cannot
be straightforwardly handled using the annotation scheme of
Penn Treebank, based on phrase-structure trees. Therefore, we
decided to adopt the dependency-based annotation scheme of
PDT for the PCEDT.
While the morphological annotation of the English part is
simply taken over from the Penn Treebank, the analytical and
tectogrammatical markups of the English part of the corpus
are obtained by two independent procedures transforming the
phrase-structure trees into dependency ones.
Data
Czech-English Penn Treebank Corpus
Original Penn Treebank data
The CD contains a copy of the original (English) data from
Penn Treebank that were selected for translation. A unique ID
was assigned to each sentence.
Czech-English Raw Text Data
The original English data from Penn Treebank were transformed into raw
text, with each sentence starting with a unique id, and given to the
human translators as the source. The resulting Czech translations are
in raw text format (character set ISO-8859-2); each sentence
starts with a unique id that can be derived from the original id
of the English sentence.
Development and Evaluation Test Set
We selected a test set of 515 sentences for development and
evaluation. To support quantitative evaluation of machine
translation, we also had them retranslated
from Czech into English by 4 different translation companies.
Automatic conversions of Penn Treebank annotation into dependency structure
Except for the test part of the data, which is
tectogrammatically annotated by human annotators, the Czech
part of the corpus is annotated by automatic means starting
from the raw data, while the tools for automatic markup of
the English part make use of the existing annotation of the
Penn Treebank corpus.
Preprocessing of Penn Treebank
The general recursive algorithm for transforming phrase-tree
topology into dependency topology works as follows:
-
Terminal nodes of the phrase are converted to nodes of
the dependency tree.
-
Constituents of a non-terminal node are converted into
separate dependency trees. The root node of the
dependency tree transformed from the head constituent
becomes the main root. Dependency trees transformed from
the left and right siblings of the head constituent are
attached to the main root as the left and right children,
respectively.
-
Nodes representing traces are removed and their children
are reattached to the parent of the trace.
The concept of head of a phrase is
important when transforming the phrase tree topology into the
dependency one. We used Jason Eisner's scripts for marking
head constituents in each phrase.
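The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the conversion code shipped with the PCEDT: the Phrase and DepNode classes, the head_index field (standing in for the output of the head-marking scripts) and the is_trace flag are hypothetical names introduced here, and the root of the tree is assumed not to be a trace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Phrase:
    """Phrase-structure node; a terminal has a word and no children."""
    label: str
    word: Optional[str] = None
    children: List["Phrase"] = field(default_factory=list)
    head_index: int = 0       # assumed to be pre-marked by a head-finding step
    is_trace: bool = False    # hypothetical flag for trace terminals


@dataclass
class DepNode:
    word: str
    tag: str
    is_trace: bool = False
    left: List["DepNode"] = field(default_factory=list)
    right: List["DepNode"] = field(default_factory=list)


def to_dependency(phrase: Phrase) -> DepNode:
    """Convert a head-marked phrase tree into a dependency tree."""
    if not phrase.children:
        # Step 1: a terminal becomes a dependency node.
        return DepNode(phrase.word or "", phrase.label, phrase.is_trace)
    # Step 2: convert every constituent separately.
    subtrees = [to_dependency(child) for child in phrase.children]
    root = subtrees[phrase.head_index]   # root of the head constituent
    # Siblings to the left of the head become left children (placed outside
    # the children the root already has), siblings to the right likewise.
    root.left = subtrees[:phrase.head_index] + root.left
    root.right = root.right + subtrees[phrase.head_index + 1:]
    return root


def remove_traces(node: DepNode) -> DepNode:
    """Step 3: drop trace nodes and reattach their children to the parent."""
    def clean(children: List[DepNode]) -> List[DepNode]:
        kept: List[DepNode] = []
        for child in children:
            remove_traces(child)
            if child.is_trace:
                kept.extend(child.left + child.right)
            else:
                kept.append(child)
        return kept

    node.left = clean(node.left)
    node.right = clean(node.right)
    return node
```

For a toy tree (S (NP The dog) (VP barks)) with the VP marked as the head of S and "dog" as the head of the NP, the sketch yields "barks" as the root, "dog" as its left child, and "The" as the left child of "dog".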
Lemmatization - assigning base forms - is necessary
in almost all experiments with morphologically rich languages
(such as Czech). For English, the task is less
important; moreover, it is substantially simplified by
the fact that Penn Treebank data contain manually assigned
POS tags. The lemmatization procedure just searches the
list of all triples of word form, POS tag and lemma. The
list of 910,216 triples was obtained by running MXPOST tagger and morpha
lemmatization tool on a large corpus of English (365M
words, 13M sentences). The lemmatization procedure makes
two attempts to find a lemma: first, it tries to find a
triple with a matching word form and its (manually
assigned) POS, and then, if it fails, it makes a second
attempt with the word form converted to lowercase. If both
attempts fail, then the given word form is chosen as the
lemma.
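The two-attempt lookup described above can be sketched as follows. The table here is a toy dictionary keyed by (word form, POS tag); how the real list of triples is stored and loaded is not specified in this document, so the data structure and the function name are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Toy stand-in for the list of (word form, POS tag, lemma) triples.
LemmaTable = Dict[Tuple[str, str], str]


def lemmatize(form: str, pos: str, table: LemmaTable) -> str:
    """Look up a lemma: exact form first, then lowercased form, else the form itself."""
    lemma = table.get((form, pos))
    if lemma is None:
        # Second attempt with the word form converted to lowercase.
        lemma = table.get((form.lower(), pos))
    # If both attempts fail, the word form itself serves as the lemma.
    return lemma if lemma is not None else form


toy_table: LemmaTable = {("dogs", "NNS"): "dog", ("ran", "VBD"): "run"}
```

With this toy table, a sentence-initial "Dogs" tagged NNS is resolved through the lowercase fallback, while an unknown form is returned unchanged.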
For technical reasons, a unique identifier is
assigned to each token.
Annotation of English - Analytical Representation
The structural transformation works as described
above. Because the handling of coordination in PDT is
different from the Penn Treebank annotation style and the
output of Jason Eisner's head assigning scripts, in the case
of a phrase containing a coordinating conjunction
(CC), we consider the rightmost CC as the
head. The treatment of apposition is a more difficult task,
since there is no explicit annotation of this phenomenon in
the Penn Treebank; constituents of a noun phrase separated by
commas (and not containing CC) are considered to be
in apposition and the rightmost comma is the head. The
information from the phrase tree and the structure of the
dependency tree are both used for analytical function
assignment:
-
WSJ function tag to analytical function mapping: some
function tags of a phrase tree correspond to analytical
functions in an analytical tree and can be mapped to
them: SBJ->Sb, DTV->Obj,
LGS->Obj, BNF->Obj,
TPC->Obj, CLR->Obj,
ADV->Adv, DIR->Adv,
EXT->Adv, LOC->Adv,
MNR->Adv, PRP->Adv,
TMP->Adv, PUT->Adv.
-
Assignment of analytical functions using local context:
for assigning analytical functions to the remaining
nodes, we use simple rules taking into account POS and the
name of the constituent headed by a node in the original
phrase tree. In the rules this information for the
current node, its parent and grandparent can be used. For
example, the rule mPOS=DT|mAF=Atr assigns the
analytical function Atr to every determiner, the
rule mPOS=MD|pPOS=VB|mAF=AuxV assigns the
analytical function AuxV to a modal verb headed by a
verb, etc. The attribute mPOS representing the
POS of the node is obligatory for every rule. Rules whose
mPOS value matches a longer prefix of the node's POS are
examined first; rules with equally long prefixes are examined
in the order in which they are listed in the rule file. The
ordering of rules is important since the first matching rule
found assigns the analytical function and ends the search.
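A minimal sketch of the two assignment steps, assuming that a node is represented as a plain dictionary and that every POS condition in a rule (mPOS, pPOS) is matched as a prefix of the corresponding tag. The dictionary keys other than mPOS, pPOS and mAF, and the function names, are hypothetical; the actual rule file may support further attributes.

```python
from typing import Dict, List, Optional

# Mapping of WSJ function tags to analytical functions, as listed above.
FUNCTION_TAG_TO_AFUN = {
    "SBJ": "Sb",
    "DTV": "Obj", "LGS": "Obj", "BNF": "Obj", "TPC": "Obj", "CLR": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "TMP": "Adv", "PUT": "Adv",
}


def parse_rule(line: str) -> Dict[str, str]:
    """Parse a rule such as 'mPOS=MD|pPOS=VB|mAF=AuxV' into a dictionary."""
    return dict(part.split("=", 1) for part in line.strip().split("|"))


def order_rules(rules: List[Dict[str, str]], pos: str) -> List[Dict[str, str]]:
    """Keep rules whose mPOS is a prefix of pos, longest prefix first.

    sorted() is stable, so rules with equally long prefixes keep file order.
    """
    applicable = [rule for rule in rules if pos.startswith(rule["mPOS"])]
    return sorted(applicable, key=lambda rule: -len(rule["mPOS"]))


def assign_afun(node: Dict[str, str], rules: List[Dict[str, str]]) -> Optional[str]:
    """Assign an analytical function to a node, or None if nothing matches."""
    tag = node.get("function_tag")  # hypothetical key for the WSJ function tag
    if tag in FUNCTION_TAG_TO_AFUN:
        return FUNCTION_TAG_TO_AFUN[tag]
    for rule in order_rules(rules, node["mPOS"]):
        conditions = {k: v for k, v in rule.items() if k not in ("mPOS", "mAF")}
        # Assumption: remaining conditions are prefix matches as well.
        if all(node.get(k, "").startswith(v) for k, v in conditions.items()):
            return rule["mAF"]  # first matching rule wins
    return None
```

For example, with the two rules quoted above, a node with mPOS "MD" and pPOS "VBD" receives AuxV, and a determiner receives Atr.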
Specifics of the PDT and Penn Treebank annotation
schemes, mainly the markup of coordinations, appositions, and
prepositional phrases, are handled by this step:
-
Coordinations and appositions: the analytical function, which
was originally assigned to the head of a coordination or
apposition is propagated to its children nodes by attaching
the suffix _Co or _Ap to them and the head
node gets the analytical function Coord or
Apos, respectively.
-
Prepositional phrases: the analytical function originally
assigned to a preposition node is propagated to its
child and the preposition node is labeled AuxP.
-
Sentences in the PDT annotation style always contain a
root node labeled AuxS, which, as the only one
in the dependency tree, does not correspond to any
terminal of the phrase tree; the root node is inserted
above the original root. While in the Penn Treebank the
final punctuation is a constituent of the sentence
phrase, in the analytical tree it is moved under the
technical sentence root node.
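The relabeling steps above might look roughly like this in Python. The Node class and its kind marker ("coord", "apos", "prep") are hypothetical; the sketch also assumes that the final punctuation is the last child of the original root and uses "#" as a placeholder form for the technical root.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    form: str
    afun: str = ""
    kind: str = ""  # hypothetical marker: "coord", "apos", "prep" or ""
    children: List["Node"] = field(default_factory=list)


def relabel(node: Node) -> None:
    """Propagate analytical functions downwards in the PDT style (top-down)."""
    original = node.afun
    if node.kind in ("coord", "apos"):
        suffix = "_Co" if node.kind == "coord" else "_Ap"
        for child in node.children:
            child.afun = original + suffix
        node.afun = "Coord" if node.kind == "coord" else "Apos"
    elif node.kind == "prep":
        for child in node.children:
            child.afun = original
        node.afun = "AuxP"
    for child in node.children:
        relabel(child)


def add_technical_root(root: Node, final_punctuation=(".", "?", "!")) -> Node:
    """Insert the AuxS root above the original root and move final punctuation."""
    top = Node(form="#", afun="AuxS")
    top.children.append(root)
    # Assumption: sentence-final punctuation is the last child of the old root.
    if root.children and root.children[-1].form in final_punctuation:
        top.children.append(root.children.pop())
    return top
```

Because relabel works top-down, a preposition that is itself a member of a coordination first receives the suffixed function (for example Adv_Co) and then passes it on to its child.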
Annotation of English - Tectogrammatical Representation
The transformation of Penn Treebank phrase trees into
tectogrammatical representation consists of a structural
transformation and the assignment of a tectogrammatical
functor and a set of grammatemes to each node of the
resulting tree.
At the beginning of the structural transformation, the
initial dependency tree is created by a general
transformation procedure as described above. However, there
are differences in the notion of head between phrasal
grammar and the guidelines for tectogrammatical annotation;
for example, the head of a prepositional phrase is not a
preposition. In the next step, nodes corresponding to
functional (synsemantic) words, such as prepositions,
punctuation marks, determiners, subordinating conjunctions,
certain particles, auxiliary and modal verbs are marked as
"hidden" and information about them is stored in their
governing nodes. The well-formedness of a tectogrammatical
tree structure requires the valency frames to be complete:
apart from nodes that are realized on surface, there are
several types of "restored" nodes representing the
non-realized members of valency frames (cf. pro-drop
property of Czech and verbal condensations using gerunds
and infinitives both in Czech and English). For a partial
reconstruction of such nodes, we can use traces, which
allow us to establish coreferential links or general
participants of the valency frames.
For the assignment of tectogrammatical functors, we can use
rules taking into consideration POS tags
(e.g. PRP->APP), function tags
(JJ->RSTR, JJR->CPR, etc.) and lemma
("not"->RHEM, "both"->RSTR).
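As a rough illustration, such rules can be viewed as lookups keyed by lemma and by tag. The pairs below are copied from the examples above without distinguishing POS tags from function tags; the precedence of lemma rules over tag rules and the function name are assumptions of this sketch, not documented behavior.

```python
from typing import Iterable, Optional

# Example pairs quoted in the text; the real rule set is larger.
LEMMA_FUNCTORS = {"not": "RHEM", "both": "RSTR"}
TAG_FUNCTORS = {"PRP": "APP", "JJ": "RSTR", "JJR": "CPR"}


def assign_functor(lemma: str, tags: Iterable[str]) -> Optional[str]:
    """Return a functor for a node, or None if no example rule applies."""
    # Assumption: a lemma-specific rule overrides tag-based rules.
    if lemma.lower() in LEMMA_FUNCTORS:
        return LEMMA_FUNCTORS[lemma.lower()]
    for tag in tags:
        if tag in TAG_FUNCTORS:
            return TAG_FUNCTORS[tag]
    return None
```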
Grammateme Assignment - morphological (e.g. Tense, Degree
of Comparison) and syntactic grammatemes (e.g.
TWHEN_AFT(er)) are assigned to each node of the
tectogrammatical tree. The assignment of the morphological
attributes is based on Penn Treebank tags and reflects basic
morphological properties of the language. The syntactic
grammatemes capture more specific information about deep
syntactic structure. At the moment, there are no automatic
tools for the assignment of the latter ones.
The whole procedure is described in detail in
Kučerová and Žabokrtský (2002).
The quality of this transformation, measured against
manually annotated trees, is about 6% of wrongly attached
dependencies and 18% of wrongly assigned functors.
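Error rates of this kind can be computed by comparing an automatic tree with its manually annotated counterpart node by node. The sketch below assumes that each tree is given as a dictionary from node id to a (parent id, functor) pair and that both rates are taken over all nodes present in both trees; the exact evaluation protocol behind the figures above is not described here.

```python
from typing import Dict, Optional, Tuple

# node id -> (parent id or None for the root, functor)
Tree = Dict[str, Tuple[Optional[str], str]]


def error_rates(auto: Tree, gold: Tree) -> Tuple[float, float]:
    """Return (share of wrong attachments, share of wrong functors)."""
    shared = auto.keys() & gold.keys()
    if not shared:
        raise ValueError("the trees have no nodes in common")
    wrong_parent = sum(auto[n][0] != gold[n][0] for n in shared)
    wrong_functor = sum(auto[n][1] != gold[n][1] for n in shared)
    return wrong_parent / len(shared), wrong_functor / len(shared)
```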
Automatic annotation of Czech
Morphological tagging of Czech
The Czech translations of Penn Treebank were automatically
tokenized and morphologically tagged; each word form was
assigned a base form (lemma) by the tagging tools of
Hajič and Hladká (1998).
Analytical parsing of Czech
Czech analytical parsing consists of a statistical dependency
parser for Czech - either Collins parser (Hajič
et al., 1998) or Charniak parser (Charniak, 1999), both adapted to
dependency grammar - and a module for automatic analytical
functor assignment (Žabokrtský et al.,
2002). The output of both the Collins and the Charniak
parser is included. For efficiency reasons, sentences longer
than 60 words were excluded from the corpus parsed by the Collins
parser.
Transition to tectogrammatical representation of Czech
When building the tectogrammatical structure, the analytical
tree structure is converted into the tectogrammatical one.
These transformations are described by linguistic rules
(Böhmová, 2001). Then,
tectogrammatical functors are assigned by a C4.5 classifier
(Žabokrtský et al., 2002).
Manual Tectogrammatical Annotation of Czech and English
Since there are no guidelines for tectogrammatical annotation of
English yet, and in order to acquire some initial experience
before the work on the guidelines begins, a "gold standard"
tectogrammatical annotation of more than 1,000 sentences has
been done. These data are assigned morphological grammatemes
(the full set of values) and syntactic grammatemes, and the
nodes are reordered according to topic-focus-articulation
(information structure). The manually annotated sentences
comprise the whole development and evaluation test set. The
Czech counterpart of the test set has also been manually annotated
according to the guidelines for tectogrammatical annotation of
Czech.
Reader's Digest Corpus
This corpus contains parallel raw text of 450 articles from
the Reader's Digest, years 1993-1996. The Czech part is a
translation of the English one. Sentence pairs were aligned
automatically by Dan Melamed's SIMR/GMA
tool. Since the translations in this corpus are relatively
free, only 43,969 of the 54,091 aligned segments contain 1-to-1
sentence alignments.
Czech Monolingual Corpus
The electronic text sources have been provided by the Institute of Czech
National Corpus. All data originally come from news
articles published in the daily newspaper Lidové Noviny,
1994-1995. The inner format of the data corresponds to the
CSTS format. The corpus totals more than 39M tokens
(words proper + punctuation) in about 2385K sentences.
Dictionaries
Czech-English Probabilistic Dictionary
This dictionary was compiled from translations of lists of words
extracted from Czech and English monolingual frequency
dictionaries of base forms. For the translation of word lists we
used three different Czech-English manual dictionaries: two of
them were available on the Web (WinGED and GNU/FDL) and one was
extracted from Czech and English EuroWordNets. Word-translation
pairs were filtered and weighed taking into account the
reliability of the source dictionary, the frequencies of the
translations in the English monolingual corpus, and the
correspondence of the Czech and English POS tags. Furthermore, by
training GIZA++ translation model on the training part of the
PCEDT extended by the manual dictionaries, we obtained a
probabilistic Czech-English dictionary, more sensitive to the
specific domain of financial news typical for the Wall Street
Journal part of PTB.
The resulting Czech-English probabilistic dictionary contains
46,150 word-translation pairs.
The Czech monolingual frequency dictionary of base forms was
compiled from 455,689,875 running words from the Czech National Corpus.
The English monolingual frequency dictionary of base forms
was compiled from 310,308,540 running words from the North
American News Text Collection.
The following sources were used for the translation of lists of
words: the WinGED and GNU/FDL dictionaries and the Czech and
English EuroWordNets.
The dictionary contains one pair, consisting of a Czech word and its
English translation, per line.
Each line consists of six strings separated by space
characters:
-
Czech base form and the first letter of POS tag (joined by
a hash (#) character)
-
English base form and the first letter of POS tag (joined by
a hash (#) character)
-
Conditional probability P(En|Cz) from GIZA++ training
-
Conditional probability P(Cz|En) from GIZA++ training
-
Count of the Czech base form from the Czech monolingual
corpus
-
Count of the English base form from the English monolingual
corpus
Here is a sample of the dictionary:
Cz         En                    P(En|Cz)  P(Cz|En)  cnt(Cz)  cnt(En)
obvykle#R  typically#R           0.16      0.4       40074    6428
typicky#R  typically#R           0.8       0.6       3445     6428
typicky#R  characteristically#R  0.2       0.24      3445     174
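A line in this format can be read with a few lines of Python. This is a minimal sketch: the Entry type and its field names are introduced here for illustration, and a header line such as the one in the sample, if present in the actual file, would have to be skipped by the caller.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    cz_lemma: str
    cz_pos: str
    en_lemma: str
    en_pos: str
    p_en_given_cz: float
    p_cz_given_en: float
    cnt_cz: int
    cnt_en: int


def parse_entry(line: str) -> Entry:
    """Parse one dictionary line consisting of six whitespace-separated strings."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    cz, en, p_en_cz, p_cz_en, cnt_cz, cnt_en = fields
    # The base form and the first letter of the POS tag are joined by '#'.
    cz_lemma, cz_pos = cz.rsplit("#", 1)
    en_lemma, en_pos = en.rsplit("#", 1)
    return Entry(cz_lemma, cz_pos, en_lemma, en_pos,
                 float(p_en_cz), float(p_cz_en), int(cnt_cz), int(cnt_en))


entry = parse_entry("obvykle#R typically#R 0.16 0.4 40074 6428")
```

Splitting on the last "#" keeps multi-word base forms joined by underscores intact in the lemma field.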
NOTES:
-
Multi-word entries or translations such as "take_off" are
joined by the underscore character, for example:
"take_off#V"
-
The sum of P(En|Cz) probabilities for a particular Czech word
(Cz) is less than or equal to 1. It doesn't necessarily sum to 1
because some of the pairs from the GIZA++ table might have been
omitted from the dictionary.
-
One can obtain the probability of a Czech word by dividing
cnt(Cz) by the number of running words in the Czech monolingual
corpus (i.e. 455,689,875) and the probability of an English
word by dividing cnt(En) by the number of running words in
the English monolingual corpus (i.e. 310,308,540).
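The last two notes amount to simple arithmetic, sketched below with the rows from the sample. The corpus sizes are the ones stated above; the function names and the tuple layout of the rows are assumptions made for this example.

```python
from typing import Iterable, Tuple

CZ_RUNNING_WORDS = 455_689_875   # Czech monolingual corpus
EN_RUNNING_WORDS = 310_308_540   # English monolingual corpus


def unigram_probability(count: int, running_words: int) -> float:
    """Relative frequency of a base form in its monolingual corpus."""
    return count / running_words


def translation_mass(rows: Iterable[Tuple[str, str, float]], cz_word: str) -> float:
    """Sum of P(En|Cz) over all listed translations of one Czech word."""
    return sum(p for cz, _en, p in rows if cz == cz_word)


sample_rows = [
    ("obvykle#R", "typically#R", 0.16),
    ("typicky#R", "typically#R", 0.8),
    ("typicky#R", "characteristically#R", 0.2),
]

p_typicky = unigram_probability(3445, CZ_RUNNING_WORDS)
mass_obvykle = translation_mass(sample_rows, "obvykle#R")
```

In the sample, the listed translations of "obvykle" cover only 0.16 of the probability mass, which illustrates why the sums need not reach 1.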
Czech-English Dictionary of Word Forms
The PCEDT also comprises a Czech-English translation
dictionary of word forms. This dictionary was generated from
the Czech-English Probabilistic Dictionary and from lists of
word forms which occur more than 100 times in the Czech and
English monolingual corpora mentioned in the previous
section. This dictionary contains 496,673 word-translation
pairs. For example, for the word-translation pair
"bankéř" - "banker" in the Czech-English Probabilistic
Dictionary the Czech-English Dictionary of Word Forms contains
the following lines (the trailing grammatical comments are not a part of the data):
bankéř    banker   nominative sg.
bankéře   banker   genitive + accusative sg.
bankéře   bankers  accusative pl.
bankéřem  banker   instrumental sg.
bankéři   banker   dative + vocative + locative sg.
bankéři   bankers  nominative + vocative pl.
bankéřů   bankers  genitive pl.
bankéřům  bankers  dative pl.
Note that since Czech "bankéři" is ambiguous between the
singular (dative, vocative, or locative) and the plural
(nominative or vocative), it can be translated into English as
either singular or plural.
GNU/FDL English-Czech Dictionary
This is an English-Czech dictionary provided by Milan Svoboda
at http://slovnik.zcu.cz under the GNU
FDL (Free Documentation License). The version included in
PCEDT was downloaded on 12th February 2004 and contains
115,929 word-translation
pairs (and about 81,500 untranslated English
entries). Daily updates (of the file
slovnik_data.txt) are available at http://slovnik.zcu.cz/download.php.
The format of this dictionary is plain text. Except for comments
marked by '#', each line begins with an English word, followed by its Czech
translation in the second field (fields are separated by
tabs). The line may continue with additional information, such as
POS tag, domain of use, or author of the translation.
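A reader for this format could look like the sketch below. It assumes that comments occupy whole lines starting with '#' and that an untranslated entry has a missing or empty second field; neither detail is confirmed by the description above, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def parse_fdl_line(line: str) -> Optional[Tuple[str, str, List[str]]]:
    """Return (English word, Czech translation, extra fields) or None for comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # blank line or comment (assumed to start with '#')
    fields = line.split("\t")
    english = fields[0]
    # Assumption: untranslated entries lack the second field or leave it empty.
    czech = fields[1] if len(fields) > 1 else ""
    return english, czech, fields[2:]
```

When reading slovnik_data.txt, the file encoding has to be supplied by the caller, since it is not stated here.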
Data Sizes
Description of Data | #sentences | #words |
PTB Corpus: English part | | |
- manually annotated on tectogrammatical level | 1,257 | 33,980 |
- automatically transformed into analytical & tectogrammatical levels | 49,208 | 1,173,766 |
- retranslated by 4 different human translators | 515 | 13,143 |
PTB Corpus: Czech part | | |
- manually annotated on tectogrammatical level | 472 | 11,077 |
- automatically parsed into analytical & tectogrammatical levels | 21,656 | 487,920 |
Reader's Digest Corpus | 43,969 | 659,059 |
Czech Monolingual Corpus - Lidové Noviny | 2,385,000 | 39,000,000 |
Translation Dictionaries | #entry-translation pairs |
- Czech-English probabilistic dictionary | 46,100 |
- Czech-English dictionary of word forms | 496,673 |
- English-Czech dictionary under GNU/FDL | 115,929 |
Tools
SMT Quick Run
SMT Quick Run is a package of scripts and instructions for
building a statistical machine translation system from the PCEDT or
any other parallel corpus.
Follow the instructions on the SMT Quick Run Package page.
Tree Editor TrEd
Tree Editor (TrEd) is a graphical editor and viewer of tree
structures. Internally, TrEd works with files in the so-called
FS format, which is used for analytical and tectogrammatical
dependency trees. TrEd has a modular architecture allowing
custom input/output modules to be created in order to support
other data formats.
TrEd supports the following platforms:
- Windows 95/98/ME or Windows NT/2000/XP (TM)
- Linux
- BSD, UNIX, Solaris (TM) and other UNIX-based systems
See installation instructions and documentation at TrEd Package page
or at TrEd Homepage.
TrEd handles files in both FS (*.fs) and CSTS-SGML (*.csts)
formats.
NetGraph
NetGraph is a multi-platform client-server application
allowing you to browse, select and view analytical and
tectogrammatical dependency trees. It can either view Czech
trees from Prague Dependency Treebank (PDT) on the remote
server located at the Institute of Formal and Applied
Linguistics in Prague, or you can install your own server
for viewing trees from PCEDT.
See the NetGraph Client Manual and the NetGraph Homepage
for instructions on how to install and set up the NetGraph
client, and the NetGraph Server Manual for installing the server.
NetGraph reads files in FS-format (*.fs).
References
About Prague Czech-English Dependency Treebank
-
Martin Čmejrek, Jan Cuřín, Jiří Havelka. 2003. Treebanks in Machine
Translation, In Proceedings of The Second Workshop
on Treebanks and Linguistic Theories, Växjö, Sweden,
pp. 209-212. Available in PostScript or PDF.
-
Jan Cuřín, Martin Čmejrek, Jiří Havelka, Vladislav
Kuboň. 2004. Building Parallel Bilingual Syntactically
Annotated Corpus, In Proceedings of The First International
Joint Conference on Natural Language Processing, Hainan
Island, China, pp. 141-146. Available in PDF.
-
Martin Čmejrek, Jan Cuřín, Jiří Havelka. 2004. Prague
Czech-English Dependency Treebank: Any Hopes for a Common
Annotation Scheme?, In HLT/NAACL 2004 Workshop:
Frontiers in Corpus Annotation, Boston, Massachusetts,
pp. 47-54. Available in PostScript or PDF.
-
Martin Čmejrek, Jan Cuřín, Jiří Havelka, Jan Hajič, Vladislav
Kuboň. 2004. Prague Czech-English Dependency Treebank:
Syntactically Annotated Resources for Machine Translation,
In 4th International Conference on Language Resources and
Evaluation, Lisbon, Portugal. Available in PostScript or PDF.
Prague Dependency Treebank
-
Eva Hajičová, Jarmila Panevová, and
Petr Sgall. 2002. A Manual for Tectogrammatic Tagging
of the Prague Dependency Treebank. Technical Report
TR-2000-09, ÚFAL MFF UK, Prague, Czech
Republic.
-
Jan Hajič et al., 2001. A Manual for Analytic
Layer Tagging of the Prague Dependency Treebank,
Prague, Czech Republic, English translation of the
original Czech version. Available in PDF at
http://quest.ms.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/aman-en/index.html
Statistical Machine Translation
-
Yaser Al-Onaizan, Jan Cuřín, Michael Jahr,
Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef
Och, David Purdy, Noah A. Smith, David Yarowsky. 1999.
Statistical Machine Translation. Final Report,
JHU Summer Workshop'99. Available at
http://www.clsp.jhu.edu/ws99/projects/mt/final_report/mt-final-report.ps
-
Ulrich Germann. 2003. Greedy Decoding for Statistical
Machine Translation in Almost Linear Time. In
Proceedings of
HLT-NAACL-2003, Edmonton, Canada.
-
Franz Josef Och and Hermann Ney. 2003. A Systematic
Comparison of Various Statistical Alignment Models,
Computational Linguistics, volume 29, number 1,
pp. 19-51.
Structural Machine Translation
-
Martin Čmejrek, Jan Cuřín, Jiří Havelka. 2002. Czech-English
Dependency-based Machine Translation: Data Preparation
for the Starting up Experiments, The Prague
Bulletin of Mathematical Linguistics, Volume 78, pp.
103-116. Available in PostScript or PDF.
-
Martin Čmejrek, Jan Cuřín, Jiří Havelka. 2003. Czech-English
Dependency-based Machine Translation, In
Proceedings of the 10th Conference of The European
Chapter of the Association for Computational
Linguistics, Budapest, Hungary, pp. 83-90. Available
in PostScript or PDF.
Other Related References
-
Alena Böhmová. 2001.
Automatic Procedures in Tectogrammatical Tagging.
The Prague Bulletin of Mathematical Linguistics,
76.
-
Eugene Charniak. 1999. A
maximum-entropy-inspired parser. Technical Report
CS-99-12.
-
Jan Hajič, Barbora Vidová-Hladká.
1998. Tagging Inflective
Languages: Prediction of Morphological Categories for a
Rich, Structured Tagset. In Proceedings of the
Conference COLING-ACL '98, Montreal, Canada.
-
Jan Hajič, Barbora Vidová-Hladká, Dan Zeman, Michael Collins,
Lance Ramshaw, Christoph Tillmann, Eric Brill, D. Jones,
C. Kuo, O. Schwartz. 1998. Core Natural Language
Processing Technology Applicable to Multiple
Languages. The Prague Bulletin of Mathematical
Linguistics, 70.
-
Zdeněk Žabokrtský, Petr Sgall, Sašo
Džeroski. 2002.
A Machine Learning Approach to Automatic Functor
Assignment in the Prague Dependency Treebank. In
Proceedings of the Third International Conference on
Language Resources and Evaluation (LREC 2002), pp.
1513--1520. ELRA.
-
Ivona Kučerová, Zdeněk Žabokrtský.
2002. Transforming Penn Treebank phrase trees into
(Praguian) tectogrammatical dependency trees, The
Prague Bulletin of Mathematical Linguistics, 78. On-line version at http://ckl.mff.cuni.cz/~zabokrtsky/wsj2tgts