README file for the Czech-English Dependency Treebank, version 1.0
5/19/2004
-------------------------------------------------------------------

Authors: Martin Cmejrek, Jan Curin, Jan Hajic, Jiri Havelka,
         Vladislav Kubon, Zdenek Zabokrtsky

Project Leader: Jan Hajic

Address: Center for Computational Linguistics and
         Institute of Formal and Applied Linguistics
         Malostranské nám. 25
         118 00 Praha 1
         Czech Republic

         http://ckl.mff.cuni.cz
         http://ufal.mff.cuni.cz

         phone:  +420 221 914 278
         fax:    +420 221 914 309
         e-mail: pcedt@ufal.mff.cuni.cz

License: http://ufal.mff.cuni.cz/pcedt/pcedt1.0-register.html

(c) 2002-2004 Center for Computational Linguistics and
              Institute of Formal and Applied Linguistics
-------------------------------------------------------------------


INTRODUCTION

The Prague Czech-English Dependency Treebank version 1.0 (PCEDT 1.0) is
a corpus of Czech-English parallel resources suitable for experiments in
machine translation, with a special emphasis on dependency-based
(structural) translation. Evaluation data are provided for
Czech-to-English systems.


DOCUMENTATION

doc/PCEDT_main.html    ... browsable documentation of the CD
doc/PCEDT_license.html ... PCEDT 1.0 license
doc/csts.html          ... link to csts document type description
doc/fs.html            ... FS-format description
doc/README.txt         ... this file


DATA

Data on PCEDT 1.0 are in FS-format, CSTS-format (both formats are
described in the doc/ directory), or raw text. Czech texts are encoded
in ISO-8859-2; English texts are in ASCII. Use TrEd or NetGraph to view
trees in both CSTS and FS formats.

A brief description of the individual data packages follows. For a more
detailed description, please refer to the separate README.txt files
located in the appropriate directories.
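Because Czech texts are stored in ISO-8859-2, they must be decoded explicitly by tools that assume another encoding. The following is a minimal Python sketch of one way to read such a file and re-encode it as UTF-8. The function names and the output path are hypothetical and are not part of the CD.

```python
from pathlib import Path

# Encoding stated in this README for Czech texts.
CZECH_ENCODING = "iso-8859-2"


def read_czech_text(path):
    """Return the contents of an ISO-8859-2 encoded file as a Unicode string."""
    return Path(path).read_text(encoding=CZECH_ENCODING)


def convert_to_utf8(src, dst):
    """Re-encode one ISO-8859-2 file as UTF-8.

    Both arguments are caller-supplied paths; dst is a hypothetical
    output location outside the read-only CD.
    """
    Path(dst).write_text(read_czech_text(src), encoding="utf-8")
```

English texts are plain ASCII, which is a subset of both encodings, so the same functions should read them unchanged.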
Czech-English Parallel Penn Treebank Corpus
-------------------------------------------

For the purposes of comparable evaluation, data in this corpus are
divided into three sets: development, evaluation, and training.

data/PTB_corpus/original/En_[development|evaluation|training] ...
    original PTB-style annotation of English
data/PTB_corpus/raw/[Cz|En]_[development|evaluation|training] ...
    49k sentences of English raw texts and 21k sentences of their
    Czech translations
data/PTB_corpus/reference_translations/En_[development|evaluation] ...
    4 reference retranslations of 500 sentences from Czech to English
data/PTB_corpus/NIST_format/[Cz|En]_[development|evaluation|training] ...
    22k parallel sentences in NIST MT evaluation format (dtd/mt.dtd)
data/PTB_corpus/automatic_tagged/Cz_[development|evaluation|training] ...
    morphologically analyzed and automatically tagged Czech
    translations
data/PTB_corpus/automatic_AR/En_[development|evaluation|training] ...
    automatic conversion of the English part into analytical
    representation
data/PTB_corpus/automatic_AR/Cz_[development|evaluation|training] ...
    Czech part, automatically parsed by the Collins and Charniak
    parsers for Czech
data/PTB_corpus/automatic_TR/En_[development|evaluation|training] ...
    automatic conversion of the English part into tectogrammatical
    representation
data/PTB_corpus/automatic_TR/Cz_[development|evaluation|training] ...
    automatic conversion of Czech AR trees into tectogrammatical
    representation
data/PTB_corpus/manual_TR/En_[development|evaluation|training] ...
    manual annotation on the tectogrammatical level of 1275 English
    sentences
data/PTB_corpus/manual_TR/Cz_[development|evaluation|training] ...
    manual annotation on the tectogrammatical level of 515 Czech
    sentences

Other Text Corpora
------------------

data/RD_corpus/raw/[Cz|En|Align] ...
    Reader's Digest parallel corpus
data/Czech_raw_text/ ...
    39M-word Czech monolingual corpus (tokenized, in csts-format)

Translation Dictionaries
------------------------

data/Dictionaries/ ...
    Czech-English Translation Dictionaries


TOOLS

For instructions on how to install and use the tools included on the CD,
please refer to:

tools/TrEd/Doc/TrEd.html
    for TrEd
tools/NetGraph/Doc/netgraph_manual.html
    for NetGraph client
tools/NetGraph/Doc/netgraph_server_install.html
    for NetGraph server
tools/SMT_QuickRun/Doc/SMT_QuickRun.html
    for SMT Quick Run Package
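

The directory listings in the DATA section use a bracketed shorthand such as [Cz|En]_[development|evaluation|training] to stand for several directory names at once. As an illustration only, the Python sketch below expands such a pattern into concrete names. The helper expand_pattern is hypothetical and is not one of the tools on the CD, and whether every expanded directory is actually present should be checked against the CD itself.

```python
import itertools
import re


def expand_pattern(pattern):
    """Expand every [a|b|c] group in pattern into all combinations."""
    # re.split with a capturing group returns literal text at even
    # indices and the bracket contents at odd indices.
    parts = re.split(r"\[([^\]]*)\]", pattern)
    choices = [
        part.split("|") if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


# Pattern copied from the listing in the DATA section.
paths = expand_pattern(
    "data/PTB_corpus/raw/[Cz|En]_[development|evaluation|training]"
)
for p in paths:
    print(p)
```

For the pattern shown, this yields six names, from data/PTB_corpus/raw/Cz_development to data/PTB_corpus/raw/En_training.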