Corpus Name: English Chinese Translation Treebank, v. 1.0
Language(s): English, translated from Chinese
Authors: Ann Bies, Martha Palmer, Colin Warner, Justin Mott
Publication Type: Text
Source Data Type: Text
Source Data Genre: Newswire
Project: TIDES, NSF
Data Source Type: Text
File Format(s): Text
Size: 7.5 MB
Tokens: 146,300 words
Applications: Natural language processing, parsing, tagging, machine translation

Description:

This release of the English Chinese Treebank consists of 146,300 words in
325 files of individual Xinhua news stories (corresponding to the Xinhua
data in the Chinese Treebank 5.0, LDC Catalog No.: LDC2005T01) that have
been translated into English, part-of-speech tagged, and treebanked. The
files were compressed using gzip.

The source files for the treebank annotation contain the final updated
translation of these files. Translation errors that prevented complete
treebank annotation have been corrected. The translation and annotation
were completed in October 2004, and this release supersedes any earlier
translation.

The data can be found in the following directories:

Penn Treebank-style files (converted from WordFreak annotation files)
    /data/pennTB-style-trees/

Penn Treebank-style files (reformatted and indented for easier human reading)
    /data/reformatted-PTBstyle/

WordFreak annotated files (the files as they were annotated using the
WordFreak annotation tool)
    /data/annotated-files/

Source text files (corrected translations)
    /data/rawtext-files/

The guidelines followed for both part-of-speech and treebank annotation
are essentially Penn Treebank II style, with two notable differences:

1. POS: tokenization of hyphenated items ("New York-based" has been
   replaced by "New York - based", for example), and the addition of the
   HYPH and AFX tags necessitated by this change in tokenization

2. Treebank: the addition of the node label NML for sub-NP nominal
   constituents (replacing NX and most NP-internal NAC)

More detailed addenda to the Penn Treebank II guidelines can be found in
    /docs/pos-guidelines-addenda.txt
and
    /docs/treebank-guidelines-addenda.txt

The Penn Treebank II style guidelines (Bracketing Guidelines for Treebank
II Style, Eds: Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre, Penn
Treebank Project, University of Pennsylvania, CIS Technical Report
MS-CIS-95-06, 1995) and the Penn Treebank part-of-speech tagging
guidelines are available at
    http://www.cis.upenn.edu/~treebank/

The WordFreak annotation tool is available at
    http://wordfreak.sourceforge.net/

Project manager:
    Ann Bies, bies@ldc.upenn.edu

Annotators:
    Justin Mott
    Christine Brisson
    Alexandra Kinyon
    Colin Warner

Corrected translations:
    Susan Converse
    Tsan-Kuang Lee

-----------------------------
Ann Bies
bies@ldc.upenn.edu
Linguistic Data Consortium
January 9, 2007
-----------------------------
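
Illustration of the hyphen tokenization mentioned among the POS differences
above ("New York-based" becoming "New York - based"). This is only a minimal
sketch and is not the tool used to prepare this release. The function names
split_hyphens and tokenize are hypothetical, and the rule shown (split on
every hyphen) is an assumption; the actual conventions, including any
exceptions and the use of the HYPH and AFX tags, are described in
/docs/pos-guidelines-addenda.txt.

```python
import re


def split_hyphens(token):
    """Split one token on hyphens, keeping each hyphen as its own token.

    Assumption: every hyphen is split off. The real guidelines may
    treat some hyphenated items differently.
    """
    # The capturing group makes re.split keep the hyphens in the result.
    parts = re.split(r"(-)", token)
    # Drop the empty strings produced by leading or trailing hyphens.
    return [p for p in parts if p]


def tokenize(text):
    """Whitespace-tokenize text, then split hyphenated tokens."""
    tokens = []
    for tok in text.split():
        tokens.extend(split_hyphens(tok))
    return tokens


if __name__ == "__main__":
    print(" ".join(tokenize("New York-based")))
```

In the released data the split-off hyphen would then receive the HYPH tag;
tagging is not attempted in this sketch.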