Chinese Treebank 4.0 (CTB 4.0)

Principal Investigator: Martha Palmer
Project Manager: Fu-Dong Chiou
Annotators: Fu-Dong Chiou, Nianwen Xue & Tsan-Kuang Lee
Programming support: Jeremy LaCivita

Chinese Treebank is an ongoing project. CTB 1.0 (100K words) was first
published in 2000. CTB 2.0 corrected an error in CTB 1.0 and was published
in 2001. 150K more words were then added to the corpus and the whole corpus
of 250K words was released as an e-corpus in 2003 (CTB 3.0, LDC2003E06).
Another 150K words have been added to the corpus since then, to create
CTB 4.0 (400K words).

The following table summarizes the development of the CTB corpus:

Filename     CTB 1.0   CTB 2.0   CTB 3.0   CTB 4.0
1-325           x         x         x         x
400-454         x         x         x         x
500-554                             x         x
590-596                             x         x
600-885                             x         x
900-931                             x         x
1001-1078                                     x

22 new characters have been added to the CTB 4.0 release (as opposed to the
original source data). They are predominately the result of converting the
original ASCII code of some punctuation marks or numerals (i.e., 0-9) into
GB code. The following is the list of all 22 tokens:

ASCII-to-GB Comma , [9 tokens]

filename          location
chtb_006.fid
chtb_129.fid
chtb_314.fid
chtb_553.fid
chtb_553.fid
chtb_553.fid
chtb_631.fid
chtb_745.fid
chtb_815.fid

ASCII-to-GB Closing Double Quotation " [1 token]

filename          location
chtb_715.fid

ASCII-to-GB Arabic Numbers (1, 1, 2 & 3) [4 tokens]

filename          location
chtb_830.fid

ASCII -LBR- to GB ( & ASCII -RBR- to GB ) [2 tokens each file]

filename          location
chtb_088.fid
chtb_102.fid
chtb_140.fid
chtb_159.fid

February 27, 2004
Fu-Dong Chiou
chioufd@unagi.cis.upenn.edu
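
Illustrative note on the ASCII-to-GB conversions listed above: the following is a minimal Python sketch of what such a token-level conversion might look like. It is not the tooling used to prepare CTB 4.0, which this README does not describe. The names ASCII_TO_GB, to_gb_token and to_gb_bytes are hypothetical, and the sketch assumes that "GB code" refers to the full-width GB2312 forms of the comma, digits and parentheses, and to the GB2312 closing double quotation mark.

```python
# Hypothetical mapping from ASCII tokens to their assumed GB2312 counterparts.
# Only the token types named in the README are covered.
ASCII_TO_GB = {
    ",": "\uff0c",      # full-width comma
    '"': "\u201d",      # closing double quotation mark
    "-LBR-": "\uff08",  # ASCII bracket token -> full-width left parenthesis
    "-RBR-": "\uff09",  # ASCII bracket token -> full-width right parenthesis
}
# ASCII digits 0-9 -> full-width digits (U+FF10 .. U+FF19).
ASCII_TO_GB.update({str(d): chr(0xFF10 + d) for d in range(10)})


def to_gb_token(token: str) -> str:
    """Return the assumed GB counterpart of an ASCII token, or the token unchanged."""
    return ASCII_TO_GB.get(token, token)


def to_gb_bytes(token: str) -> bytes:
    """Encode the converted token with Python's built-in gb2312 codec."""
    return to_gb_token(token).encode("gb2312")


if __name__ == "__main__":
    for tok in [",", '"', "1", "2", "3", "-LBR-", "-RBR-"]:
        print(repr(tok), "->", to_gb_token(tok), to_gb_bytes(tok).hex())
```

Tokens outside the mapping are returned unchanged, so the sketch only touches the punctuation marks and numerals named in this README.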