1. Publication title: Chinese Treebank 7.0 (CTB7.0)

2. Authors: Nianwen Xue, Zixin Jiang, Xiuhong Zhang, Martha Palmer, Fei Xia, Fu-Dong Chiou, Meiyu Chang
   Contact: Nianwen Xue

3. Data type: text

4. Genres (file ID ranges):
   Newswire: [0001-0325, 0400-0454, 0500-0540, 0600-0885, 0900-0931, 4000-4050]
   Magazine articles: [0590-0596, 1001-1151]
   Broadcast news: [2000-3145, 4051-4111]
   Broadcast conversations: [4112-4197]
   Weblogs: [4198-4411]

5. Project: the Chinese Treebank Project (http://www.cs.brandeis.edu/~clp/ctb)

6. Applications: natural language processing, parsing, information extraction, machine translation, linguistic analysis

7. Language: Chinese

8. Special license: None

9. Grant numbers and funding agencies: This research was funded by DOD MDA902-97-C-0307, DARPA TIDES N66001-00-1-8915, and DARPA GALE HR0011-06-0022.

10. Copyright:
   Portions Copyright 1994-1998, Xinhua News Agency
   Portions Copyright 1997, Department of Information Services, Hong Kong Special Administrative Region
   Portions Copyright 1996-1998 & 2000-2001, Sinorama Magazine

11. Description of the corpus structure and data attributes: This release contains 2,448 text files with 51,447 sentences, 1,196,329 words, and 1,931,381 hanzi (Chinese characters). The data is provided in UTF-8 encoding, and the annotation uses Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging, and bracketing guidelines. The data is provided in four formats: raw text, word-segmented, word-segmented and POS-tagged, and syntactically bracketed.

12. Quality control: The data is partially double-annotated and adjudicated. The double-annotated and adjudicated files are listed in 'gold-standard-files.txt'; they are the recommended test files for tool evaluation (e.g., automatic word segmentation, POS-tagging, and parsing). All files have been automatically verified and manually checked.

13. References:
   Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering, 11(2):207-238.
   Nianwen Xue, Fu-Dong Chiou, and Martha Palmer. 2002. Building a Large-Scale Annotated Chinese Corpus. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan.
   Fei Xia, Martha Palmer, Nianwen Xue, Mary Ellen Okurowski, John Kovarik, Fu-Dong Chiou, Shizhe Huang, Tony Kroch, and Mitch Marcus. 2000. Developing Guidelines and Ensuring Consistency for Chinese Text Annotation. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece.
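
Example (not part of the release): the genre ranges in section 4 can be turned into a small lookup, which may help when splitting the corpus by genre. The sketch below is illustrative only. The names GENRE_RANGES and genre_of are hypothetical, and because this document does not describe the file naming scheme, the function takes the numeric file ID directly; extracting that ID from a file name is left to the user.

```python
# File ID ranges copied from section 4 (Genres). The dictionary keys are
# illustrative labels chosen for this sketch, not official genre codes.
GENRE_RANGES = {
    "newswire": [(1, 325), (400, 454), (500, 540), (600, 885), (900, 931), (4000, 4050)],
    "magazine": [(590, 596), (1001, 1151)],
    "broadcast_news": [(2000, 3145), (4051, 4111)],
    "broadcast_conversation": [(4112, 4197)],
    "weblog": [(4198, 4411)],
}


def genre_of(file_id):
    """Return the genre label for a file ID, or None if no listed range contains it.

    Accepts an int or a zero-padded string such as "0590".
    """
    n = int(file_id)
    for genre, ranges in GENRE_RANGES.items():
        for low, high in ranges:
            if low <= n <= high:
                return genre
    return None


if __name__ == "__main__":
    for fid in ("0001", "0590", "4111", "4112", "4411"):
        print(fid, genre_of(fid))
```

IDs that fall in the gaps between the listed ranges (for example 0350) return None rather than raising an error, so callers can decide how to handle files outside the documented ranges.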