Chinese Treebank 8.0 consists of approximately 1.5 million words of annotated
and parsed text from Chinese newswire, government documents, magazine articles,
various broadcast news and broadcast conversation programs, web newsgroups and
The Chinese Treebank project began at the University of Pennsylvania in 1998,
continued at the University of Colorado and then moved to Brandeis
University. The project goal is to provide a large, part-of-speech tagged
and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank
1.0, contained 100,000 syntactically annotated words from Xinhua News Agency
newswire. A corrected version was released in 2001 as Chinese
Treebank 2.0 (LDC2001T11), again with approximately 100,000 words.
LDC released Chinese Treebank
4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in
2004. A year later, LDC published the 500,000-word Chinese
Treebank 5.0 (LDC2005T01). Chinese
Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words.
Chinese Treebank 7.0 (LDC2010T08),
released in 2010, added new annotated newswire data, broadcast material and
web text, bringing the total to approximately one million words. Chinese Treebank 8.0
adds new annotated data from newswire, magazine articles and government documents.
There are 3,007 text files in this release, containing 71,369 sentences, 1,620,561
words and 2,589,848 characters (hanzi or foreign). The data is provided in UTF-8
encoding, and the annotation has Penn Treebank-style labeled brackets. Details
of the annotation standard can be found in the segmentation, POS-tagging
and bracketing guidelines included in this release. The data is provided in four formats: raw
text, word-segmented, POS-tagged and syntactically bracketed.
All files were automatically verified and manually checked.
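
As a rough illustration of how the four formats relate, the sketch below parses one labeled-bracket tree and derives POS-tagged, word-segmented and raw views from it. The sample sentence is invented for illustration and is not taken from the corpus. The function names, the word_TAG underscore join and the skipping of -NONE- empty categories are assumptions made for this sketch; the guidelines included in the release remain the authority on the actual file layout.

```python
import re

# Hypothetical bracketed sentence, invented for illustration (not corpus data).
SAMPLE = "( (IP (NP-SBJ (NR 中国)) (VP (VV 发展) (NP-OBJ (NN 经济))) (PU 。)) )"

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse(text):
    """Parse one labeled-bracket tree into nested (label, children) tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = ""  # the outermost wrapper bracket carries no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def tagged_words(tree):
    """Yield (word, tag) pairs left to right, skipping -NONE- nodes (assumed)."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        if label != "-NONE-":
            yield children[0], label
        return
    for child in children:
        if isinstance(child, tuple):
            yield from tagged_words(child)


pairs = list(tagged_words(parse(SAMPLE)))
pos_tagged = " ".join(f"{word}_{tag}" for word, tag in pairs)  # assumed word_TAG join
segmented = " ".join(word for word, _ in pairs)
raw = "".join(word for word, _ in pairs)

print(pos_tagged)
print(segmented)
print(raw)
```

The bracketed tree is treated as the richest view here, with the other three derived by discarding structure, then tags, then word boundaries. Because the files are UTF-8, reading them with `open(path, encoding="utf-8")` avoids depending on the platform default encoding.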
This work was supported in part by the Defense Advanced Research Projects Agency
GALE Program Grant No. HR0011-06-0022 and BOLT Program No. HR0011-11-C-0145.
The content of this publication does not necessarily reflect the position or
the policy of the Government, and no official endorsement should be inferred.
None at this time.
Portions © 2006 Agence France Presse, © 2006 Anhui TV, © 2005
Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, ©
2000-2001, 2005-2006 China Central TV, © 2000-2001 China National Radio,
© 2006 Chinanews.com, © 2000-2001 China Television System, ©
2006 Guangming Daily, © 2006 National Broadcasting Company, Inc., ©
2006 New Tang Dynasty TV, © 2006 Peoples Daily Online, © 2005-2006
Phoenix TV, © 1996-2001 Sinorama Magazine, © 1997 The Government of
the Hong Kong Special Administrative Region, © 1994-1998, 2006 Xinhua News
Agency, © 2001, 2004, 2005, 2007, 2009, 2010, 2013 Trustees of the University