Item Name: Chinese Proposition Bank 1.0
Author(s): Martha Palmer, Nianwen Xue, Zixin Jiang, Meiyu Chang
LDC Catalog No.: LDC2005T23
ISBN: 1-58563-354-2
ISLRN: 731-738-468-307-2
Release Date: September 20, 2005
Member Year(s): 2005
DCMI Type(s): Text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): natural language processing
Language(s): Mandarin Chinese
Language ID(s): cmn
Citation: Palmer, Martha, et al. Chinese Proposition Bank 1.0 LDC2005T23. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Chinese Proposition Bank 1.0 was developed by the Linguistic Data Consortium (LDC) and contains predicate-argument relations for approximately 37,000 propositions annotated in 250,000 words of Chinese text.

Chinese Proposition Bank 1.0 is the first public release of the Penn Chinese Proposition Bank project, which aims to create a corpus of text annotated with information about basic semantic propositions. Specifically, predicate-argument relations have been added to the syntactic trees of the first update to Chinese Treebank 5.0 (LDC2005T01) as an additional layer of annotation.

Two later versions of this corpus have also been released.


Chinese Proposition Bank 1.0 includes annotations for files chtb_001.fid through chtb_931.fid, that is, the first 250K words of the first update of Chinese Treebank 5.0, for a total of 37,183 propositions. Auxiliary verbs are not annotated. Some verbs have both light-verb and non-light-verb uses; in these cases only the non-light-verb uses are annotated. All annotations in this release are the result of double-blind annotation followed by adjudication of differences.

The following table summarizes the framesets in CPB 1.0:

Statistic                          Value
Total verbs framed                 4,865
Total framesets                    5,298
Verbs with multiple framesets        351
Average framesets per verb          1.09
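
As a quick consistency check on the figures above, the average can be recomputed from the two totals. The short Python sketch below does that. The share of verbs with more than one frameset is not stated in the release notes; it is derived here from the same counts, and the variable names are illustrative only.

```python
# Counts taken from the frameset summary for CPB 1.0.
total_verbs_framed = 4865
total_framesets = 5298
verbs_with_multiple_framesets = 351

# Average number of framesets per framed verb.
average_framesets = total_framesets / total_verbs_framed

# Derived figure (not stated in the summary): fraction of framed verbs
# that have more than one frameset.
multi_frameset_share = verbs_with_multiple_framesets / total_verbs_framed

print(f"{average_framesets:.2f}")     # 1.09
print(f"{multi_frameset_share:.1%}")  # 7.2%
```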

Each predicate-argument structure is represented as one line of space-separated columns. The columns are as follows:

  • ctb-filename: the name of the file in Chinese Treebank 5.0, update 1.
  • sentence: the number of the sentence in the file (starting with 0).
  • terminal: the number of the terminal in the sentence that is the location of the verb.
  • tagger: the name of the annotator, or "gold" if the instance has been double-annotated and adjudicated.
  • frameset: identifier from the frames file of the verb.
  • inflection: a carry-over from the Penn English Proposition Bank; it carries no annotation in the Chinese Proposition Bank.
  • arglabel: a string representing the annotation associated with a particular argument or adjunct of the proposition in three columns: address of constituent, label, and functional tag.
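
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not an official reader for the corpus. It assumes that the first six columns are fixed and that every remaining column is one arglabel whose three parts (address, label, functional tag) are joined by hyphens, with the functional tag optional. The class names, function names, and the sample line are all hypothetical and were invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArgLabel:
    address: str             # address of the constituent in the tree
    label: str               # argument or adjunct label
    functag: Optional[str]   # functional tag, if one is present


@dataclass
class Proposition:
    ctb_filename: str
    sentence: int            # sentence number in the file, starting at 0
    terminal: int            # terminal number of the verb in the sentence
    tagger: str              # annotator name, or "gold"
    frameset: str
    inflection: str          # carried over from English PropBank, unused here
    args: List[ArgLabel]


def parse_arglabel(field: str) -> ArgLabel:
    # Assumption: parts are hyphen-joined as address-label[-functag].
    parts = field.split("-", 2)
    address = parts[0]
    label = parts[1] if len(parts) > 1 else ""
    functag = parts[2] if len(parts) > 2 else None
    return ArgLabel(address, label, functag)


def parse_line(line: str) -> Proposition:
    cols = line.split()
    if len(cols) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(cols)}")
    return Proposition(
        ctb_filename=cols[0],
        sentence=int(cols[1]),
        terminal=int(cols[2]),
        tagger=cols[3],
        frameset=cols[4],
        inflection=cols[5],
        args=[parse_arglabel(c) for c in cols[6:]],
    )


# Hypothetical line, invented for illustration; not taken from the corpus.
example = "chtb_001.fid 0 5 gold verb.01 ----- 0:1-ARG0 5:0-rel 6:1-ARGM-TMP"
prop = parse_line(example)
print(prop.frameset, [(a.address, a.label, a.functag) for a in prop.args])
```

Keeping the address as an uninterpreted string is deliberate: the description above does not spell out its internal syntax, so the sketch leaves it for the caller to resolve against the treebank file.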


For an example of the data in this corpus, please view this sample (XML).


