This directory contains the data of the UPenn Chinese Propbank. This data
is collected as an additional layer of annotation on the Penn Chinese
Treebank 5.1 (ctb5.1), representing the predicate-argument structure of
verbs.

Below is a list of each file and a description of its contents.

File            Description
--------------------------------------------------------------------------------------
cpb1.0.txt      The annotated data, file format described below. This
                includes the annotations for files chtb_001.fid to
                chtb_931.fid, the first 250K words of the Penn Chinese
                TreeBank.

frames/         Frame files that serve as Lexical Guidelines. The file
                format for each verb is detailed in frames/verb.dtd.

html_frames/    Frame files in HTML format.

verbs.txt       Verb list extracted from files chtb_001.fid to
                chtb_931.fid of the Penn Chinese Treebank.

NOTES.txt       Release notes for Chinese Proposition Bank 1.0 (cpb-1.0).
--------------------------------------------------------------------------------------

Annotation Format.

The cpb1.0.txt file contains predicate-argument (P-A) structures of verbs.
Each P-A structure is represented in a line of space-separated columns.
The columns are as follows:

    ctb-filename sentence terminal tagger frameset inflection arglabel arglabel ...

The content of each column is described in detail below.

ctb-filename    the name of the file in the Penn Chinese TreeBank,
                version 5.1 (ctb5.1)

sentence        the number of the sentence in the file (starting with 0)

terminal        the number of the terminal in the sentence that is the
                location of the verb. Note that the terminal number
                counts empty constituents as terminals and starts with 0.
                This will hold for all references to terminal number in
                this description.
                An example:

                (IP (NP-SBJ (DNP (NP (NN 货币)(NN 回笼))(DEG 的))(NP (NN 增加)))(PU ,)
                    (VP (PP-BNF (P 为)(IP (NP-SBJ (-NONE- *PRO*))(VP (VV 平抑)(NP-OBJ (NP (DP (DT 全))
                    (NP (NN 区)))(NP (NN 物价))))))(VP (VV 发挥)(AS 了)(NP-OBJ (NN 作用))))
                    (PU 。))

                The terminal numbers:

                货币 0   回笼 1   的 2   增加 3   , 4   为 5   *PRO* 6   平抑 7
                全 8   区 9   物价 10   发挥 11   了 12   作用 13   。 14

tagger          the name of the annotator, or "gold" if the annotation
                has been double annotated and adjudicated.

frameset        The frameset identifier from the frames file of the verb.
                For example, '发挥.01' refers to the frameset ID "f1" in
                the frame file for the verb '发挥'
                (frames/0930-fa-hui.xml). The names of the frame files
                are composed of a numerical id plus the pinyin of the
                verb. The numerical ids can be found in the enclosed
                verb list (verbs.txt).

inflection      The inflection field is a carry-over from the Penn
                English Proposition Bank, and is set to '-----', meaning
                no annotation in the Chinese Proposition Bank.

arglabel        A string representing the annotation associated with a
                particular argument or adjunct of the proposition. Each
                arglabel is dash '-' delimited and has the following
                columns:

    1) column for the address of a constituent

       The address of the constituent is in one of the following three
       forms.

       form 1: terminal number:height

           A single node in the syntactic tree of the sentence in
           question, identified by the first terminal the node spans
           together with the height from that terminal to the syntax
           node (a height of 0 represents a terminal).

           For example, in the sentence

           (IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*))
               (CP (IP (NP-SBJ (-NONE- *T*-1))
               (VP (ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业)))
               (NP-ADV (NN 绝大部分))(NP-SBJ (NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较))
               (VP (VA 好)))(PU 。))

           the address "0:3" represents the top IP node and "1:2"
           represents the CP node.

       form 2: terminal number:height*terminal number:height*...

           A trace chain identifying coreference within sentence
           boundaries.
           For example, in the sentence

           (IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*))
               (CP (IP (NP-SBJ (-NONE- *T*-1))
               (VP (ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业)))
               (NP-ADV (NN 绝大部分))(NP-SBJ (NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较))
               (VP (VA 好)))(PU 。))

           the address "2:0*1:0*6:1" represents the fact that the nodes
           '2:0' (-NONE- *T*-1), '1:0' (-NONE- *OP*) and
           '6:1' (NP (NN 外商)(NN 投资)(NN 企业)) are coreferential.

       form 3: terminal number:height,terminal number:height,...

           This represents a collection of different pieces of one
           argument. This form is rarely used in the annotation of the
           verbs, since most discontinuous constituents have
           well-defined relations between their components. Therefore
           the components of a discontinuous constituent are assigned
           the same label with a secondary tag representing their
           semantic relations. For example, if a constituent is marked
           as ARG0-CRD, it means that there is another constituent
           having the same label and together they fill the ARG0 role
           of the verb.

    2) column for the 'label'

       The argument label is one of {rel, ARGM} + {ARG0, ARG1, ARG2, ...}.
       The numbered argument labels correspond to the argument labels in
       the frames files (see ./frames). ARGM is used for adjuncts of
       various sorts, and 'rel' refers to the surface string of the verb.

    3) column for 'functional tag' (optional for numbered arguments;
       required for ARGM)

       Functional tags for "split" numbered arguments:

           PSR - possessor
           PSE - possessee
           CRD - coordinator
           PRD - predicate
           QTY - quantity

       Prepositional tags for numbered arguments:

           AT, AS, INTO, TOWARDS, TO, ONTO

       Functional tags for ARGM:

           ADV - adverbial, default tag
           BNF - beneficiary
           CND - conditional
           DIR - directional
           DIS - discourse
           DGR - degree
           EXT - extent
           FRQ - frequency
           LOC - location
           MNR - manner
           NEG - negation**
           PRP - purpose and reason
           TMP - temporal
           TPC - topic

** Although we have "NEG" in our tagset to be compatible with the English
Proposition Bank, in this release we do not have proposition elements
that are labeled "NEG".
Negation markers in the Chinese Treebank are marked as adverbials with
the "ADV" tag based on their syntactic behavior, and this practice is
carried over to the Chinese Proposition Bank. Since negation markers form
a closed set, given the syntactic structures in the Treebank, they should
be easy to detect if this is desirable. In this release, the negation
markers include 不, 未, 从未, 没, 切勿, 别, 莫, 非, 否. We plan to mark
them explicitly with the functional tag "NEG" in future releases.

More information about this project can be found at
www.cis.upenn.edu/~chinese/cpb
--------------------------------------------------------------------------------------
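
Reading the format: an illustrative sketch.

The column layout and the terminal:height addresses described above can be
read with a few lines of Python. What follows is a minimal sketch, not part
of the release. The function names (parse_arglabel, parse_record, read_tree,
terminal_paths, resolve, words) are made up for this example, and the sample
record in LINE is invented: its file name and sentence number are
placeholders, and its argument addresses were worked out by hand from the
example tree for '发挥', not copied from cpb1.0.txt. The sketch assumes that
a height of 0 names the part-of-speech node directly above the word, as in
the '2:0' (-NONE- *T*-1) example.

```python
import re

# Example tree from the description above (terminals 0..14).
TREE = ("(IP (NP-SBJ (DNP (NP (NN 货币)(NN 回笼))(DEG 的))(NP (NN 增加)))(PU ,) "
        "(VP (PP-BNF (P 为)(IP (NP-SBJ (-NONE- *PRO*))(VP (VV 平抑)(NP-OBJ (NP (DP (DT 全)) "
        "(NP (NN 区)))(NP (NN 物价))))))(VP (VV 发挥)(AS 了)(NP-OBJ (NN 作用)))) (PU 。))")

# Invented record for illustration only; NOT copied from cpb1.0.txt.
LINE = "chtb_001.fid 0 11 gold 发挥.01 ----- 11:0-rel 0:3-ARG0 13:1-ARG1 5:1-ARGM-BNF"


def parse_arglabel(text):
    """Split 'address-label[-tag]' into (pieces, label, tag)."""
    address, label, *rest = text.split("-", 2)
    # ',' separates pieces of one argument; '*' links nodes of a trace chain.
    pieces = [[tuple(int(n) for n in node.split(":")) for node in piece.split("*")]
              for piece in address.split(",")]
    return pieces, label, (rest[0] if rest else None)


def parse_record(line):
    """Parse one space-separated line into a dict of its columns."""
    cols = line.split()
    return {"file": cols[0], "sentence": int(cols[1]), "terminal": int(cols[2]),
            "tagger": cols[3], "frameset": cols[4], "inflection": cols[5],
            "args": [parse_arglabel(c) for c in cols[6:]]}


def read_tree(text):
    """Read a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos + 1]              # tokens[pos] is "("
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # the word of a terminal
                pos += 1
        pos += 1                             # skip ")"
        return (label, children)

    return node()


def terminal_paths(tree):
    """For each terminal, left to right, list the nodes from the root down to it."""
    paths = []

    def walk(node, above):
        here = above + [node]
        if all(isinstance(c, str) for c in node[1]):
            paths.append(here)               # a terminal, empty elements included
        else:
            for child in node[1]:
                walk(child, here)

    walk(tree, [])
    return paths


def resolve(tree, terminal, height):
    """Return the node that is `height` levels above the given terminal."""
    return terminal_paths(tree)[terminal][-1 - height]


def words(node):
    """Collect the surface strings under a node."""
    out = []
    for child in node[1]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out


tree = read_tree(TREE)
record = parse_record(LINE)
for pieces, label, tag in record["args"]:
    for chain in pieces:
        for terminal, height in chain:
            node = resolve(tree, terminal, height)
            print(label, tag, node[0], "".join(words(node)))
```

In this sketch every arglabel becomes a list of pieces (form 3), and each
piece is a list of (terminal, height) pairs (form 2), so a plain form 1
address is simply one piece holding one pair. Note that resolve() walks up
from whatever terminal it is given and does not check that this terminal is
the first one the node spans; a stricter reader might add that check.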