Penn Treebank II guidelines ("Bracketing Guidelines for Treebank II Style, Penn Treebank Project," Ann Bies, Mark Ferguson, Karen Katz, and Robert MacIntyre. University of Pennsylvania Computer and Information Science Department Technical Report MS-CIS-95-06, LINC LAB 281) were followed as closely as possible, with the following changes:

1. tokenization of hyphenated items ("New York-based" has been replaced by "New York - based" for example)
2. the addition of the node label NML for sub-NP nominal constituents (replacing NX and most NP-internal NAC)
3. the use of the X node primarily to mark stray translation errors and tokens that could be removed from the sentence (extra determiners, e.g.)
4. a more consistent use of PRN to mark parentheticals only without creating additional structure in the surrounding tree

Specific treebank guideline addenda are below (with thanks to Colin Warner for collecting and writing up the changes). Please see /docs/pos-guidelines-addenda.txt for information on the tokenization and tagging of hyphenated items.

*** NML ***

1. Nominal Subconstituents

NML is used to mark nominal subconstituents that do not follow our assumed right-branching default structure:

    (NP the (NML Hong Kong) economy)
    (NP (NML high level) economic talks)
    (NP the (ADJP (NML New York) based) company)

Premodifiers of nouns and adjectives (updates to sections 11.1.1 and 11.1.2)

a. Default constituency within nominals

We assume a default right-branching structure under any NP and NML node. Each daughter of the phrase (whether single token or itself constituent node) is assumed to have scope over everything to its right. This means that every daughter also forms a constituent with everything to its right. This default structure does not apply to NP or NML nodes that have coordinated elements or an apposition structure as daughters, although it can be applied separately to each of those coordinated or appositive elements.
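
The default right-branching assumption can be sketched in a few lines of Python. This is only an illustration, not part of the annotation tooling: the function name implicit_constituents and the representation of daughters as a list of strings (with an already-bracketed daughter kept as one opaque string) are assumptions made for the example. Coordinated and appositive phrases are out of scope here, as in the text above.

```python
def implicit_constituents(daughters):
    """List the constituents implied, but not bracketed, under a flat NP or NML.

    `daughters` holds the daughters of one NP/NML node from left to right.
    Each daughter is a single token or an already-bracketed node, kept here
    as an opaque string (an assumption made for this sketch).
    """
    # Every daughter forms a constituent with everything to its right, so
    # each proper suffix of two or more daughters is an implied constituent.
    return [daughters[i:] for i in range(1, len(daughters) - 1)]


print(implicit_constituents(["the", "(NML Hong Kong)", "economy"]))
# [['(NML Hong Kong)', 'economy']]
print(implicit_constituents(["(NML high level)", "economic", "talks"]))
# [['economic', 'talks']]
```

The full phrase itself is already bracketed and the final daughter on its own is trivial, so only the suffixes in between are reported.
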
This assumption makes the annotation process for multi-token nominals less complex and the resulting trees more legible, but still allows us to readily derive constituent nodes not explicitly represented. For example, in

    (NP primary liver cancer)

we assume that "liver cancer" is a constituent, and that "primary" has scope over it. Similarly, in

    (NP a point mutation)

"a" has scope over the constituent "point mutation".

The ability to derive a constituent node for an NP minus its determiner may be useful in aligning syntactic nodes with entities, which may not consistently contain determiners.

b. The NML node label: marking nominal modifiers

We use the NML node label to mark nominal subconstituents that do not follow the default right-branching structure. Any two or more non-final elements that form a constituent are bound together by NML.

    (NP (NML Cytochrome P450) isoenzyme)
    (NP selective (NML serotonin reuptake) inhibitor)
    (NP (NML gel electrophoresis) analysis)
    (NP (NML 5 nM) tetrachlorobiphenyl)
    (NP (NML human liver tumor) analysis)

These nominal subconstituents can contain PPs:

    (NP the (NML (NML guanine) (PP to (NP cytosine))) transformation)
    (NP (NML (NML Secretary) (PP of (NP State))) James Baker)

Note that NML replaces the use of NAC in nominal modifiers as outlined in section 11.1.2.

NML can also mark multi-token nominal elements modifying an adjective:

    (NP the (ADJP (NML New York) - based) company)
    (NP (ADJP (NML Hoe 234) induced) relaxation)

c. Head derivation in NP and NML

The head of an NP or NML is either the rightmost noun (NN or NNS), or is contained within the rightmost NP or NML node. Recursive applications of this rule can yield the head of phrases containing nested NP or NML nodes. These rules do not apply to coordinated or appositive structures, which have multiple heads, although they can be applied to determine the heads of the individual elements of those structures.
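
One possible reading of this head rule is sketched below in Python. Everything about the representation is an assumption made for illustration: a phrase is a (label, [children]) pair, a token is a (tag, word) pair, and the POS tags on the example are illustrative rather than taken from the corpus. The sketch scans daughters from right to left and stops at the first NN/NNS token or NP/NML node; how to rank a noun against an NP/NML node to its right is not spelled out in the rule, so that ordering is also an assumption. Coordinated and appositive structures are not handled.

```python
NOUN_TAGS = {"NN", "NNS"}        # the noun tags named in the rule
PHRASE_LABELS = {"NP", "NML"}    # nodes the rule recurses into


def find_head(node):
    """Return the head word of an NP/NML node, or None if none is found.

    A phrase is (label, [children]); a token is (tag, word).
    """
    _, children = node
    for child in reversed(children):          # scan right to left
        child_label, payload = child
        if isinstance(payload, str):          # a token: (tag, word)
            if child_label in NOUN_TAGS:
                return payload
        elif child_label in PHRASE_LABELS:    # a nested NP or NML
            return find_head(child)
        # Other phrases (PP, ADJP, ...) and non-noun tokens are skipped.
    return None


# (NP the (NML (NML guanine) (PP to (NP cytosine))) transformation)
inner_nml = ("NML", [
    ("NML", [("NN", "guanine")]),
    ("PP", [("IN", "to"), ("NP", [("NN", "cytosine")])]),
])
tree = ("NP", [("DT", "the"), inner_nml, ("NN", "transformation")])

print(find_head(tree))       # transformation
print(find_head(inner_nml))  # guanine
```
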
This process of head derivation will not return a head for structures such as (NP the rich) or (NP 12). We can assume that the NP has a null head, that it is headless, or that JJx or CD can serve as head when a nominal head cannot be derived.

2. Coordinated Premodifiers

Coordinated premodifiers form a constituent node, typically ADJP, UCP or NML. Following standard policy for coordination, the individual coordinated elements only receive syntactic nodes if one or more of them is multi-token.

    (NP this (NML (NML large scale) and (NML high level)) international convention)
    (NP (UCP JJdomestic and RBoverseas) markets)
    (NP (UCP (ADJP scientific) and (ADJP technological) and (NML software development)) companies)
    (NP the (NML (NML Red Cross) and (NML Red Crescent)) movement)
    (NP his (ADJP energetic and powerful) performance)

3. Coordinated Heads with Shared Premodifiers

In a coordinated NP, any modifiers that are left flat are assumed to be shared across all the heads:

    (NP China macroscopic economic readjustment and control)
    (NP (NP China's) international income and expenses)

Unshared modifiers in a coordinated structure must form a constituent node with the head they are modifying, as in standard NP coordination:

    (NP (NP material civilization) and (NP spiritual culture))

When unshared and shared modifiers are combined, the above structure is preserved (using the NML node label) for the unshared components. That is, each of the coordinated elements is marked as a constituent and the elements together form a constituent.
Any shared modifiers are left as sisters to the coordinated structure:

    (NP socialist (NML (NML material civilization) and (NML spiritual culture)))
    (NP (NML New Zealand) (NML (NML industries) and (NML business circles)))

This means that occasionally we have to show the scope of determiners:

    (NP the (NML (NML international income) and (NML domestic expenses)))

*** PRN ***

Parentheticals are meant to be treated as though they are completely separate from the main text. They do not have any syntactic relationship to the rest of the sentence in which they fall. So they should have all the internal structure they can, but should not contribute any extra structure to the main S (i.e., they shouldn't be adjoined). They should be put at whichever level of the tree they most seem to modify. For example:

    (NP 2 cases (PRN ( (NP 67%) )))
    (NP (NP 2) (PP of (NP 3 cases)) (PRN ( (NP 67%) )))

but NOT:

    (NP (NP 2 cases) (PRN ( (NP 67%) ) ))

Note that if a PRN modifies a single token, we do add extra structure to the sentence so it can be attached at the right level. (But this isn't really adding structure to the sentence, it's just a consequence of the fact that we can't hang things off POS tags.)

    (NP Kyoto and Tokyo)

versus

    (NP (NP Kyoto (PRN ( (NP South Ward) ))) and (NP Tokyo (PRN ( (NP North Ward) ))))

If there are multiple unrelated items inside a parenthetical, they do not need to be bound together by FRAG. That is, a PRN can have multiple daughters. If there are two items that have some relationship that is not expressible with normal syntactic nodes, they should be bound together with FRAG.

Bare VPs inside a PRN should not receive a null subject:

    "(expressing X)"
    (PRN ( (VP expressing (NP X)) ))

*** The use of X ***

We're using X to mark off parts of the corpus that aren't syntactically analyzable. That is, extra tokens that clearly shouldn't be there and don't fit into the rest of the sentence.
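
As a rough illustration of how a consumer of the treebank might use the X label, the Python sketch below reads off the tokens of a tree while skipping everything under an X node. It is not part of the guidelines or of any released tool: the function name tokens_without_x, the (label, [children]) tree representation with plain strings as tokens, and the example tree with a stray extra determiner are all hypothetical.

```python
def tokens_without_x(node):
    """Return the tokens under `node` in order, dropping any X subtree.

    A phrase is (label, [children]); a token is a plain string. Both are
    assumptions made for this sketch.
    """
    if isinstance(node, str):      # a single token
        return [node]
    label, children = node
    if label == "X":               # unanalyzable material: skip it entirely
        return []
    tokens = []
    for child in children:
        tokens.extend(tokens_without_x(child))
    return tokens


# Hypothetical tree with an extra determiner marked by X:
# (NP (X the) the (NML Hong Kong) economy)
tree = ("NP", [("X", ["the"]), "the", ("NML", ["Hong", "Kong"]), "economy"])

print(tokens_without_x(tree))
# ['the', 'Hong', 'Kong', 'economy']
```

Whether X material should actually be dropped depends on the application; the sketch only shows that the label makes such tokens easy to isolate.
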
-----------------------------
Ann Bies
bies@ldc.upenn.edu
Linguistic Data Consortium
October 4, 2004
-----------------------------