Penn Treebank II guidelines ("Bracketing Guidelines for Treebank II Style,
Penn Treebank Project," Ann Bies, Mark Ferguson, Karen Katz, and Robert
MacIntyre. University of Pennsylvania Computer and Information Science
Department Technical Report MS-CIS-95-06, LINC LAB 281) were followed as
closely as possible, with the following changes:

1. tokenization of hyphenated items ("New York-based" has been replaced
   by "New York - based" for example)

2. the addition of the node label NML for sub-NP nominal constituents
   (replacing NX and most NP-internal NAC)

3. the use of the X node primarily to mark stray translation errors and
   tokens that could be removed from the sentence (e.g., extra determiners)

4. a more consistent use of PRN to mark parentheticals only without
   creating additional structure in the surrounding tree

Specific treebank guideline addenda are below (with thanks to Colin Warner
for collecting and writing up the changes).  Please see
/docs/pos-guidelines-addenda.txt for information on the tokenization and
tagging of hyphenated items.


*** NML ***

1. Nominal Subconstituents

NML is used to mark nominal subconstituents that do not follow our assumed
right-branching default structure:

(NP the (NML Hong Kong) economy)

(NP (NML high level) economic talks)

(NP the (ADJP (NML New York) - based) company)


Premodifiers of nouns and adjectives (updates to sections 11.1.1 and
11.1.2) 

a. Default constituency within nominals

We assume a default right-branching structure under any NP and NML
node. Each daughter of the phrase (whether single token or itself
constituent node) is assumed to have scope over everything to its
right. This means that every daughter also forms a constituent with
everything to its right. 

This default structure does not apply to NP or NML nodes that have
coordinated elements or an apposition structure as daughters, although it
can be applied separately to each of those coordinated or appositive
elements. This assumption makes the annotation process for multi-token
nominals less complex and the resulting trees more legible, but still
allows us to readily derive constituent nodes not explicitly
represented. For example, in 

(NP primary liver cancer)

we assume that "liver cancer" is a constituent, and that "primary" has
scope over it. 

Similarly, in

(NP a point mutation)

"a" has scope over the constituent "point mutation".

The ability to derive a constituent node for an NP minus its determiner may
be useful in aligning syntactic nodes with entities, which may not
consistently contain determiners. 
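
The right-branching default can be sketched in a few lines of Python. This is only an illustration, not part of the annotation tooling: the function name implicit_constituents is hypothetical, and the sketch assumes the daughters contain no coordination or apposition, which the default does not cover.

```python
def implicit_constituents(daughters):
    """Return the constituents implied by the right-branching default.

    `daughters` lists the daughters of a flat NP or NML node from left
    to right. Every daughter forms a constituent with everything to its
    right, so each suffix of two or more daughters is a constituent.
    Assumption: no coordinated or appositive daughters are present.
    """
    return [daughters[i:] for i in range(len(daughters) - 1)]


# The first suffix is the whole phrase; the rest are the derived nodes.
for words in implicit_constituents(["primary", "liver", "cancer"]):
    print(" ".join(words))
```

Dropping the first suffix of ["a", "point", "mutation"] leaves "point mutation", which is the NP minus its determiner.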

b. The NML node label: marking nominal modifiers

We use the NML node label to mark nominal subconstituents that do not
follow the default right-branching structure. Any two or more non-final
elements that form a constituent are bound together by NML.

(NP (NML Cytochrome P450) isoenzyme)
(NP selective (NML serotonin reuptake) inhibitor)
(NP (NML gel electrophoresis) analysis)
(NP (NML 5 nM) tetrachlorobiphenyl)
(NP (NML human liver tumor) analysis)

These nominal subconstituents can contain PPs:

(NP the
    (NML (NML guanine)
         (PP to (NP cytosine)))
    transformation)

(NP (NML (NML Secretary)
         (PP of (NP State)))
    James Baker)

Note that NML replaces the use of NAC in nominal modifiers as outlined in
section 11.1.2.

NML can also mark multi-token nominal elements modifying an adjective:

(NP the (ADJP (NML New York) - based) company)
(NP (ADJP (NML Hoe 234) induced) relaxation)
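
As a rough illustration of how explicit NML nodes combine with the right-branching default, the following Python sketch reads a bracketed string and lists both the explicit nodes and the implied ones. The names parse, leaves and constituents are hypothetical. The sketch assumes that parentheses occur only as brackets (literal parenthesis tokens would have to be escaped first) and that the tree contains no coordination or apposition.

```python
import re


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if pos + 1 >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' followed by a label")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while True:
            if pos >= len(tokens):
                raise ValueError("unbalanced brackets")
            if tokens[pos] == ")":
                pos += 1
                return (label, children)
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def constituents(node):
    """List (label, words) for explicit nodes and ("implied", words)
    for the suffixes implied by right-branching under NP and NML."""
    label, children = node
    found = [(label, leaves(node))]
    if label in ("NP", "NML"):
        for i in range(1, len(children) - 1):
            words = [w for child in children[i:] for w in leaves(child)]
            found.append(("implied", words))
    for child in children:
        if not isinstance(child, str):
            found.extend(constituents(child))
    return found


for label, words in constituents(parse("(NP (NML human liver tumor) analysis)")):
    print(label, " ".join(words))
```

Here the NML keeps "human liver tumor" together as a modifier of "analysis", while "liver tumor" is still recoverable inside the NML by the default.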

c. Head derivation in NP and NML:

The head of an NP or NML is either the rightmost noun (NN or NNS), or is
contained within the rightmost NP or NML node. Recursive applications of
this rule can yield the head of phrases containing nested NP or NML
nodes. These rules do not apply to coordinated or appositive structures,
which have multiple heads, although they can be applied to determine the
heads of the individual elements of those structures. 

This process of head derivation will not return a head for structures such
as (NP the rich) or (NP 12). We can assume either that the NP has a null
head, that it's headless, or that JJx or CD can serve as head when a
nominal head cannot be derived. 
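
The head rule can be sketched as follows. This is one possible reading, not a definitive implementation: the sketch scans the daughters from right to left and stops at the first NN/NNS token or NP/NML node it meets, which is our assumption about how the two clauses of the rule interact. The function names and the POS-tagged tuple representation are hypothetical, and coordinated or appositive structures are not handled. The fallback implements only the third of the options above (JJx or CD as head).

```python
NOUN_TAGS = {"NN", "NNS"}


def is_leaf(node):
    """A leaf is (tag, word); a phrase is (label, [children])."""
    return isinstance(node[1], str)


def find_head(node):
    """Return the (tag, word) head of an NP or NML, or None.

    Assumption: whichever comes first from the right wins, either a
    noun token (NN or NNS) or a nested NP/NML searched recursively.
    """
    for child in reversed(node[1]):
        if is_leaf(child):
            if child[0] in NOUN_TAGS:
                return child
        elif child[0] in ("NP", "NML"):
            return find_head(child)
    return None


def find_head_with_fallback(node):
    """Fall back to the rightmost JJ* or CD token if no nominal head."""
    head = find_head(node)
    if head is not None:
        return head
    for child in reversed(node[1]):
        if is_leaf(child) and (child[0].startswith("JJ") or child[0] == "CD"):
            return child
    return None


economy = ("NP", [("DT", "the"),
                  ("NML", [("NNP", "Hong"), ("NNP", "Kong")]),
                  ("NN", "economy")])
print(find_head(economy))
print(find_head_with_fallback(("NP", [("DT", "the"), ("JJ", "rich")])))
```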


2. Coordinated Premodifiers

Coordinated premodifiers form a constituent node, typically ADJP, UCP or
NML. Following standard policy for coordination, the individual
coordinated elements only receive syntactic nodes if one or more of them
is multi-token.

(NP this
    (NML (NML large scale) and (NML high level))
    international convention)
(NP (UCP (JJ domestic) and (RB overseas)) markets)
(NP (UCP (ADJP scientific) and (ADJP technological) and
         (NML software development))
    companies)
(NP the (NML (NML Red Cross) and (NML Red Crescent)) movement)
(NP his (ADJP energetic and powerful) performance)
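
The policy on when coordinated elements receive their own nodes can be sketched as below. The function name bracket_coordination is hypothetical. The sketch assumes that each conjunct already comes with a phrasal category, and that the outer label is UCP whenever the categories differ; that choice is consistent with the examples above but is our assumption, not a stated rule.

```python
def bracket_coordination(conjuncts, conjunction="and"):
    """Bracket a coordinated premodifier.

    `conjuncts` is a list of (label, tokens) pairs. Individual conjuncts
    receive their own nodes only if at least one of them is multi-token.
    Assumption: the outer label is the shared category, or UCP when the
    conjuncts belong to different categories.
    """
    labels = {label for label, _ in conjuncts}
    outer = labels.pop() if len(labels) == 1 else "UCP"
    any_multi_token = any(len(tokens) > 1 for _, tokens in conjuncts)
    parts = []
    for label, tokens in conjuncts:
        text = " ".join(tokens)
        parts.append(f"({label} {text})" if any_multi_token else text)
    joiner = f" {conjunction} "
    return f"({outer} {joiner.join(parts)})"


print(bracket_coordination([("NML", ["large", "scale"]),
                            ("NML", ["high", "level"])]))
print(bracket_coordination([("ADJP", ["energetic"]),
                            ("ADJP", ["powerful"])]))
```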


3. Coordinated Heads with Shared Premodifiers

In a coordinated NP, any modifiers that are left flat are assumed to be
shared across all the heads:

(NP China macroscopic economic readjustment and control)
(NP (NP China's) international income and expenses)

Unshared modifiers in a coordinated structure must form a constituent node
with the head they are modifying, as in standard NP coordination:

(NP (NP material civilization) and (NP spiritual culture))

When unshared and shared modifiers are combined, the above structure is
preserved (using the NML node label) for the unshared components. That is,
each of the coordinated elements is marked as a constituent and the
elements together form a constituent. Any shared modifiers are left as
sisters to the coordinated structure:

(NP socialist
    (NML (NML material civilization) and (NML spiritual culture)))
(NP (NML New Zealand) (NML (NML industries) and (NML business circles)))

This means that occasionally we have to show the scope of determiners:

(NP the (NML (NML international income) and (NML domestic expenses)))
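
To make the scope reading concrete, the following sketch spells out which modifier-head combinations a coordinated NP stands for. Both function names and the CONJUNCTIONS set are hypothetical. expand_flat assumes a fully flat NP, where every head is a single token; expand_nested takes the shared modifiers and the NML conjuncts as separate lists.

```python
CONJUNCTIONS = {"and", "or"}  # assumed set; adjust for the corpus


def expand_flat(daughters):
    """Expand a flat coordinated NP into one reading per head.

    Everything to the left of the first head is shared across all
    heads. Assumption: each head is a single token, since unshared
    modifiers would require explicit constituent nodes.
    """
    positions = [i for i, tok in enumerate(daughters) if tok in CONJUNCTIONS]
    if not positions or positions[0] == 0:
        raise ValueError("no coordination found")
    first_head = positions[0] - 1
    shared = daughters[:first_head]
    heads = [tok for tok in daughters[first_head:] if tok not in CONJUNCTIONS]
    return [shared + [head] for head in heads]


def expand_nested(shared, conjuncts):
    """Distribute shared modifiers over explicitly bracketed conjuncts."""
    return [shared + conjunct for conjunct in conjuncts]


for reading in expand_flat(["China's", "international", "income",
                            "and", "expenses"]):
    print(" ".join(reading))
for reading in expand_nested(["socialist"],
                             [["material", "civilization"],
                              ["spiritual", "culture"]]):
    print(" ".join(reading))
```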


*** PRN ***

Parentheticals are meant to be treated as though they are completely
separate from the main text. They do not have any syntactic relationship
to the rest of the sentence in which they fall. So they should have all
the internal structure they can, but should not contribute any extra
structure to the main S (i.e., they shouldn't be adjoined). They should be
put at whichever level of the tree they most seem to modify.

For example:

(NP 2 cases (PRN ( (NP 67%) )))
(NP (NP 2) (PP of (NP 3 cases)) (PRN ( (NP 67%) )))

but NOT:
(NP (NP 2 cases) (PRN ( (NP 67%) ) ))


Note that if a PRN modifies a single token, we do add extra structure to
the sentence so that it can be attached at the right level. (This isn't
really adding structure to the sentence; it's just a consequence of the
fact that we can't hang things off POS tags.)

(NP Kyoto and Tokyo)
versus
(NP (NP Kyoto (PRN ( (NP South Ward) )))
    and
    (NP Tokyo (PRN ( (NP North Ward) ))))

If there are multiple unrelated items inside a parenthetical, they do not
need to be bound together by FRAG. That is, a PRN can have multiple
daughters. If there are two items that have some relationship that is not
expressible with normal syntactic nodes, they should be bound together
with FRAG.

Bare VPs inside a PRN should not receive a null subject:
"(expressing X)"
(PRN ( (VP expressing (NP X)) ))
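
A rough consistency check for adjoined parentheticals might look like the sketch below. It is a heuristic under our own assumptions, not an official validation rule: a node is flagged when it has a PRN daughter and its only other daughter is a phrasal node with the same label, which is the adjunction pattern shown under "but NOT" above. The function name and the nested-tuple tree representation are hypothetical; parenthesis tokens are plain strings.

```python
def find_adjoined_prn(node):
    """Return the nodes where a PRN appears to be adjoined.

    A node is (label, children); tokens are plain strings.
    Heuristic (assumption): flag a node with a PRN daughter whose only
    other daughter is a phrasal node carrying the same label.
    """
    label, children = node
    phrasal = [c for c in children if not isinstance(c, str)]
    others = [c for c in children if isinstance(c, str) or c[0] != "PRN"]
    has_prn = any(c[0] == "PRN" for c in phrasal)
    flagged = []
    if (has_prn and len(others) == 1
            and not isinstance(others[0], str)
            and others[0][0] == label):
        flagged.append(node)
    for child in phrasal:
        flagged.extend(find_adjoined_prn(child))
    return flagged


prn = ("PRN", ["(", ("NP", ["67%"]), ")"])
flat = ("NP", ["2", "cases", prn])
adjoined = ("NP", [("NP", ["2", "cases"]), prn])
print(len(find_adjoined_prn(flat)), len(find_adjoined_prn(adjoined)))
```

The single-token case (an NP built around "Kyoto" so that the PRN can attach) is not flagged, because the PRN's only sister there is a token rather than a phrasal node.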


*** The use of X ***

We're using X to mark off parts of the corpus that aren't syntactically
analyzable, that is, extra tokens that clearly shouldn't be there and
don't fit into the rest of the sentence.


-----------------------------
Ann Bies 
bies@ldc.upenn.edu
Linguistic Data Consortium
October 4, 2004
-----------------------------