File: README.2nd
-----------------
Marking Conventions for the BLLIP 1987-89 WSJ Corpus Release 1
All processing was done by machine and consists of (a) basic parsing,
(b) grammatical/function-tag assignment, (c) full noun-phrase coreference
identification, (d) pronoun reference identification, and (e) empty
node insertion. When we talk about "parsing" we mean processes a, b,
and e. When we talk about coreference we mean processes c and d.
Papers describing all programs except for (e) can be found
in the papers subdirectory. The citation for the pronoun-coreference
paper is: Niyu Ge, John Hale, and Eugene Charniak,
``A statistical approach to anaphora resolution'',
Proceedings of the Sixth Workshop on Very Large Corpora (1998).
The papers on parsing and function tagging will appear (or, by the time
you read this, have appeared) in Proceedings of the North American
Chapter of the ACL, 2000. For recent information and papers consult
http://www.cog.brown.edu/Research/nlp/ and
http://www.cs.brown.edu/people/ec/
The files processed are the text files for the Wall Street Journal from
1987 to 1989 provided in the ACL/DCI corpus. However, files 109 to 145
for 1988 are not included. These files consist primarily of repeated
material found in other files.
As currently configured there is one directory for each year. In the
ACL/DCI release each of these directories has some number of text
files. In our version each of these corresponds to a subdirectory
under the year. Within each of these subdirectories the original text
has been divided up into many files that generally should correspond
to individual news articles. This is done by looking for
delimiters.
Only material delimited by was processed. In some cases the
only material between delimiters was figures (i.e., there
was no text enclosed by delimiters). This resulted in empty files,
which have been removed.
All parsing is done using the Penn Treebank conventions with the
following exceptions:
(a) Certain auxiliary verbs (e.g., "have", "been", etc.) are
    deterministically labeled AUX or AUXG (e.g., "having").
(b) Root nodes are given the new non-terminal label S1 (as
    opposed to the empty string in the treebank).
(c) Numbers attached to non-terminals to indicate coreference
    are preceded by "#" (as opposed to "-" in the treebank).
(d) We have added two new grammatical function tags, PLE
    and DEI. See below.
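As a rough illustration of the label conventions above, the following
Python sketch splits a non-terminal label into its category, its
function tags, and its coreference number. The function name
split_label is hypothetical and not part of this release, and the
sketch assumes that function tags are joined to the category with "-"
as in the treebank.

```python
def split_label(label):
    """Split a label such as "NP-SUBJ#4" into (category, tags, index).

    Assumption: labels look like CATEGORY[-TAG...][#N]. Labels that
    start with "-" (e.g. "-NONE-") are returned whole as the category.
    """
    body, sep, tail = label.rpartition("#")
    if sep and body and tail.isdigit():
        index = int(tail)
    else:
        # No "#N" suffix (this also covers the bare "#" part-of-speech tag).
        body, index = label, None
    if body.startswith("-"):
        return body, [], index
    category, *tags = body.split("-")
    return category, tags, index


print(split_label("NP-SUBJ#4"))
print(split_label("S1"))
```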
To save on parsing time, any sentence longer than 70 tokens (words
and punctuation marks) was ignored. We expect future "releases" to
increase the maximum length.
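The length cutoff can be sketched as follows. This is only an
illustration: it assumes each sentence is already a list of tokens in
which every word and every punctuation mark counts as one token, and
the names MAX_SENTENCE_TOKENS and split_by_length are hypothetical.

```python
MAX_SENTENCE_TOKENS = 70  # cutoff stated above for this release


def split_by_length(sentences, max_tokens=MAX_SENTENCE_TOKENS):
    """Return (kept, ignored) lists of pre-tokenized sentences."""
    kept, ignored = [], []
    for tokens in sentences:
        # A sentence is ignored only if it is strictly longer than the cutoff.
        (kept if len(tokens) <= max_tokens else ignored).append(tokens)
    return kept, ignored
```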
In about one news story in a thousand there was some parser error, so
the story got cut short. When we get more expert at this we expect
this error rate to go down.
Only complete noun-phrases, pronouns, and traces can be marked
coreferent. This is typically done by adding a "#" followed by a
number (unique within the article) to all occurrences that are deemed
to be coreferent. (Traces are an exception; see below.) In most
cases the # is added to the NP non-terminal label, e.g., (NP-SUBJ#4
John Blaire & Co). There are two exceptions. For possessive pronouns
the number is attached to the PRP$ non-terminal, because the NP above
the pronoun typically does not denote the same entity as the pronoun.
For trace elements the number is attached to the trace marker,
preceded by a "-" in keeping with treebank format (e.g., *T*-5). We
also mark two forms of non-coreferential pronouns: DEI (deictic), and
PLE (pleonastic).
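To make the "#" convention concrete, here is a small Python sketch
that reads one bracketed tree and groups the labels carrying a "#"
number into chains. It is not one of the programs used to build the
corpus; the function names and the example sentence are invented for
illustration, and trace markers are not handled here.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse(text):
    """Parse one bracketed tree into (label, children); leaves are strings."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def coref_chains(tree, chains=None):
    """Map each "#" number to the (label, text) pairs that carry it."""
    chains = defaultdict(list) if chains is None else chains
    label, children = tree
    body, sep, tail = label.rpartition("#")
    if sep and body and tail.isdigit():
        chains[int(tail)].append((body, " ".join(leaves(tree))))
    for child in children:
        if isinstance(child, tuple):
            coref_chains(child, chains)
    return chains


# Invented example, not taken from the corpus.
example = "(S1 (S (NP#1 (NNP John)) (VP (VBD lost) (NP (PRP$#1 his) (NN hat))) (. .)))"
print(dict(coref_chains(parse(example))))
```

In the invented example the possessive pronoun carries the number on
PRP$ rather than on the NP above it, as described above.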
The pronouns that are marked with coreference indicators are I, you, he,
she, it, we, they, and all variants (objective, possessive, and
reflexive forms). Other words that are sometimes considered
pronouns, such as "this", "that", etc., are not marked.
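For readers who want to filter on these pronouns, the set can be
written out as below. The enumeration of variant forms is our reading
of the paragraph above, not a list taken from the programs
themselves, and the names MARKED_PRONOUNS and is_marked_pronoun are
hypothetical.

```python
# Assumed expansion of "I, you, he, she, it, we, they, and all variants".
MARKED_PRONOUNS = frozenset("""
    i me my mine myself
    you your yours yourself yourselves
    he him his himself
    she her hers herself
    it its itself
    we us our ours ourselves
    they them their theirs themselves
""".split())


def is_marked_pronoun(word):
    """True if the word is one of the pronoun forms listed above."""
    return word.lower() in MARKED_PRONOUNS
```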
Treebank marks some constituents as NPs that in most other theories
would be marked as NBAR or some such. E.g., the NP "the dog" in
    (NP (NP the dog)
        (PP on the swing))
We do not mark such NPs with coreference indicators. Thus only
the NP that dominates "the dog on the swing" could receive a coreference
indicator.
When Treebank conventions for trace elements conflict with the above,
we have typically followed the treebank conventions. In particular, this
means WHNPs, WHADVPs, etc. are typically given numeric indicators, not
the NPs they "modify". E.g.,
    (NP (NP the dog)
        (SBAR (WHNP#1 who)
              (S (NP (-NONE- *-1)) (VP is on the swing))))
Also, in cases where several traces all corefer to the same entity,
Treebank gives them different numeric indicators, letting
the coreference chain indicate that they are the same. We do
likewise.
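The two index notations ("#N" on labels, "-N" on trace markers) can be
collected side by side with a sketch like the following. It works on
the raw bracketed text with regular expressions rather than a full
tree reader, and it assumes trace markers appear directly under a
-NONE- node in the forms shown in this file (such as *-1 and *T*-5).
The name index_mentions is hypothetical.

```python
import re
from collections import defaultdict

# "(LABEL#N " : a non-terminal label carrying a coreference number.
LABEL_INDEX_RE = re.compile(r"\(([^\s()]+)#(\d+)(?=\s)")
# "(-NONE- *-N)" or "(-NONE- *T*-N)" : an indexed trace marker.
TRACE_RE = re.compile(r"\(-NONE-\s+(\*(?:[A-Z]+\*)?)-(\d+)\)")


def index_mentions(tree_text):
    """Map each number to the labels and trace markers that carry it."""
    mentions = defaultdict(list)
    for m in LABEL_INDEX_RE.finditer(tree_text):
        mentions[int(m.group(2))].append(("label", m.group(1)))
    for m in TRACE_RE.finditer(tree_text):
        mentions[int(m.group(2))].append(("trace", m.group(1)))
    return dict(mentions)


example = """(NP (NP the dog)
    (SBAR (WHNP#1 who)
          (S (NP (-NONE- *-1)) (VP is on the swing))))"""
print(index_mentions(example))
```

Because several traces of one entity may carry different numbers,
a sketch like this only groups items that share a number; joining
such groups into one entity would still require following the
coreference chain.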