File: README.2nd
-----------------

Marking Conventions for the BLLIP 1987-89 WSJ Corpus Release 1

All processing was done by machine and consists of (a) basic parsing, (b) grammatical/function-tag assignment, (c) full noun-phrase coreference identification, (d) pronoun reference identification, and (e) empty node insertion. When we talk about "parsing" we mean processes a, b, and e. When we talk about coreference we mean processes c and d.

Papers describing all programs except for (e) can be found in the papers subdirectory. The citation for the pronoun-coreference paper is

    ``A statistical approach to anaphora resolution''
    Niyu Ge, John Hale, and Eugene Charniak,
    Proceedings of the Sixth Workshop on Very Large Corpora (1998)

The papers on parsing and function tagging will appear (or by the time you read this, have appeared) in Proceedings of the North American Chapter of the ACL, 2000. For recent information and papers consult

    http://www.cog.brown.edu/Research/nlp/
and
    http://www.cs.brown.edu/people/ec/

The files processed are the text files for the Wall Street Journal from 1987 to 1989 provided in the ACL/DCI corpus. However, the files 109 to 145 for 1988 are not included. These files consist primarily of repeated material found in other files.

As currently configured there is one directory for each year. In the ACL/DCI release each of these directories has some number of text files. In our version each of these corresponds to a subdirectory under the year. Within each of these subdirectories the original text has been divided up into many files that generally should correspond to individual news articles. This is done by looking for delimiters. Only material delimited by was processed. In some cases the only material between delimiters was figures (i.e., there was no text enclosed by delimiters). This resulted in empty files, which have been removed.
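
The layout described above (year directory, then one subdirectory per original ACL/DCI text file, then one file per article) can be walked with a few lines of Python. This is only a sketch: the function name iter_article_files and the corpus root path are hypothetical, and nothing is assumed about how the directories or files are actually named beyond the three-level nesting.

```python
from pathlib import Path

def iter_article_files(corpus_root):
    """Yield (year, source_file, article_path) for each article file.

    Assumes only the nesting described in this README:
    <root>/<year>/<subdirectory per original text file>/<article file>.
    """
    root = Path(corpus_root)
    for year_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for sub_dir in sorted(p for p in year_dir.iterdir() if p.is_dir()):
            for article in sorted(p for p in sub_dir.iterdir() if p.is_file()):
                yield year_dir.name, sub_dir.name, article
```

Files that sit directly under the root or under a year directory are skipped, since the description places articles only at the third level.
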

All parsing is done using the Penn Treebank conventions, with the following exceptions:

  (a) Certain auxiliary verbs (e.g., "have", "been", etc.) are deterministically labeled AUX or AUXG (e.g., "having").
  (b) Root nodes are given the new non-terminal label S1 (as opposed to the empty string in the treebank).
  (c) Numbers attached to non-terminals to indicate coreference are preceded by "#" (as opposed to "-" in the treebank).
  (d) We have added two new grammatical function tags, PLE and DEI. See below.

To save on parsing time, any sentence longer than 70 tokens (counting both words and punctuation) was ignored. We expect future "releases" to increase the maximum length.

In about one news story in a thousand a parser error occurred, so the story got cut short. As we get more expert at this we expect this error rate to go down.

Only complete noun phrases, pronouns, and traces can be marked coreferent. This is typically done by adding a "#" followed by a number (unique within the article) to all occurrences that are deemed to be coreferent. (Traces are an exception; see below.) In most cases the # is added to the NP non-terminal label, e.g.,

    (NP-SUBJ#4 John Blaire & Co)

There are two exceptions. For possessive pronouns the number is attached to the PRP$ non-terminal, because the NP above the pronoun typically does not denote the same entity as the pronoun. For trace elements the number is attached to the trace marker, preceded by a "-" in keeping with treebank format (e.g., *T*-5).

We also mark two forms of non-coreferential pronouns: DEI (deictic) and PLE (pleonastic).

The pronouns that are marked with coreference indicators are I, you, he, she, it, we, they, and all variants (objective, possessive, and reflexive forms). Other words that are sometimes considered pronouns, such as "this", "that", etc., are not marked.

Treebank marks some constituents as NPs that in most other theories would be marked as NBAR or some such.
E.g., the NP "the dog" in

    (NP (NP the dog) (PP on the swing))

We do not mark such NPs with coreference indicators. Thus only the NP that dominates "the dog on the swing" could receive a coreference indicator.

When Treebank conventions for trace elements conflict with the above, we have typically followed the Treebank conventions. In particular, this means that WHNPs, WHADVPs, etc. are typically given numeric indicators, not the NPs they "modify". E.g.,

    (NP (NP the dog) (SBAR (WHNP#1 who) (S (NP (-NONE- *-1)) (VP is on the swing))))

Also, in cases where several traces all corefer to the same entity, Treebank gives them different numeric indicators, letting the coreference chain indicate that they are the same. We do likewise.
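
The marking conventions above can be read back out of a bracketed tree with a small amount of string handling. The following Python sketch is not part of the release; the names split_label, trace_index, and mentions_by_index are hypothetical, and it assumes labels and traces take exactly the forms shown in the examples in this file ("#" plus digits at the end of a non-terminal label, "-" plus digits at the end of a trace marker).

```python
import re
from collections import defaultdict

# Trace leaves such as "*T*-5" or "*-1": a star marker, then "-", then digits.
_TRACE_RE = re.compile(r"^\*(?:[A-Z]+\*)?-(\d+)$")

def split_label(label):
    """Split a non-terminal label into (category, function_tags, coref_index)."""
    base, sep, tail = label.rpartition("#")
    if sep and base and tail.isdigit():
        index = int(tail)
    else:
        base, index = label, None
    # Labels such as -NONE- are atomic; do not split them on "-".
    if base.startswith("-") and base.endswith("-"):
        return base, [], index
    category, *tags = base.split("-")
    return category, tags, index

def trace_index(token):
    """Return the number on a trace marker such as *T*-5, or None."""
    m = _TRACE_RE.match(token)
    return int(m.group(1)) if m else None

def mentions_by_index(tree_text):
    """Group indexed labels and traces in one bracketed tree by their number."""
    tokens = re.findall(r"\(|\)|[^\s()]+", tree_text)
    chains = defaultdict(list)
    prev = None
    for tok in tokens:
        if tok not in ("(", ")"):
            # A token right after "(" is a node label; anything else is a leaf.
            index = split_label(tok)[2] if prev == "(" else trace_index(tok)
            if index is not None:
                chains[index].append(tok)
        prev = tok
    return dict(chains)

tree = "(NP (NP the dog) (SBAR (WHNP#1 who) (S (NP (-NONE- *-1)) (VP is on the swing))))"
print(split_label("NP-SUBJ#4"))   # ('NP', ['SUBJ'], 4)
print(mentions_by_index(tree))    # {1: ['WHNP#1', '*-1']}
```

Grouping by number is only a literal reading of the indicators. As noted above, several traces that corefer to the same entity may carry different numbers, so recovering full coreference chains would require following the links between them rather than comparing numbers alone.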