README - ARRAU 2

This is the second release of the ARRAU corpus of anaphoric information. ARRAU was created to support both the development and testing of anaphora resolution systems and empirical investigations of anaphora. In addition to standard news text (all the WSJ articles included in the RST Discourse Treebank), it therefore includes text from very different genres:

- task-oriented dialogue (the entire collection of transcripts from the TRAINS 91 and 93 corpora)
- spoken narratives (the Pear Stories collected by Wallace Chafe).

ANNOTATION GUIDELINES

These texts have been annotated according to the ARRAU guidelines. All NPs are treated as markables, but their different semantic roles are recognized by distinguishing between referring expressions (which update or refer to a discourse model) and non-referring ones (including expletives, predicative expressions, quantifiers, and coordination). Full NPs are marked, i.e., all modifiers are included. However, a MIN-ID (the head of the NP) is also marked, as in MUC, for evaluations that accept partial credit.

Both identity and associative anaphora to entities are annotated, as well as discourse deixis. Coders had the option to mark ambiguity in anaphoric relations.

A variety of linguistic features were also annotated, including

- morphosyntactic agreement
- grammatical function
- semantic type (PERSON / ANIMATE / CONCRETE / ACTION / TIME / OTHER ABSTRACT / etc.)
- genericity

MARKUP LANGUAGE

The annotation was carried out using the MMAX2 annotation tool, freely available from Sourceforge at

http://mmax2.sourceforge.net/

Each sub-corpus directory contains at least one subdirectory:

- MMAX, with the files in MMAX format (the RST subdirectory contains three subdirectories, dev, test and train, each of which contains MMAX and RAW subdirectories; the VPC subdirectory contains a test and a train subdirectory).

The MMAX format is a form of token standoff representation of the annotations, whereby for each annotated document (say, Penn Treebank's file 0681) there are normally

- a .mmax file (wsjarrau_0681.mmax in this case) providing information about which files contain the relevant annotation;
- a .header file (wsjarrau_0681.header) providing information about the original document and a history of the annotation in TEI header format;
- a base file _words.xml in a subdirectory called Basedata (Basedata/wsjarrau_0681_words.xml) containing all the tokens of the original document;
- a number of 'level files', one for each level of annotation, in another subdirectory called markables.

The anaphoric annotation itself is stored in a file at the phrase level:

markables/wsjarrau_0681_phrase_level.xml

The phrase level contains the annotation as specified in the annotation manuals, i.e., with discontinuous markables, ambiguity, etc. See below for the other levels.

The files in MMAX format have been organized so that they can easily be visualized using the MMAX2 tool, or used directly as input / output for the BART toolkit:

http://www.bart-anaphora.org

To this end, all files have been automatically preprocessed to contain all the information required by BART, including POS information, chunking, parsing, etc., as well as a coref level containing a simplified representation of the anaphoric information without discontinuous markables and ambiguity.
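
The token standoff layout described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the release. The two path patterns come from this README, but the XML element and attribute names (word, markable, id, span), the span syntax with ".." for ranges and "," for discontinuous parts, and the namespace string are assumptions based on common MMAX2 usage and should be checked against the actual files. The inline sample data and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tiny inline stand-ins for Basedata/<doc>_words.xml and
# markables/<doc>_phrase_level.xml. Element names, attribute names and the
# namespace are ASSUMPTIONS based on common MMAX2 usage, not on this README.
WORDS_XML = """<words>
<word id="word_1">The</word>
<word id="word_2">company</word>
<word id="word_3">said</word>
<word id="word_4">it</word>
<word id="word_5">agreed</word>
</words>"""

MARKABLES_XML = """<markables xmlns="www.eml.org/NameSpaces/phrase">
<markable id="markable_1" span="word_1..word_2" mmax_level="phrase"/>
<markable id="markable_2" span="word_4" mmax_level="phrase"/>
<markable id="markable_3" span="word_1,word_4..word_5" mmax_level="phrase"/>
</markables>"""


def level_paths(doc_id):
    """Return the base-data and phrase-level paths for one document id."""
    return (f"Basedata/{doc_id}_words.xml",
            f"markables/{doc_id}_phrase_level.xml")


def load_tokens(words_xml):
    """Map each token id (e.g. 'word_3') to its text."""
    root = ET.fromstring(words_xml)
    return {w.get("id"): (w.text or "") for w in root.iter("word")}


def expand_span(span):
    """Expand 'word_1..word_3,word_5' into a list of token ids."""
    ids = []
    for part in span.split(","):
        if ".." in part:
            start, end = part.split("..")
            prefix = start.rsplit("_", 1)[0]
            first = int(start.rsplit("_", 1)[1])
            last = int(end.rsplit("_", 1)[1])
            ids.extend(f"{prefix}_{i}" for i in range(first, last + 1))
        else:
            ids.append(part)
    return ids


def load_markables(markables_xml, tokens):
    """Return {markable id: text}, resolving each span against the tokens."""
    root = ET.fromstring(markables_xml)
    result = {}
    for elem in root.iter():
        # Strip the XML namespace prefix, which may differ between levels.
        if elem.tag.rsplit("}", 1)[-1] != "markable":
            continue
        words = [tokens[i] for i in expand_span(elem.get("span"))]
        result[elem.get("id")] = " ".join(words)
    return result


tokens = load_tokens(WORDS_XML)
markables = load_markables(MARKABLES_XML, tokens)
```

To use the sketch on real data, read the two files returned by level_paths("wsjarrau_0681") from inside an MMAX directory and pass their contents to load_tokens and load_markables. Keeping the tokens in a separate dictionary mirrors the standoff design: the level files never repeat the text, they only point at token ids.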

The complete list of level files for file 0681, for instance, includes:

markables/wsjarrau_0681_chunk_level.xml
markables/wsjarrau_0681_coref_level.xml
markables/wsjarrau_0681_enamex_level.xml
markables/wsjarrau_0681_markable_level.xml
markables/wsjarrau_0681_morph_level.xml
markables/wsjarrau_0681_parse_level.xml
markables/wsjarrau_0681_phrase_level.xml
markables/wsjarrau_0681_pos_level.xml
markables/wsjarrau_0681_section_level.xml
markables/wsjarrau_0681_sentence_level.xml
markables/wsjarrau_0681_unit_level.xml

In addition, for some corpora additional subdirectories are provided:

- RAW, with the raw text
- PARSED, with the full parse in Penn Treebank format.

INVENTORY AND DESCRIPTIONS

The contents of the directory are as follows.

Doc/ -- Documentation

This subdirectory contains the basic coding manual developed for the TRAINS data as well as addenda for GNOME and RST Discourse Treebank.

Pear_Stories -- The English Pear Stories

In 1975, a six-minute film made at the University of California at Berkeley was shown by Wallace Chafe and his collaborators to speakers of a number of languages, who were then asked to tell what happened in it. The Pear Stories is the collection of the transcripts of the narratives told by these subjects; this collection was among the first data used to study topicality and reference, as discussed in

Wallace Chafe (ed.), The Pear Stories: Cognitive, Cultural, and Linguistic Aspects of Narrative Production. Norwood, New Jersey: Ablex (1980).

The English Pear Stories can now be downloaded from the site

http://www.pearstories.org/

created by Mary S. Erbaugh, where the Pear Film can also be watched. These texts are included by kind permission of Mary S. Erbaugh.

RST_DTreeBank -- The RST Discourse Treebank

The RST Discourse Treebank is a subset (about 1/3) of the Penn Treebank whose discourse structure was annotated according to Rhetorical Structure Theory (RST) by Daniel Marcu and collaborators.

This sub-corpus is the most substantial part of the ARRAU corpus. It is divided into three subsets: train, test, and dev.

- Test contains 38 files from Section 23 of the Penn Treebank (the RST Discourse Treebank contains 64 in total), for a total of 25244 tokens.
- Train contains 149 files from Sections 06, 11 and 13 of the Penn Treebank, for a total of 109686 tokens.
- Dev contains 18 files from Sections 00 and 02 of the Penn Treebank, for a total of 12863 tokens.

Trains_91 -- The TRAINS 91 Corpus

The TRAINS project at the University of Rochester aimed to develop a conversational agent able to support planning in the transportation domain. To support the development of the system, a corpus was collected with humans playing both the role of the 'manager' and the role of the 'system'. A first batch of data was collected in 1991; a second batch was collected in 1993, using more rigorous principles and naive users. Trains_91 is the entire corpus collected in 1991.

Trains_93 -- Texts from the TRAINS 93 Corpus

This directory contains the annotation of all the files of the TRAINS 1993 corpus, available from LDC.

VPC -- The Vieira / Poesio Corpus

This directory contains the files in the Vieira / Poesio corpus, collected between 1995 and 1997 and used to develop and test the system described in Vieira and Poesio (CL 2000). It is included for those who wish to compare their system with Poesio and Vieira's. The corpus consists of 20 files from the Penn Treebank 2 for training and 15 for testing, a subset of the files in RST_DTreeBank. The files were originally annotated by Renata Vieira and four more annotators, but only definite descriptions had been annotated. The files were completely reannotated for this release using the same guidelines as the rest of the corpus.

CONTENTS OF THE SUBDIRECTORIES

The subdirectories for the sub-corpora all have the same structure. They all contain a directory called MMAX with the MMAX files.

In the case of files for which the raw text is available, this is included in a directory called RAW. In the case of the Penn Treebank 2 files, the parsed files are also included in a PARSED directory.

To summarize, each MMAX directory contains:

- a number of .header and .mmax files, one for each annotated document;
- Basedata and markables directories containing the base files and the annotations;
- a single common_paths.xml file specifying where everything is, and a common directory with the MMAX2 scheme files.

FUTURE WORK

The third release of ARRAU, expected by the end of 2014, will include a revised annotation for bridging references.

ACKNOWLEDGMENTS

ARRAU was created in part at the University of Essex between 2006 and 2007, under the direction of Ron Artstein and Massimo Poesio, and using funds from the ARRAU project. Many annotators contributed.

A second batch of annotation was done at the Johns Hopkins ELERFED workshop in Summer 2007, by Ron, Janet Hitzeman, and Massimo.

The final round of annotation was carried out at the University of Trento, using funds from the LiveMemories project, under the direction of Francesca Delogu, Kepa Rodriguez, Olga Uryupina, and Massimo.

The RST Discourse Treebank was created under the direction of Daniel Marcu at USC/ISI. It was annotated at the University of Essex and the University of Trento in collaboration with Prof. Kibrik's group from the Academy of Sciences of Moscow.

The VPC corpus was created under the direction of Renata Vieira and Massimo. Rodrigo Goulart, Mijail Kabadjov, and Janet Hitzeman contributed annotations.