NXT Switchboard Annotations
|Sasha Calhoun, Jean Carletta, Daniel Jurafsky, Malvina Nissim, Mari Ostendorf, Annie Zaenen
|LDC Catalog No.:
|November 20, 2009
|natural language processing
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (NFP, Non-Member)
LDC For-Profit Membership Agreement
|Subscription & Standard Members, and Non-Members
|Calhoun, Sasha, et al. NXT Switchboard Annotations LDC2009T26. Web Download. Philadelphia: Linguistic Data Consortium, 2009.
NXT Switchboard Annotations brings together in NITE XML, a single XML format, the multiple layers of annotation performed on a transcript subset from Switchboard-1 Release 2 (LDC97S62). NXT Switchboard Annotations was developed in a collaboration among researchers from Edinburgh University, Stanford University and the University of Washington.
The original Switchboard corpus is a collection of spontaneous telephone conversations between previously unacquainted speakers of American English on a variety of topics chosen from a pre-determined list. A subset of one million words from those conversations was annotated for syntactic structure and disfluencies as part of the Penn Treebank project. Phonetic transcripts were generated by the International Computer Science Institute, University of California, Berkeley, and later corrected by the Institute for Signal Information Processing, Mississippi State University. The Penn Treebank transcripts provided the basis for the NXT Switchboard corpus, and the noun phrases from that subset were annotated for animacy. The Treebank transcript was then aligned with the corresponding subset from the corrected Mississippi State (MS-State) transcript in order to provide word timing information. Focus/contrast and prosodic annotations, as well as phone/syllable alignments, were then added. The previous annotations of dialog acts and prosody were converted to NITE XML. Lastly, hand annotations for markables were added to provide information about their animacy and information structure, including coreferential links.
NXT is an open-source toolkit that enables multiple linguistic annotations to be assembled into a unified database. It uses a stand-off XML data format that consists of several XML files that point to each other. The NXT format provides a data model that describes how the various annotations for a corpus relate to one another. For that reason, it does not impose any particular linguistic theory or any particular markup structure. Instead, users define their annotations in a "metadata" file that expresses their contents and how they relate to each other in terms of the graph structure for the corpus annotations overall. The relationships that can be defined in the data model draw annotations together into a set of intersecting trees, but also allow arbitrary links between annotations over the top of this structure. The resulting representation is highly expressive, easier to process than arbitrary graphs, and structured in a way that helps data users. NXT's other core component is a query language designed specifically for working with data conforming to this data model. Together, the data model and query language allow annotations to be treated as one coherent set containing both structural and timing information.
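As a rough illustration of the stand-off idea, the following Python sketch resolves pointers from a syntax layer into a separate transcription layer and derives a time span for a constituent from the words it dominates. The element names (`word`, `nt`, `child`), the attributes (`ref`, `start`, `end`) and the sample data are hypothetical simplifications chosen for this example; they are not the actual NXT Switchboard schema, and the sketch does not use the NITE XML Toolkit or its query language.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-off layers. In a real corpus these would be
# separate files; names and attributes here are illustrative only.
TERMINALS_XML = """
<terminals>
  <word id="w1" start="0.00" end="0.21">i</word>
  <word id="w2" start="0.21" end="0.48">like</word>
  <word id="w3" start="0.48" end="0.90">jazz</word>
</terminals>
"""

SYNTAX_XML = """
<syntax>
  <nt id="s1" cat="S">
    <nt id="np1" cat="NP"><child ref="w1"/></nt>
    <nt id="vp1" cat="VP">
      <child ref="w2"/>
      <nt id="np2" cat="NP"><child ref="w3"/></nt>
    </nt>
  </nt>
</syntax>
"""


def index_by_id(root):
    """Map each id attribute in a layer to its element."""
    return {el.get("id"): el for el in root.iter() if el.get("id")}


def leaves(node, terminals):
    """Yield the terminals dominated by a nonterminal, in document order."""
    for sub in node:
        if sub.tag == "child":
            # Stand-off pointer: follow it into the transcription layer.
            yield terminals[sub.get("ref")]
        else:
            yield from leaves(sub, terminals)


def span(node, terminals):
    """Return (text, start, end) for a nonterminal, inherited from its words."""
    words = list(leaves(node, terminals))
    text = " ".join(w.text for w in words)
    return text, float(words[0].get("start")), float(words[-1].get("end"))


terminals = index_by_id(ET.fromstring(TERMINALS_XML))
syntax = index_by_id(ET.fromstring(SYNTAX_XML))
print(span(syntax["vp1"], terminals))
```

The syntax layer stores no words or times of its own; both are recovered by following pointers, which is what lets structural and timing information be read as one set.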
The data in NXT Switchboard Annotations was converted from the Penn Treebank bracketed format, in which the Switchboard corpus was originally distributed, using an XML-based tool for syntactic query that comes with a ready-made Switchboard converter. Conversion was performed using a set of XSL stylesheets to extract each of the multiple XML files associated with one dialogue. The data was divided into separate XML files representing the orthographic transcription, syntax, turn structure, disfluencies, and movement (the relationship between traces and their sources). Transcription consists of a flat list of terminals: words, punctuation, traces, and so on. Syntax starts with a flat list of parses and works down through nonterminals, grounding in terminals (which are in the transcription file, but are referenced by pointers that indicate they are to be treated as if they were part of the tree itself). Turn structure is simply a flat list of turns that themselves contain parses as children, again via pointers into the syntax file. Yet another file couples reparanda and repairs into disfluencies by pointing to the appropriate nonterminals using named roles. A movement file similarly links sources with their target traces.
While this representation may seem awkward, it has advantages over the original arrangement. First, it places the information in a single tree structure, with co-indexing for the crossing links that are sometimes required for disfluency and movement. Second, it facilitates querying the crossing structures, since they are treated on a par with other structures within the data. Although this ease is not particularly important for the initial, syntactic data, it is crucial for a correct understanding of discourse phenomena such as coreference. Third, separating the tags into their various types makes it easier to add data using external processes (part-of-speech taggers, named entity recognizers, and the like). Fourth, different people can change different data files at the same time without conflict, as long as neither edits the files they point to and both are able to lock complete paths of files pointing to the data they are revising. Last, a data set can be loaded in whole or in part, speeding up some processing. The NITE XML Toolkit itself treats the data seamlessly no matter whether it is in one file or many.
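The coupling of reparanda and repairs through named roles can be sketched in the same spirit. In the Python example below, a disfluency layer points into a syntax layer, and each pointer carries a role. The element names (`disfluency`, `pointer`, `nt`), the `role`, `ref` and `text` attributes, and the sample utterance are assumptions made for illustration, not the corpus's real markup; nonterminals carry their text directly only to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified layers; in a stand-off corpus each would be its
# own file. Names and attributes are illustrative only.
SYNTAX_XML = """
<syntax>
  <nt id="nt1" cat="EDITED" text="i want"/>
  <nt id="nt2" cat="S" text="i need a break"/>
</syntax>
"""

DISFLUENCY_XML = """
<disfluencies>
  <disfluency id="d1">
    <pointer role="reparandum" ref="nt1"/>
    <pointer role="repair" ref="nt2"/>
  </disfluency>
</disfluencies>
"""


def resolve_roles(disfluency, nonterminals):
    """Map each named role on a disfluency to the nonterminal it points at."""
    return {
        p.get("role"): nonterminals[p.get("ref")]
        for p in disfluency.findall("pointer")
    }


# Only the two layers needed for this question are loaded; turn structure,
# movement and any other layers can stay untouched on disk.
nonterminals = {nt.get("id"): nt for nt in ET.fromstring(SYNTAX_XML).iter("nt")}

pairs = []
for d in ET.fromstring(DISFLUENCY_XML).iter("disfluency"):
    roles = resolve_roles(d, nonterminals)
    pairs.append((roles["reparandum"].get("text"), roles["repair"].get("text")))

print(pairs)
```

Because the disfluency layer only references the syntax layer, it could be revised or regenerated without editing the syntax file itself, which is the kind of independence between layers the advantages above rely on.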
This corpus is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license. NXT Switchboard Annotations is available to LDC's for-profit members under the terms of their For-Profit Membership Agreements.