===============================================
NXT SWITCHBOARD CORPUS PUBLIC RELEASE 2008
===============================================

This is the public release of the NXT-format version of the
Switchboard Corpus. Before use, you MUST read 00LICENSE-SWBD.txt in
this directory.

This file contains important notes for users of the Switchboard
Corpus concerning the annotations, describing unusual features and
approximated information.

------------------------------------------------------
Getting started
------------------------------------------------------

If you have used NXT before, you can get stuck in with the metadata at
xml/swbd-metadata.xml. You can find out about the corpus structure and
how to use the query language by reading the website and
README.SWBD-QUERIES.txt.

------------------------------------------------------
Documentation
------------------------------------------------------

Most of the documentation for the release is at
http://groups.inf.ed.ac.uk/switchboard/.

Also in this directory:

README.SWBD-QUERIES.txt    Corpus structure and queries.
README.EXERCISES.txt       Some rather old exercises to try on the data.
README.TOOLS.txt           A guide to GUIs for the data.
switchboard-guis.sh/.bat   Scripts to start the GUIs (read
                           README.TOOLS.txt first).

------------------------------------------------------
Syntactic/Phonetic trees
------------------------------------------------------

Phonetic annotations were constructed from the msstate transcripts.
They form a hierarchy phonwords--syllables--phonemes where phonwords
in the phonetic tree are parallel to terminals in the syntax tree. The
correspondence is represented by pointers from terminals to phonwords.
This relationship can then be used in queries to "jump" from one tree
to the other. (See example queries.)
This mapping is necessary partly because of the differences in the
transcriptions (the phonetic tree comes from msstate and the syntactic
tree from Penn Treebank sources) and partly because the tokenisations
of words and of phonetic words/units diverge in the case of
contractions such as "don't".

------------------------------------------------------
Approximated turn timings
------------------------------------------------------

Start and end times were calculated for the turns by looking at the
timings on the outermost of their underlying terminals. Where that
timing information is not available on the outer terminals, we
estimated it by reference to the closest available start and end times
of other turns by the same agent. In this case we also added
attributes for the "firstKnownStart" and "lastKnownEnd" times for the
turn, which represent the outermost timing information which is
available from the underlying terminals. There is also a new attribute
"approx" in every turn, which takes the values "true" or "false"
according to whether times were approximated.

------------------------------------------------------
Disfluencies
------------------------------------------------------

The disfluency annotation present in the NXT format corpus is derived
from the Penn Treebank syntactic coding, not from the disfluencies
that form part of the dialogue act coding in its original format,
which were created using a pre-Treebank version of the corpus. Parts
of that disfluency coding, such as the non-sentence elements, have no
analogue in the NXT format at present. We have not considered what the
relationship is between the Penn Treebank disfluencies (and therefore
those in the NXT format) and the ones from the dialogue act coding.

More information about the disfluency annotation that we did not
include can be found in "Dysfluency Annotation Stylebook for the
Switchboard Corpus", Marie Meteer, revised by Ann Taylor, 1995:
ftp://ftp.cis.upenn.edu/pub/treebank/swbd/doc/DFL-book.ps

In the NXT format, a disfluency contains a reparandum and a repair,
and the terminals that make up these two elements are included
directly as nite:children. Often multiple repairs are made in a
disfluency, which results in a nested structure of disfluencies within
disfluencies. This nesting only takes the form of containment: one
disfluency will not partially overlap with another.
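
The containment-only nesting described above can be pictured with a
small in-memory model. This is only an illustrative sketch in Python,
not the NXT file format or API: the class Disfluency, the functions
terminals() and depth(), and the example utterance are all hypothetical
names made up for this note. It also assumes that a nested disfluency
sits inside the reparandum or the repair of its parent, which this file
does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Disfluency:
    """Hypothetical model: a reparandum and a repair, each a list of
    terminals (plain strings here) or nested disfluencies."""
    reparandum: List["Node"]
    repair: List["Node"]


# A node is either a terminal (word string) or a nested disfluency.
Node = Union[str, Disfluency]


def terminals(node: Node) -> List[str]:
    """Return the terminals under a node in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words: List[str] = []
    for child in node.reparandum + node.repair:
        words.extend(terminals(child))
    return words


def depth(node: Node) -> int:
    """Return how many disfluencies are nested at the deepest point."""
    if isinstance(node, str):
        return 0
    children = node.reparandum + node.repair
    return 1 + max((depth(child) for child in children), default=0)


# Invented example: "I, I, I" with two repairs, modelled as one
# disfluency wholly contained in the repair of another.
inner = Disfluency(reparandum=["I"], repair=["I"])
outer = Disfluency(reparandum=["I"], repair=[inner])

print(terminals(outer))
print(depth(outer))
```

Because every nested disfluency is a child of exactly one parent, the
structure is a tree, so partial overlap between two disfluencies cannot
be expressed in this model at all.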