======================================================
SWITCHBOARD QUERIES
======================================================

This file provides an introduction to writing queries for the Switchboard corpus using NXT. Contents are:

-- Some Recommendations for Trying Queries
-- Introduction to Queries
-- Further Comments on Usage
-- The Corpus Structure
-- Advanced Examples
   -- Some simple queries
   -- Phonetic Structure (MS-State)
   -- Word -> Phonword Mapping
   -- Dialogue - Speaker - Topic
   -- Markables
   -- Disfluency Structure
   -- Use of SearchAndFilter

For information about tools in which to try the queries, see README.TOOLS.txt.

======================================================
Introduction to Queries
======================================================

Queries consist of a number of variable bindings followed by a colon and then the match conditions. A variable binding has a variable (with a dollar sign) and an (optional) data type.

The data model here is of data elements with a type and a set of string or numeric attributes that can have children (in this corpus, always ordered) and/or pointers with named roles to other elements (always unordered). Think of this as a tree structure for the basic syntax with some arbitrary graph structure over the top for things that don't fit into the tree. (NITE allows for a multi-rooted tree where each node can have more than one parent of different types, unordered with respect to each other, but only one ordered set of children --- but we don't make use of this facility at the moment.)

($m markable):($m@animacy == "human-group")

This means "Markables where the animacy code is human-group". @ means "attribute".

($m markable):($m@animacy ~ /.*human.*/)

This means "Markables where the animacy code has human in it somewhere." The dot (.) means any character, and the star (*) means zero or more times. This is what's called a "regular expression".
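The matching behaviour of these patterns can be tried out away from the corpus. The following is only a rough sketch in Python, not NXT code: it assumes that Python's re.fullmatch is a fair stand-in for the ~ operator, which matches against the whole attribute value. The helper name nxt_style_match and the code "place" are made up for illustration; "human-group" and "org-human" are codes mentioned in this file.

```python
import re

# "human-group" and "org-human" appear in this document; "place" is a
# made-up stand-in for some non-human animacy code.
codes = ["human-group", "org-human", "place"]


def nxt_style_match(pattern: str, value: str) -> bool:
    """Whole-string regular expression match.

    Assumption: Python's regex syntax is close enough to the one NXT uses
    for these simple patterns; this is an illustration, not NXT itself.
    """
    return re.fullmatch(pattern, value) is not None


# Try a pattern that allows anything on both sides, one that requires the
# value to start with "human", and one that requires exactly "human".
for pattern in (".*human.*", "human.*", "human"):
    hits = [code for code in codes if nxt_style_match(pattern, code)]
    print(f"/{pattern}/ matches {hits}")
```

Only the first pattern accepts both human codes, because it is the only one that allows extra characters on both sides of "human".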

($m markable):($m@animacy ~ /human.*/)

This means "Markables where the animacy code is human followed by any number of other characters". Note that this doesn't pick up what you coded as org-human. That's because the code has to *start* with human.

($m markable):($m@animacy ~ /human/)

This doesn't pick up any markables, because all of the codes are human followed by something or preceded by something, and in this query language, regular expressions specify complete matches.

($m markable):($m@animacy == "human") && ($m@status=="old")

&& means "and"; similarly, || for "or".

($n nt)($w word):($n ^ $w) && ($w@pos=="VBZ") && ($n@cat="NP")

^ means "is an ancestor of". In this corpus, an nt is a (nonterminal) syntactic constituent. So this finds pairs of nts and words where the word is in the nt, the nt is a noun phrase, and the word has part of speech VBZ. Verify this by seeing that each result has bindings for two things, one of which is an nt and the other of which is a word. Note that the same word can show up in two returns, if it is in two NPs (one embedded in the other).

======================================================
Further Comments on Usage
======================================================

Note that

($n nt)($w word):

simply gives all the nt/word pairs --- a vast number. A common mistake in queries is to forget the relational conditions (in this case, the one with ^).

Also, perhaps counter-intuitively, an element is ^ itself. For this reason, the idiom

($a)($b): ($a ^ $b) && ($a != $b)

is common in queries where the variables are bound to the same data type.

By the way, ($a): matches every element in the corpus regardless of data type, and it is possible to match on a type disjunction, i.e., ($n nt | word): will match on all nts and all words.
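The point that an element is ^ itself can be seen on a toy tree. The sketch below is not NXT code: the Element class and the dominates helper are hypothetical names invented here, and the tree (an NP embedded in another NP, containing one word) is made up to mirror the embedded-NP case mentioned above.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity, like distinct corpus elements
class Element:
    kind: str                                   # data type, e.g. "nt" or "word"
    attrs: dict = field(default_factory=dict)   # string or numeric attributes
    children: list = field(default_factory=list)  # ordered children


def dominates(a: Element, b: Element) -> bool:
    """Reflexive dominance: true when a is b, or a is an ancestor of b."""
    return a is b or any(dominates(child, b) for child in a.children)


# Toy data: an outer NP containing an inner NP containing a single word.
word = Element("word", {"pos": "NN"})
inner = Element("nt", {"cat": "NP"}, [word])
outer = Element("nt", {"cat": "NP"}, [inner])
nts = [outer, inner]

# Analogue of ($a nt)($b nt): ($a ^ $b), which pairs each nt with itself too.
pairs = [(a, b) for a in nts for b in nts if dominates(a, b)]

# Analogue of ($a nt)($b nt): ($a ^ $b) && ($a != $b), proper dominance only.
proper = [(a, b) for a, b in pairs if a is not b]

# Analogue of ($n nt)($w word): ($n ^ $w); the word is returned once per NP.
word_pairs = [(n, word) for n in nts if dominates(n, word)]

print(len(pairs), len(proper), len(word_pairs))
```

Without the inequality condition the two self-pairs are included in the result; with it, only the outer/inner pair remains.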

($m markable)($n nt):($m@animacy == "human") && ($m >"at" $n) && ($n@subcat="SBJ")

In the corpus, most markables point at nts (nonterminals); this query finds such markables with human animacy in subject position. Pointers always have some role name given by whoever designed the corpus; in this case, it is "at". Other markables use "terminalat" roles to point at words. You can find either by using a disjunctive type and not specifying the role:

($m markable)($n nt|word):($m>$n)

($n nt)($w1 word)($w2 word): ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) && ($w1@pos = "DT") && ($w2@pos="NN")

nts that contain both a DT and an NN. Of course, this can match on different (embedded) nts for the same DT/NN pair.

($n nt)($w1 word)($w2 word): ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) && ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2)

The same, but the DT has to be before the NN.

($n nt)(exists $w1 word)(exists $w2 word): ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) && ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2)

For when you get tired of seeing the words in the match list. Exists does the same match but doesn't return the variable in the result set.

($m1 markable)($m2 markable)(exists $l link): ($l >"antecedent" $m1) && ($l >"anaphor" $m2)

Pairs of markables in the same coreferential link.

($m1 markable)($m2 markable)(exists $l link): ($l >"antecedent" $m1) && ($l >"anaphor" $m2) && ($m1@animacy != $m2@animacy)

Same, but where the two markables don't have the same animacy code. There are two points to this query: (1) you don't have to specify a textual string in the inequality condition as long as you can get one from somewhere, and (2) queries for which you expect no matches are worth running, because any matches they do return can diagnose problems with the annotation (in this case, of course, more match conditions are needed).

($w1 word)($w2 word):($w1 <> $w2)

Pairs of words where the first precedes the second.
Note that this says nothing about being in the same sentence; that would be

($w1 word)($w2 word)($n nt):($w1 <> $w2) && ($n@cat=="S") && ($n^$w1) && ($n^$w2)

($w1 word)($n nt)(exists $w2 word):($n@cat=="S") && ($n^$w1) && ($n^$w2) && ($w1 <> $w2)

All words excluding last words of sentences (there has to be some later word in the same sentence).

($w1 word)($n nt)(forall $w2 word):($n@cat=="S") && ($n^$w1) && ((($w1 != $w2) && ($n^$w2)) -> ($w1 <> $w2))

Only the first words of sentences. Note the inequality condition; forall really means for *all*.

($w1 word)($w2 word)(forall $w3 word): ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2) && ((($w1 != $w3) && ($w2 != $w3) && ($w1 <> $w3)) -> ($w2 <> $w3))

DTs and NNs adjacent to each other with the DT first (i.e., forall other words, if they're after the DT they're also after the NN). If this is too slow on your machine, try

($t1 turn)($t2 turn)(forall $t3 turn): ($t1 <> $t2) && ((($t1 != $t3) && ($t2 != $t3) && ($t1 <> $t3)) -> ($t2 <> $t3))

for adjacent turns by the same speaker (which is faster because there are fewer of them). The query means "pairs of turns where there isn't another turn structurally between them". Because the turn files are organized one per speaker with no explicit turn order between speakers, the only way to get "adjacent" turns for different speakers is to express conditions based on the timings.

Note that on forall queries, the interim reports about numbers of matches found are a bit wonky! They show the real number of matches multiplied by the number of bindings tested for the forall variable.

($w word): ($w@pos="PRP$")::($m markable):($m >"terminalat" $w)

A complex query; the first query (before the ::) is matched, and then its results are passed to the second query, which can bind new variables as well as referring to the old ones. The return list is hierarchically structured; for each match n-tuple to the first query, one gets a list of match n-tuples to the second.
Beware: if there are no matches to the second query for some match to the first query, then that match to the first query is removed from the result list. (This makes sense in database terms but some people find this strongly counter-intuitive.)

($n nt)($w word): ($n ^ $w) && (ID($w) == "s1_1")

Every data element has a unique id, which can be used in queries in this way. There's no reason you would want to do this except when you can't figure out why a query is going wrong and want to quickly find out whether a specific example is on the return list.

($w word): (TEXT($w) == "the")

This is how to query the orthography. Posix regular expressions work here, too:

($w word): (TEXT($w) ~/the.*/)

keeping in mind that the regexp must match the entire string.

======================================================
The Corpus Structure
======================================================

You can't write queries without understanding the structure of the corpus. First, we gloss the most important relationships for an easy start, but the only way to get at everything is to read the metadata file (and perhaps look at some of the data for reassurance), so we also explain how to do that.

The corpus uses parenthood for the following relationships:

turn ^ parse ^ nt ^ (word | sil | trace | punc)

with any number of levels of nt, usually starting at the top with an nt with cat S.

It uses pointers for the following relationships:

markable >"at" nt
markable >"terminalat" word
   (the word cases are just possessive pronouns. It is intended that every markable will have an "at" role or a "terminalat" role, but not both.)
movement >"source" nt
movement >"target" trace OR movement >"target-syn" nt
link >"antecedent" markable
link >"anaphor" markable

It uses the following attributes:

nt has cat (S, NP, VP, SBARQ, ...)
       subcat (SBJ, ...)
word has pos (VBZ, ...)

This list is *not* complete. We think that everything in the original Switchboard data has been preserved in some way.

To find out exactly what the corpus structure is, open swbd/swbd-metadata.xml. (Many web browsers will make a display for XML files that is easier to read than in, say, emacs, so do try that first.)

If you see

<code name="FOO">

then FOO is a valid data type. The definition of FOO runs until you see </code>, but <code name="FOO"/> is shorthand for <code name="FOO"></code>.

If you see

<attribute name="BAR" ...>

then the containing code has that attribute, so you can say ($n FOO):($n@BAR). Enumerated attributes must choose a value from the given list; otherwise they can be free-value strings or numbers.

The layer structure defines the permissible relations among data types (or codes). Each layer can be uniquely identified by name and defines a set of data types that are interchangeable in the structure because they can occur in the same positions. A structural layer can point to another layer, which means that elements with data types in the former layer have children with data types drawn from the data types in the latter layer. If it points recursively, then there are any number of layers of the former type ending in one of the latter type (this is handy, say, for syntax).

If you see

<pointer number="BAR" role="BAZ" target="BAM"/>

then the containing code can have pointers with role BAZ where the element pointed to has a data type drawn from the layer BAM. BAR can be an integer (pointer points to exactly that many elements), * (points to zero or more; i.e. Kleene *) or + (points to one or more). But I'm not sure how well the implementation enforces these number definitions. The design is meant to restrict pointers to featural layers, but the implementation is actually more flexible, with pointers allowed anywhere.

Layers are themselves separated into codings. This isn't very important for this corpus (the codings are what allows for multi-rooted trees) but it does tell you what file to look in for the elements of a particular type; each coding is stored in a different XML file.

Where files need to refer to each other, they use stand-off annotation.
For instance,

<nt ...>
   <nite:child href="sw2065.terminals.xml#id(s1_1)"/>
   <nite:child href="sw2065.terminals.xml#id(s1_2)"/>
</nt>

means the nt dominates/has as children two elements, the ones in the file sw2065.terminals.xml with the ids s1_1 and s1_2. Domination can also be represented in a single file by containment:

<foo><baz/></foo>

means foo dominates baz.

<nite:child href="sw2065.terminals.xml#id(s1_1)..id(s1_5)"/>

means *all* elements between s1_1 and s1_5 in the named file regardless of type or id and is only defined if s1_1 and s1_5 are sisters under the same element.

The file syntax for pointers is very similar; e.g.

<nite:pointer role="at" href="..."/>

The metadata contains everything one needs to know about corpus structure, but some people find it easier to look at sample data itself.

======================================================
Advanced Examples
======================================================

---------------------------------------------------------
Some fairly simple queries
---------------------------------------------------------

Find all non-boundary words inside turns:

($w word)($t turn): $t^$w && START($t)<START($w) && END($t)>END($w)

Find each turn whose start is later than its end (should have no results!):

($t turn): START($t)>END($t)

All words that are both children of an NP and old:

($np nt) ($m markable) ($w word): ($np@cat=="NP") && ($m@status=="old") && ($np ^ $w) && ($m >"at" $np)

Find all phonwords containing an "ax" phoneme:

($pw phonword)(exists $ph ph): TEXT($ph)=="ax" && $pw^$ph

Find all phonwords containing an "ax" phoneme in an unstressed syllable:

($pw phonword)(exists $s syllable)(exists $ph ph): TEXT($ph)=="ax" && $pw^$s && $s@stress=="n" && $s^$ph

---------------------------------------------------------
Word -> Phonword Mapping
---------------------------------------------------------

Being sourced from two versions of the transcript, the NXT Switchboard corpus has to provide a link between the two and the different levels of annotation they comprise. Linking the Penn and MS-State annotation structures is an NXT pointer, which points from the word terminals to a parallel annotation called 'phonword'.
Users need to describe this link when constructing queries involving both transcripts. For instance, if you wanted to extract all verbs and their primary stresses, the query would look like this:

($w word)($pw phonword)($s syllable): ($w@pos ~ /V.*/) && ($s@stress=="p") && ($w > $pw) && ($pw ^ $s)

That is, you need a separate variable for the word in each of the two transcripts ($w and $pw), as well as a condition representing the relationship between them ($w > $pw). Users then need to take care to query each attribute and relationship on the right sort of word, e.g. here the part-of-speech information (pos) is an attribute of word elements, while syllable elements are situated underneath phonword elements.

---------------------------------------------------------
Dialogue - Speaker - Topic
---------------------------------------------------------

A dialogue has pointers to a topic and to two speakers (with pointer roles 'topic', 'A' and 'B'). For example, to search for the topics of dialogues where Northern dialect is used:

($d dialogue)(exists $s speaker): $d>$s && $s@dialect="NORTHERN" :: ($t topic): $d>$t

The knowledge of which speaker is which within a dialogue is inherent in the names of the pointers ('A' or 'B'); it cannot be stored in the speaker elements, as speakers appear in multiple dialogues. Therefore it is necessary to distinguish the speakers within the query if you wish to use this information:

($d dialogue)($sa speaker)($sb speaker): $d>"A"$sa && $d>"B"$sb

Find dialogues with same-sex speakers:

($d dialogue)($sa speaker)($sb speaker): $d >"A" $sa && $d >"B" $sb && $sa@sex == $sb@sex

---------------------------------------------------------
Markables
---------------------------------------------------------

In the NXT data, "markables" were added automatically to form the basis of the information structure and animacy codings.
The researchers involved used a predecessor of the current Index command line utility, running it using the following two queries, in order:

/* FIRST QUERY */
($n nt)(forall $up nt):
(($n@cat == 'NP') or ($n@cat == 'WHNP'))     /* match only NP or WHNP's */
and (not                                      /* without the banned subcats */
  (($n@subcat ~ /.*ADV.*/) or
   ($n@subcat ~ /.*LOC.*/) or
   ($n@subcat ~ /.*TMP.*/) or
   ($n@subcat ~ /.*DIR.*/) or
   ($n@subcat ~ /.*UNF.*/)))                  /* (including UNF, or unfinished, as banned) */
and                                           /* where in addition forall nts */
((($n != $up) and ($up ^ $n)) ->              /* if it truly dominates the match then */
  ((not ($up@cat == 'EDITED')) and            /* it isn't EDITED (part of a disfluency) and */
   (not (($up@cat == 'ADVP') and              /* it isn't ADVP with a banned subcat */
         (($up@subcat ~ /.*LOC.*/) or
          ($up@subcat ~ /.*DIR.*/) or         /* (note the ADVP can be UNF, we don't mind) */
          ($up@subcat ~ /.*TMP.*/)))) and
   (not                                       /* and it isn't an NP with subcat TMP */
     (($up@cat == 'NP') and ($up@subcat ~ /.*TMP.*/))) and
   (not                                       /* and it isn't a PP with subcat TMP */
     (($up@cat == 'PP') and ($up@subcat ~ /.*TMP.*/)))))

The first query means "NPs and WHNPs that aren't adverbials, locatives, temporals, directionals, or unfinished, and where there aren't any dominating nts marked as EDITED (that is, disfluent), as locative, directional, or temporal adverbials, or as temporal NPs or PPs."

/* SECOND QUERY */
($w word)(exists $n nt)(exists $m markable)(forall $up nt):
($w@pos = 'PRP$') and ($n ^ $w) and ($m >'at' $n) and
(($up != $n) -> (not (($n ^ $up) and ($up ^ $w))))

The second query means "Possessive pronouns where the first nt you get to by climbing up counts as a markable."

The "active" codes similarly were added automatically, based on the syntactic coding.

---------------------------------------------------------
Disfluency Structure
---------------------------------------------------------

When querying the disfluency format, users should be aware that disfluencies can be nested.
Thus, to retrieve all the terminals within a disfluency, it is necessary to look for all terminals which a disfluency dominates, not just those it directly dominates. For example, the query:

(exists $d disfluency)($w word): $d^$w

will find all word terminals which are part of a disfluency. The same thing without the existential quantifier:

($d disfluency)($w word): $d^$w

will find all word - disfluency pairs where the former is contained in the latter. Note that this will likely produce duplicate words, due to the nesting of disfluencies.

Find repairs containing the word "but":

(exists $d disfluency)(exists $w word)($rep repair): $w@orth="but" && $d^$rep && $rep^$w

---------------------------------------------------------
Use of SearchAndFilter
---------------------------------------------------------

The NXT tool SearchAndFilter allows you to query the data, and then output attributes of the variables queried in tab-delimited format:

java SearchAndFilter -corpus swbd-metadata.xml -observation sw2145 \
  -query '($w word) ($pw phonword) ($a accent): ($w@pos ~ /V.*/) && ($a@type=="nuclear") && ($w > $pw) && ($a > $pw)' \
  -filter '$w@orth' '$w@nite:start' '$w@nite:end' '$a@nite:start'
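
Tab-delimited output of this kind is easy to post-process. The following Python sketch is one possible way to do it, under stated assumptions: that each match comes out as one line with one column per -filter argument, in the order given. The sample rows are invented placeholders rather than corpus data, and the column names and the parse_rows helper are hypothetical names chosen here.

```python
import csv
import io

# Invented placeholder rows (NOT corpus data): orth, word start, word end,
# accent start, separated by tabs, one line per match.
sample_output = "know\t1.20\t1.45\t1.31\nthink\t3.02\t3.30\t3.15\n"

# Column names chosen for this sketch, in the order of the -filter arguments.
columns = ["orth", "word_start", "word_end", "accent_start"]


def parse_rows(text: str) -> list:
    """Turn tab-delimited text into a list of dicts with numeric timings."""
    reader = csv.reader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for fields in reader:
        row = dict(zip(columns, fields))
        for key in columns[1:]:
            row[key] = float(row[key])  # timings are read as numbers
        rows.append(row)
    return rows


for row in parse_rows(sample_output):
    # Offset of the accent from the start of the word, in the units of the timings.
    offset = row["accent_start"] - row["word_start"]
    print(row["orth"], round(offset, 3))
```

In practice the text would come from the saved output of the command above rather than from a string literal.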