======================================================
		SWITCHBOARD QUERIES
======================================================

This file provides an introduction to writing queries for the
Switchboard corpus using NXT. Contents are:

-- Some Recommendations for Trying Queries
-- Introduction to Queries
-- Further Comments on Usage
-- The Corpus Structure
-- Advanced Examples
    -- Some fairly simple queries
    -- Phonetic Structure (MS-State)
    -- Word -> Phonword Mapping
    -- Dialogue - Speaker - Topic
    -- Markables
    -- Disfluency Structure
    -- Use of SearchAndFilter

For information about tools in which to try the queries, see
README.TOOLS.txt.

======================================================
	Introduction to Queries
======================================================

Queries consist of a number of variable bindings followed by
a colon and then the match conditions.  A variable binding has
a variable (with a dollar sign) and an (optional) data type.
The data model here is of data elements with a type and a set
of string or numeric attributes that can have children (in this
corpus, always ordered) and/or pointers with named roles to other 
elements (always unordered).  Think of this as a tree structure
for the basic syntax with some arbitrary graph structure over
the top for things that don't fit into the tree.  (NITE allows
for a multi-rooted tree where each node can have more than one
parent of different types, unordered with respect to each other, 
but only one ordered set of children --- but we don't make use
of this facility at the moment.)

   ($m markable):($m@animacy == "human-group")

This means "Markables where the animacy code is human-group".
@ means "attribute".  

   ($m markable):($m@animacy ~ /.*human.*/)

This means "Markables where the animacy code has human in it
somewhere."  The dot (.) means any character, and the star (*) means
zero or more times.  This is what's called a "regular expression".

   ($m markable):($m@animacy ~ /human.*/)

This means "Markables where the animacy code is human followed by any
number of other characters".  Note that this doesn't pick up markables
coded as org-human.  That's because the code has to *start* with human.

   ($m markable):($m@animacy ~ /human/)

This doesn't pick up any markables, because all of the codes are human
followed by something or preceded by something, and in this query
language, regular expressions specify complete matches.
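
The difference between complete and partial matching can be tried
outside NXT.  The following is a minimal Python sketch, not NXT code:
re.fullmatch stands in for the ~ operator, and the list of codes is a
small sample of values mentioned in this file, not the full coding
scheme.

```python
import re

# Sample animacy codes mentioned in this file; NOT the complete coding scheme.
codes = ["human-group", "org-human", "nonconc"]

def matching(pattern):
    """Return the codes that the pattern matches completely, start to end."""
    return [c for c in codes if re.fullmatch(pattern, c) is not None]

print(matching(r".*human.*"))  # human anywhere in the code
print(matching(r"human.*"))    # code must start with human
print(matching(r"human"))      # code must be exactly human
```

With this sample, the first pattern keeps both human codes, the second
keeps only human-group, and the third keeps nothing.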

   ($m markable):($m@animacy == "human") && ($m@status=="old")

&& means "and"; similarly, || for "or".  

   ($n nt)($w word):($n ^ $w) && ($w@pos=="VBZ") && ($n@cat="NP")

^ means "is an ancestor of".  In this corpus, an nt is a (nonterminal)
syntactic constituent.  So this finds pairs of nts and words where the
word is in the nt, the nt is a noun phrase, and the word has part of speech
VBZ.  Verify this by seeing that each result has bindings for two things,
one of which is an nt and the other of which is a word.  Note that the
same word can show up in two returns, if it is in two NPs (one embedded
in the other).
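
To see why one word can appear in two results, here is a small Python
sketch of the same dominance test over a toy tree.  The element ids and
the tree itself are invented for illustration; only the logic mirrors
the query above.

```python
# Toy tree: an outer NP containing an inner NP, which contains a VBZ word.
# All ids below are invented placeholders, not corpus data.
children = {
    "np_outer": ["np_inner", "w_other"],
    "np_inner": ["w_is"],
}
cat = {"np_outer": "NP", "np_inner": "NP"}
pos = {"w_is": "VBZ", "w_other": "NN"}

def dominates(a, b):
    """True if a is b itself or an ancestor of b (the ^ relation)."""
    if a == b:
        return True
    return any(dominates(c, b) for c in children.get(a, []))

# Same three conditions as the query: dominance, word pos, nt category.
pairs = [(n, w) for n in cat for w in pos
         if dominates(n, w) and pos[w] == "VBZ" and cat[n] == "NP"]
print(pairs)
```

The single VBZ word is paired once with each of the two NPs that
contain it.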


======================================================
	Further Comments on Usage
======================================================

Note that

   ($n nt)($w word):

simply gives all the nt/word pairs --- a vast number.  A common mistake
in queries is to forget the relational conditions (in this case, the
one with ^).  Also, perhaps counter-intuitively, an element is ^ itself.
For this reason, the idiom

   ($a)($b): ($a ^ $b) && ($a != $b)

is common in queries where the variables are bound to the same data type.
By the way,

   ($a):

matches every element in the corpus regardless of data type, and it is
possible to match on a type disjunction, i.e.,

   ($n nt | word):

will match on all nts and all words.

   ($m markable)($n nt):($m@animacy == "human") && ($m >"at" $n)  && ($n@subcat="SBJ")

In the corpus, most markables point at nts (nonterminals); this query finds
such markables with human animacy in subject position.  Pointers always have
some role name given by whoever designed the corpus; in this case, it is
"at".  Other markables use "terminalat" roles to point at words.  You can find
either by using a disjunctive type and not specifying the role:

  ($m markable)($n nt|word):($m>$n)

   ($n nt)($w1 word)($w2 word): ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) &&
      ($w1@pos = "DT") && ($w2@pos="NN")

nts that contain both a DT and an NN.  Of course, this can match on
different (embedded) nts for the same DT/NN pair.

   ($n nt)($w1 word)($w2 word): ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) &&
      ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2)

The same, but the DT has to be before the NN.

   ($n nt)(exists $w1 word)(exists $w2 word): 
      ($n ^ $w1) && ($n ^ $w2) && ($w1 != $w2) &&
      ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2)

For when you get tired of seeing the words in the match list.  Exists
does the same match but doesn't return the variable in the result
set.

   ($m1 markable)($m2 markable)(exists $l link):
      ($l >"antecedent" $m1) && ($l >"anaphor" $m2)

Pairs of markables in the same coreferential link.

   ($m1 markable)($m2 markable)(exists $l link):
      ($l >"antecedent" $m1) && ($l >"anaphor" $m2) && 
      ($m1@animacy != $m2@animacy)

Same, but where the two markables don't have the same animacy code. 
There are two points to this query:  (1) you don't have to specify
a textual string in the inequality condition as long as you can get
one from somewhere, and (2) one might wish to consider queries where
one expects no matches because they can diagnose problems with the 
annotation (in this case, of course, more match conditions are needed).

   ($w1 word)($w2 word):($w1 <> $w2)

Pairs of words where the first precedes the second.  Note that
this says nothing about being in the same sentence; that would be

   ($w1 word)($w2 word)($n nt):($w1 <> $w2) && ($n@cat=="S")&&
       ($n^$w1)&&($n^$w2)

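   ($w1 word)($n nt)(exists $w2 word):($n@cat=="S")&& ($n^$w1)&&
       ($n^$w2) && ($w1 <> $w2)

All words excluding last words of sentences: each $w1 must be followed
by some word $w2 in the same S.  (Under exists, use && rather than ->;
an implication would be satisfied trivially by any $w2 outside the
sentence.)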

   ($w1 word)($n nt)(forall $w2 word):($n@cat=="S")&& ($n^$w1)&&
       ((($w1 !=$w2) && ($n^$w2)) ->($w1 <> $w2))

Only the first words of sentences.  Note the inequality condition;
forall really means for *all*.
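
The role of the inequality condition can be checked with a short Python
sketch in which all() plays the part of forall.  The three word ids are
invented, and the sketch assumes that an element does not precede
itself.

```python
# One toy sentence; word ids are invented and listed in document order.
sentence = ["w1", "w2", "w3"]
position = {w: i for i, w in enumerate(sentence)}

def precedes(a, b):
    """Strict precedence (assumption: an element does not precede itself)."""
    return position[a] < position[b]

def first_words(with_inequality):
    """Words w1 such that forall w2: antecedent -> w1 precedes w2."""
    found = []
    for w1 in sentence:
        # The 'if' clause is the antecedent of the implication.
        if all(precedes(w1, w2) for w2 in sentence
               if (w1 != w2 or not with_inequality)):
            found.append(w1)
    return found

print(first_words(with_inequality=True))
print(first_words(with_inequality=False))
```

With the inequality the first word is found; without it nothing is,
because the forall variable also ranges over $w1 itself.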

   ($w1 word)($w2 word)(forall $w3 word): 
      ($w1@pos = "DT") && ($w2@pos="NN") && ($w1 <> $w2) && 
      ((($w1 != $w3) && ($w2 != $w3)&& ($w1 <> $w3)) -> ($w2 <> $w3))

DTs and NNs adjacent to each other with the DT first (i.e., forall
other words, if they're after the DT they're also after the NN).
If this is too slow on your machine, try

   ($t1 turn)($t2 turn)(forall $t3 turn): 
      ($t1 <> $t2) && 
      ((($t1 != $t3) && ($t2 != $t3)&& ($t1 <> $t3)) -> ($t2 <> $t3))

for adjacent turns by the same speaker 
(which is faster because there are fewer of them).  The query means
"pairs of turns where there isn't another turn structurally between
them" - because the turn files are organized one per speaker with no
explicit turn order between speakers, the only way to get "adjacent"
turns for different speakers is to express conditions based on the timings.
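
As a rough illustration of the timing-based approach, the following
Python sketch orders turns from both speakers by start time and keeps
neighbouring pairs where the speaker changes.  The turn ids and times
are invented, and this is ordinary post-processing of query results,
not a query.

```python
# Invented turns as (id, speaker, start, end), with times in seconds.
turns = [
    ("tA1", "A", 0.0, 2.1),
    ("tA2", "A", 4.2, 6.5),
    ("tB1", "B", 2.3, 4.0),
    ("tB2", "B", 6.6, 7.0),
]

# Order all turns by start time, regardless of which speaker produced them.
ordered = sorted(turns, key=lambda t: t[2])

# Keep neighbouring pairs in which the speaker changes.
adjacent = [(a[0], b[0]) for a, b in zip(ordered, ordered[1:]) if a[1] != b[1]]
print(adjacent)
```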

Note that on forall queries, the interim reports about numbers of
matches found are a bit wonky!  They are multiples of the real
number of matches by the number of bindings tested for the forall
variable.

   ($w word): ($w@pos="PRP$")::($m markable):($m >"terminalat" $w)

A complex query; the first query (before the ::) matches, and
then the results are passed to the second query, which can
bind new variables as well as referring to the old ones.  The
return list is hierarchically structured; for each match n-tuple
to the first query, one gets a list of match n-tuples to the
second.  Beware:  if there are no matches to the second query
for some match to the first query, then that match to the first
query is removed from the result list.  (This makes sense in
database terms but some people find this strongly counter-intuitive.)
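
The shape of a complex query's result, including the dropping of
first-query matches that have no second-query matches, can be pictured
with this Python sketch.  The word and markable ids are invented.

```python
# Invented matches for the first query (possessive pronouns).
first_matches = ["w_my", "w_his", "w_their"]

# Invented "terminalat" pointers: markable id -> word id.
terminalat = {"m1": "w_my", "m2": "w_their", "m3": "w_their"}

results = []
for w in first_matches:
    # Second query: markables pointing at this particular word.
    sub = [m for m, target in terminalat.items() if target == w]
    if sub:  # a first-query match with no sub-matches is removed
        results.append((w, sub))

print(results)
```

Here w_his matches the first query but has no markable, so it does not
appear in the result at all.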

   ($n nt)($w word): ($n ^ $w) && (ID($w) == "s1_1")

Every data element has a unique id, which can be used in queries
in this way.  There's no reason you would want to do this except
when you can't figure out why a query is going wrong and want to
quickly find out whether a specific example is on the return list.

   ($w word): (TEXT($w) == "the")

This is how to query the orthography.  Posix regular expressions work
here, too.  

   ($w word): (TEXT($w) ~/the.*/)

keeping in mind that the regexp must match the entire string.

======================================================
	The Corpus Structure
======================================================

You can't write queries without understanding the structure of
the corpus.  First, we gloss the most important relationships
for an easy start, but the only way to get at everything is to
read the metadata file (and perhaps look at some of the data for
reassurance), so we also explain how to do that.

The corpus uses parenthood for the following relationships:
   
turn ^ parse ^ nt ^ (word | sil | trace | punc)
     with any number of levels of nt, usually starting at the top 
     with an nt with cat S.

It uses pointers for the following relationships:

markable >"at" nt
markable >"terminalat" word
   (the word cases are just possessive pronouns.  It is intended that
   every markable will have an "at" role or a "terminalat" role, but
   not both.)

movement >"source" nt
movement >"target" trace
OR
movement >"target-syn" nt

link >"antecedent" markable
link >"anaphor" markable

It uses the following attributes:

nt has cat (S, NP, VP, SBARQ, ...)
       subcat (SBJ, ...)

word has pos (VBZ, ...)

This list is *not* complete.  We think that everything in the original
Switchboard data has been preserved in some way.

To find out exactly what the corpus structure is, open 
swbd/swbd-metadata.xml.  (Many web browsers will make a display
for XML files that is easier to read than in, say, emacs, so do try
that first.)  

If you see

<code name="FOO">

then FOO is a valid data type.  The definition of FOO runs until
you see </code>, but

<code name="FOO"/>

is shorthand for

<code name="FOO">
</code>

If you see

<attribute name="BAR">

then the containing code has that attribute, so you can say

($n FOO):($n@BAR).

Enumerated attributes must choose a value from the given list;
otherwise they can be free-value strings or numbers.

The layer structure defines the permissible relations among data types
(or codes).  Each layer can be uniquely identified by name and defines
a set of data types that are interchangeable in the structure because
they can occur in the same positions.  A structural layer can point to
another layer, which means that elements with data types in the former
layer have children with data types drawn from the data types in the
latter layer.  If it points recursively, then there are any number
of layers of the former type ending in one of the latter type (this
is handy, say, for syntax).

If you see

<pointer number="BAR" role="BAZ" target="BAM"/>

then the containing code can have pointers with role BAZ where the
element pointed to has a data type drawn from the layer BAM.  BAR can
be an integer (pointer points to exactly that many elements), *
(points to zero or more; i.e. Kleene *), or + (points to one or more).
But I'm not sure how well the implementation enforces these number 
definitions.  The design is meant to restrict pointers to featural
layers, but the implementation is actually more flexible, with pointers
allowed anywhere.
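
Since enforcement of the number definitions is uncertain, one might
check them independently.  The helper below is a hypothetical Python
sketch of the three cases described above; it is not part of NXT.

```python
def pointer_count_ok(number, count):
    """Check an observed pointer count against a declared number value.

    number is the string from the metadata: an integer, "*" or "+".
    count is how many elements a pointer with that role points to.
    """
    if number == "*":
        return count >= 0   # zero or more
    if number == "+":
        return count >= 1   # one or more
    return count == int(number)  # exactly that many

print(pointer_count_ok("1", 1), pointer_count_ok("+", 0))
```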

Layers are themselves separated into codings.  This isn't very
important for this corpus (the codings are what allow for multi-rooted
trees) but it does tell you what file to look in for the elements
of a particular type; each coding is stored in a different XML file.
Where files need to refer to each other, they use stand-off annotation.

For instance, 
<nt cat="INTJ" nite:id="s1_500">
  <nite:child href="sw2065.terminals.xml#id(s1_1)" /> 
  <nite:child href="sw2065.terminals.xml#id(s1_2)" /> 
</nt>

means the nt dominates/has as children two elements, the ones
in the file sw2065.terminals.xml with the ids s1_1 and s1_2.

Domination can also be represented in a single file by containment:

<foo>
   <baz/>
</foo>

means foo dominates baz.

 <nite:child href="sw2065.terminals.xml#id(s1_1)..id(s1_5)" /> 

means *all* elements between s1_1 and s1_5 in the named file
regardless of type or id and is only defined if s1_1 and s1_5
are sisters under the same element.
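
For readers who want to inspect the stand-off links outside NXT, here
is a minimal Python sketch that reads the href values of nite:child
elements and splits them into file, first id and last id.  The
namespace URI is an assumption (check the xmlns:nite declaration in
your own files), the sample element combines the two href forms shown
above, and the sketch only parses the links; it does not open the
target file.

```python
import re
import xml.etree.ElementTree as ET

NITE = "http://nite.sourceforge.net/"  # assumed namespace URI

# Sample element combining a single-id href and a range href.
syntax_xml = f"""
<nt xmlns:nite="{NITE}" cat="INTJ" nite:id="s1_500">
  <nite:child href="sw2065.terminals.xml#id(s1_1)"/>
  <nite:child href="sw2065.terminals.xml#id(s1_2)..id(s1_5)"/>
</nt>
"""

# file#id(first) optionally followed by ..id(last)
HREF = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<first>[^)]+)\)(?:\.\.id\((?P<last>[^)]+)\))?$"
)

def parse_href(href):
    """Split an href into (file, first id, last id); a single id is its own range."""
    m = HREF.match(href)
    if m is None:
        raise ValueError(f"unrecognised href: {href}")
    return m.group("file"), m.group("first"), m.group("last") or m.group("first")

nt = ET.fromstring(syntax_xml)
links = [parse_href(c.get("href")) for c in nt.findall(f"{{{NITE}}}child")]
print(links)
```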

The file syntax for pointers is very similar; e.g.

 <markable nite:id="sw2062.markable.1" animacy="nonconc">
    <nite:pointer role="at" href="sw2062.syntax.xml#id(s1_502)" /> 
 </markable>

The metadata contains everything one needs to know about corpus 
structure, but some people find it easier to look at sample data
itself.

======================================================
	Advanced Examples
======================================================

---------------------------------------------------------
	Some fairly simple queries
---------------------------------------------------------

Find all non-boundary words inside turns:

   ($w word)($t turn): $t^$w && START($t)<START($w) && END($t)>END($w)

Find each turn whose start is later than its end (should have no results!):

   ($t turn): START($t)>END($t)

Find all words inside an NP that a markable with status "old" points at:

   ($np nt) ($m markable) ($w word): ($np@cat=="NP") &&
      ($m@status=="old") && ($np ^ $w) && ($m >"at" $np)

Find all phonwords containing an "ax" phoneme:

   ($pw phonword)(exists $ph ph): TEXT($ph)=="ax" && $pw^$ph

Find all phonwords containing an "ax" phoneme in an unstressed
syllable:

   ($pw phonword)(exists $s syllable)(exists $ph ph): TEXT($ph)=="ax" &&
      $pw^$s && $s@stress=="n" && $s^$ph


---------------------------------------------------------
	Word -> Phonword Mapping
---------------------------------------------------------

Being sourced from two versions of the transcript, the NXT Switchboard
corpus has to provide a link between the two and the different levels of
annotation they comprise. Linking the Penn and MS-State annotation
structures is an NXT pointer, which points from the word terminals to
a parallel annotation called 'phonword'. Users need to describe this
when constructing queries involving both transcripts. For instance, if
you wanted to extract all verbs and their primary stresses, the query
would look like this:

   ($w word)($pw phonword)($s syllable): ($w@pos ~ /V.*/) &&
      ($s@stress=="p") && ($w > $pw) && ($pw ^ $s)

That is, you need a separate variable for the word in each of the two
transcripts ($w and $pw), as well as representing the relationship
between them ($w > $pw). Users then need to be careful to query each
attribute and relationship on the right sort of word; e.g. here the
part-of-speech information (pos) is an attribute of word elements,
while syllable elements are situated underneath phonword elements.


---------------------------------------------------------
	Dialogue - Speaker - Topic
---------------------------------------------------------

A dialogue has pointers to a topic, and two speakers (with pointer
roles 'topic', 'A' and 'B').

For example, to search for the topics of dialogues where Northern
dialect is used:

   ($d dialogue)(exists $s speaker): $d>$s && $s@dialect="NORTHERN" :: ($t topic): $d>$t

The knowledge of which speaker is which within a dialogue is
inherent in the names of the pointers ('A' or 'B'); it cannot be
stored in the speaker elements as speakers appear in multiple
dialogues. Therefore it is necessary to distinguish the speakers
within the query if you wish to use this information:

    ($d dialogue)($sa speaker)($sb speaker): $d>"A"$sa && $d>"B"$sb

Find dialogues with same-sex speakers:

   ($d dialogue)($sa speaker)($sb speaker): $d >"A" $sa && $d >"B" $sb && $sa@sex == $sb@sex
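
The reason the A/B information lives on the pointer rather than on the
speaker can be seen in this Python sketch, where one speaker is A in
one dialogue and B in another.  All ids and attribute values are
invented placeholders.

```python
# Invented speakers; attribute values are placeholders, not corpus data.
speakers = {
    "spk1": {"sex": "FEMALE"},
    "spk2": {"sex": "MALE"},
    "spk3": {"sex": "FEMALE"},
}

# Each dialogue points at its speakers through the roles "A" and "B".
# Note that spk1 is A in d1 but B in d2.
dialogues = {
    "d1": {"A": "spk1", "B": "spk2"},
    "d2": {"A": "spk3", "B": "spk1"},
}

def same_sex_dialogues():
    """Dialogues whose A and B speakers have the same sex value."""
    found = []
    for d, roles in dialogues.items():
        sa, sb = speakers[roles["A"]], speakers[roles["B"]]
        if sa["sex"] == sb["sex"]:
            found.append(d)
    return found

print(same_sex_dialogues())
```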

---------------------------------------------------------
	Markables
---------------------------------------------------------

In the NXT data, "markables" were added automatically to form the basis
of the information structure and animacy codings.  The researchers involved
used a predecessor of the current Index command line utility, running it
using the following two queries, in order:


/* FIRST QUERY */
($n nt)(forall $up nt):
     (($n@cat == 'NP') or ($n@cat == 'WHNP')) /* match only NP or WHNP's */
     and                                      
     (not                                     /* without the banned subcats */
        (($n@subcat  ~ /.*ADV.*/) or          
         ($n@subcat ~ /.*LOC.*/) or 
         ($n@subcat ~ /.*TMP.*/) or 
         ($n@subcat ~ /.*DIR.*/) or 
         ($n@subcat ~ /.*UNF.*/)))                 /* (including UNF, or unfinished, as banned) */
     and                                       /* where in addition forall nts */
     ((($n != $up) and($up ^ $n)) ->               /* if it truly dominates the match then */
      ((not ($up@cat == 'EDITED')) and           /* it isn't EDITED (part of a disfluency) and */
       (not                                     
        (($up@cat == 'ADVP') and              /*  it isn't ADVP with a banned subcat */
         (($up@subcat ~ /.*LOC.*/) or 
          ($up@subcat ~ /.*DIR.*/) or                /* (note the ADVP can be UNF, we don't mind) */
          ($up@subcat ~ /.*TMP.*/)))) and  
       (not                                    /* and it isn't an NP with subcat TMP */
        (($up@cat == 'NP') and 
         ($up@subcat ~ /.*TMP.*/))) and                                   
       (not                                    /* and it isn't a PP with subcat TMP */
        (($up@cat == 'PP') and 
         ($up@subcat ~ /.*TMP.*/)))))

The first query means "NPs and WHNPs that aren't adverbial, locative,
temporal, directional, or unfinished, and where there aren't any
dominating nts marked as EDITED (that is, disfluent), as locative,
directional, or temporal ADVPs, or as temporal NPs or PPs."

/* SECOND QUERY  */
($w word)(exists $n nt)(exists $m markable)(forall $up nt):
     ($w@pos = 'PRP$') and ($n ^ $w) and ($m >'at' $n) and 
     (($up != $n) -> (not (($n ^ $up) and ($up ^ $w))))

The second query means "Possessive pronouns where the first nt you get
to by climbing up counts as a markable."

The "active" codes similarly were added automatically, based on the syntactic
coding.

---------------------------------------------------------
	Disfluency Structure
---------------------------------------------------------

When querying the disfluency format, users should be aware that
disfluencies can be nested. Thus to retrieve all the terminals within
a disfluency it is necessary to look for all terminals which a
disfluency dominates, not just those it directly dominates. For
example, the query:

   (exists $d disfluency)($w word): $d^$w

will find all word terminals which are part of a disfluency. The same
thing without the existential quantifier:

   ($d disfluency)($w word): $d^$w

will find all word - disfluency pairs where the former is
contained in the latter. Note that this will likely produce duplicate
words, due to the nesting of disfluencies.

Find repairs containing the word "but":

    (exists $d disfluency)(exists $w word)($rep repair): $w@orth="but" && $d^$rep && $rep^$w


---------------------------------------------------------
	Use of SearchAndFilter
---------------------------------------------------------

The NXT tool SearchAndFilter allows you to query the data, and then
output attributes of the variables queried in tab delimited format:

   java SearchAndFilter -corpus swbd-metadata.xml -observation sw2145 \
      -query '($w word) ($pw phonword) ($a accent): ($w@pos ~ /V.*/) &&
              ($a@type=="nuclear") && ($w > $pw) && ($a > $pw)' \
      -filter '$w@orth' '$w@nite:start' '$w@nite:end' '$a@nite:start'
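
When the command is launched from a script, shell quoting of the query
can be avoided by passing the arguments as a list.  The Python sketch
below only assembles the argument list from the flags shown above; the
verb pattern is written /V.*/ (as in the word/phonword example earlier
in this file) so that it matches complete tags beginning with V.
Classpath setup and the actual launch are not shown.

```python
import shlex

# The query is one argument, so no shell escaping of $ or quotes is needed.
query = (
    '($w word) ($pw phonword) ($a accent): ($w@pos ~ /V.*/) && '
    '($a@type=="nuclear") && ($w > $pw) && ($a > $pw)'
)

# Attributes to output, in the order given to -filter.
filters = ["$w@orth", "$w@nite:start", "$w@nite:end", "$a@nite:start"]

argv = [
    "java", "SearchAndFilter",
    "-corpus", "swbd-metadata.xml",
    "-observation", "sw2145",
    "-query", query,
    "-filter", *filters,
]

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(argv))
```

The list could then be handed to subprocess.run(argv) once Java and the
NXT classes are available on the machine.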