--------------
PROSODY FILES
--------------

This file gives information about how the prosody files (accents, phrases, breaks) relate to their source files, and known outstanding issues with the conversion process.

Please read the relevant sections (especially s5.2) of Calhoun et al. (submitted) for a general overview of this process BEFORE reading this file. It may also be helpful to look at the corpus website: http://groups.inf.ed.ac.uk/switchboard/

-------------
SOURCE FILES
-------------

There are 3 distinct types of source file for the xml prosody files. For each accent, phrase and break file, the type of source file used to generate the xml is indicated in a comment at the top of the xml, e.g. <!-- File generated from the UW files. -->:

EDINBURGH ORIGINAL: These files were annotated manually by a team at Edinburgh/Stanford using their own ToBI-inspired standards. The source files were Praat textgrids with word (MS-State transcript), accent and phrase tiers. Note that for these files ONLY, not all words in the file are annotated for prosody: only words in sentences that are also annotated for kontrast have prosody annotation.

UW: Annotated manually by a team at the University of Washington, led by Mari Ostendorf using ToBI-lite standards (which differ somewhat from the Edinburgh/Stanford standards). The source files consisted of separate wordlist (including accent info), break and tone files generated by X-waves which are associated by timing information. ALL words in these conversations are annotated for prosody.

EDINBURGH CONVERTED: These files were originally annotated manually by UW and were then manually converted to the Edinburgh/Stanford standards by a team at Edinburgh/Stanford. All words in these conversations are annotated for prosody.

--------
ACCENTS
--------

All accents have a "strength" attribute (weak/full). 'Weak' corresponds to "?" in the UW annotations. Only Edinburgh Original and Edinburgh Converted files have a "type" attribute (plain/nuclear/pre-nuclear). The time the accent was marked at (usually the peak) is both nite:start and nite:end.

Accents 'point' at phonwords (words in the MS-State transcript). For the Edinburgh Original and Converted files, these pointers were determined from a manually-marked index linking the accent to a particular word (so the accent time can be before or after the word boundary times). All accents from all source files were included successfully and all have a word pointer.

For the UW files, the word pointer was determined automatically, i.e. it was set to be the word which the accent fell within according to its start and end time-stamps. There are potential errors in the output from this process, where the accent time was not marked within the boundaries of the word it was intended to be associated with. This could have been because the accent peak was actually before or after the word boundary (e.g. a late-rising peak). However, other errors could have resulted because it seems that the underlying transcript used by UW had slightly different word-timing information to the MS-State transcript (the phonwords). This could have led to false associations between accents and phonwords. It is not known how often this occurred.
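The automatic pointer assignment for the UW files can be sketched as follows. This is a minimal illustration only, not the conversion script itself: the Phonword type, the word_for_accent function and the inclusive treatment of word boundaries are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Sequence


class Phonword(NamedTuple):
    word_id: str
    start: float  # seconds
    end: float    # seconds


def word_for_accent(accent_time: float,
                    words: Sequence[Phonword]) -> Optional[Phonword]:
    """Return the first phonword whose time span contains the accent time."""
    for word in words:
        # Assumption: both boundaries are treated as inclusive.
        if word.start <= accent_time <= word.end:
            return word
    return None


words = [Phonword("w1", 0.00, 0.30), Phonword("w2", 0.30, 0.75)]

# A late peak intended for w1 but marked at 0.32s is attributed to w2.
print(word_for_accent(0.32, words))
```

The example shows the kind of false association described above: a peak that belongs to the first word but is marked just after its end is attributed to the second word.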

For all files, there may be cases where a single phonword has two accents pointing at it. For the Edinburgh files, these are genuine cases where the annotator judged the word to have two accents. For the UW files, these may not be genuine for the reasons given above.

--------
PHRASES
--------

For the Edinburgh Original and Converted annotations, prosodic phrasing was conceptualised in terms of marking "phrases", rather than "breaks" (as is usual in ToBI). That is, groups of words were manually annotated as belonging to a "phrase", which was characterised as having a particular type according to the kind of break following the last word in the phrase, as follows:

[123]p = 'disfluent'
X = 'backchannel'
3 = 'minor'
4 = 'major'

Break indices below 2p/3 were not marked and no distinction was made between variations of -p (e.g. 1p, 2p, 3p) and 3 (e.g. 3-) or 4.

These phrases have phonword 'children' which were determined straightforwardly from the manual annotation.

For the UW files, phrases were determined automatically from the XML break (see description of break generation below) and phonword files. Every element (phonword, noise, laughter...) in each phonword file was included in a phrase, according to the following algorithm: the first element in the phonword file started the first phrase. Elements were then added to the same phrase until at least one of these applied:

- the element had a break parent of the type 2p, 3p, 3, 3-, 4, 4- or X
- the following element in the phonword file was of a different type, e.g. phonword | laughter, phonword | noise, noise | phonword (| = phrase end)
- the start time of the following element (of the same type) was more than 50ms after the end time of the current element
- it was the last element in the phonword file

If the element met any of those criteria it was the last element in that phrase. The next element in the phonword file started the next phrase.

For cases where the phonword had 2 break parents (see below), if at least one of these was "phrase-ending" (i.e. in the list above, 2p, 3p, 3...), then the phrase was ended. 
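The phrase-building algorithm for the UW files can be sketched as follows. This is an illustration under stated assumptions, not the original conversion code: the Element type, its field names and the two functions are hypothetical, and times are assumed to be in seconds.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple

PHRASE_ENDING = {"2p", "3p", "3", "3-", "4", "4-", "X"}
MAX_GAP = 0.050  # 50ms, expressed in seconds


class Element(NamedTuple):
    kind: str                     # e.g. "phonword", "noise", "laughter"
    start: float
    end: float
    breaks: Tuple[str, ...] = ()  # indices of the break parents (0, 1 or 2)


def ends_phrase(current: Element, following: Optional[Element]) -> bool:
    """Apply the four phrase-ending criteria to one element."""
    if following is None:                          # last element in the file
        return True
    if any(b in PHRASE_ENDING for b in current.breaks):
        return True                                # phrase-ending break parent
    if following.kind != current.kind:             # change of element type
        return True
    return following.start - current.end > MAX_GAP  # gap of more than 50ms


def split_into_phrases(elements: Sequence[Element]) -> List[List[Element]]:
    phrases, current = [], []
    for i, element in enumerate(elements):
        current.append(element)
        following = elements[i + 1] if i + 1 < len(elements) else None
        if ends_phrase(element, following):
            phrases.append(current)
            current = []
    return phrases


elements = [
    Element("phonword", 0.00, 0.20, ("1",)),
    Element("phonword", 0.21, 0.40, ("4",)),  # ends a phrase (break 4)
    Element("phonword", 0.41, 0.60),          # ends a phrase (type change)
    Element("laughter", 0.61, 0.90),          # ends a phrase (type change)
    Element("phonword", 0.91, 1.10),          # ends a phrase (gap > 50ms)
    Element("phonword", 1.30, 1.50),          # ends a phrase (last element)
]
print([len(p) for p in split_into_phrases(elements)])
```

Holding the break parents in a tuple means the case of two break parents needs no special handling: the phrase ends if at least one of them is phrase-ending.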

The phrase type was determined using the following mapping. Where the last word in the phrase did not have a break parent, no phrase type was entered.

2p, 3p = 'disfluent'
3, 3- = 'minor'
4, 4- = 'major'
X = 'backchannel'

For cases where the phonword had 2 break parents, if just one of them was "phrase-ending", the phrase type was taken from that break. For cases where both breaks were "phrase-ending", if they both mapped onto the same kind of phrase (e.g. X and X, or 3p and 2p), then that phrase type was used. If they mapped onto different phrase types, the following order of preference was used (e.g. if 2p and 4, use 'disfluent'), as a manual inspection of a sample of these cases indicated this best captured the correct phrase type:

1) 2p, 3p
2) 4, 4-
3) 3, 3-
4) X

There were a total of 28 cases of words with 2 "phrase-ending" break parents which mapped onto different kinds of phrase types.
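The mapping and the order of preference above can be expressed compactly. The following is a minimal sketch; the names PHRASE_TYPE, PREFERENCE and phrase_type are illustrative and do not come from the conversion scripts.

```python
from typing import Iterable, Optional

# Break index -> phrase type, as in the mapping given above.
PHRASE_TYPE = {
    "2p": "disfluent", "3p": "disfluent",
    "3": "minor", "3-": "minor",
    "4": "major", "4-": "major",
    "X": "backchannel",
}

# Order of preference when two break parents map to different phrase types.
PREFERENCE = ["disfluent", "major", "minor", "backchannel"]


def phrase_type(break_indices: Iterable[str]) -> Optional[str]:
    """Return the phrase type for the break parents of a phrase-final word."""
    candidates = {PHRASE_TYPE[b] for b in break_indices if b in PHRASE_TYPE}
    for ptype in PREFERENCE:
        if ptype in candidates:
            return ptype
    return None  # no phrase-ending break parent: no phrase type entered


print(phrase_type(["2p", "4"]))
print(phrase_type(["1", "3-"]))
print(phrase_type([]))
```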

We also did an automatic check of files in the UW set to see if any had incomplete prosody annotation, as some files in the whole UW set were known to. We checked if there was a sequence of 5 or more phonwords at the end of the file without a break parent. There were no files like this so it appears all the files in the final UW set are complete.
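A check of this kind could look like the following sketch. It assumes that each file has already been reduced to a list of booleans, one per phonword in file order, saying whether that phonword has a break parent; the function name and this input format are hypothetical.

```python
from typing import Sequence


def looks_incomplete(has_break_parent: Sequence[bool], min_run: int = 5) -> bool:
    """True if the file ends with min_run or more phonwords without a break parent."""
    run = 0
    for flag in reversed(has_break_parent):
        if flag:
            break
        run += 1
    return run >= min_run


print(looks_incomplete([True, True, False, False, False, False, False]))
print(looks_incomplete([True, False, False, True]))
```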

-------
BREAKS
-------

Breaks are included as a separate layer in the xml in an attempt to capture the richer break information in the UW files. Break files are therefore generated from the UW files where these are available (i.e. for both the UW files and Edinburgh Converted files the UW break information is used). It is anticipated that users will use either the phrase or the break files to do queries depending on their needs, not both. In particular, information in the phrase and break files for Edinburgh Converted conversations may clash as they are generated from different sources.

Breaks 'point' at the phonword they follow. To achieve this, for all UW files, the break was first aligned with a word-end boundary in the MS-State transcript (phonwords). This process proved considerably more difficult than aligning tones with word boundaries, as breaks were closer together and differences in timing information between the MS-State transcript and that used by UW seemed to be greater phrase-medially than at phrase-ends. The matching was done in two stages: firstly a "near-exact" match was tried, where the break index time and phonword end-time were very close together. Then a more "fuzzy" match was tried, where a break was considered associated with a word if it fell somewhere between the start and end time for that word. There may be some errors where this matching process resulted in the break being associated with the word following or before the one intended. It is not known how often this happens.
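The two-stage matching can be sketched as follows. This is an illustration only: the tolerance used for the "near-exact" stage is not given in this file, so the value below is a hypothetical placeholder, as are the Phonword type and the match_break function.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Phonword(NamedTuple):
    word_id: str
    start: float  # seconds
    end: float    # seconds


# Hypothetical tolerance for a "near-exact" match; not stated in the source.
NEAR_EXACT_TOLERANCE = 0.02


def match_break(break_time: float,
                words: Sequence[Phonword],
                tolerance: float = NEAR_EXACT_TOLERANCE
                ) -> Tuple[Optional[Phonword], str]:
    """Match a break to a phonword, trying a near-exact match before a fuzzy one."""
    # Stage 1: the break time is very close to a phonword end time.
    if words:
        nearest = min(words, key=lambda w: abs(w.end - break_time))
        if abs(nearest.end - break_time) <= tolerance:
            return nearest, "near-exact"
    # Stage 2: the break falls between the start and end time of a phonword.
    for word in words:
        if word.start <= break_time <= word.end:
            return word, "fuzzy"
    return None, "unmatched"


words = [Phonword("w1", 0.00, 0.30), Phonword("w2", 0.30, 0.75)]
for t in (0.31, 0.50, 0.90):
    word, stage = match_break(t, words)
    print(t, word.word_id if word else None, stage)
```

The break at 0.50s illustrates where errors can arise: it is assigned to w2 by the fuzzy stage, although the annotator may have intended it to follow w1.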

One break could not be matched with a phonword because the word identity conflicted:
[4152B UW] ERROR orthography mismatch MSW/BRK (break 235 at 298.898875): 'g[uilt]-'     'deciding'

There are also cases where two breaks associate with the same word, and therefore have the same NITE time. We believe that many of these cases arose because the underlying word was represented as two words in the transcript used by UW, and one in the MS-State transcript, e.g. "TI" versus "T" "I". Users need to account for this in interpreting query results.
 
The nite:start and nite:end times given for each break represent the end time of the phonword which the break points at. Because the nite time and the timestamp in the original UW files are often quite different, we have also included the original timestamp in the UW files in a "UWtime" attribute. In cases where two breaks point at the same word, these breaks will have different UWtimes, reflecting the different break times in the source files. Note that in some of these cases the UWtimes do not appear in temporal order in the break files. Users may wish to consider using the UWtime to assess the accuracy of the break/phonword association.
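One way of using the UWtime for this purpose is to flag breaks whose nite:end and UWtime values differ by more than some threshold. The sketch below assumes these values have already been read from a break file into (id, nite:end, UWtime) tuples; the function name and the 100ms default threshold are hypothetical and should be chosen to suit the query at hand.

```python
from typing import Iterable, List, Tuple


def timing_discrepancies(breaks: Iterable[Tuple[str, float, float]],
                         threshold: float = 0.1) -> List[Tuple[str, float]]:
    """Return (break id, |nite:end - UWtime|) for large differences, largest first."""
    flagged = []
    for break_id, nite_end, uw_time in breaks:
        difference = abs(nite_end - uw_time)
        if difference > threshold:
            flagged.append((break_id, difference))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


breaks = [("b1", 1.00, 1.01), ("b2", 2.00, 2.40), ("b3", 3.00, 2.80)]
for break_id, difference in timing_discrepancies(breaks):
    print(break_id, round(difference, 3))
```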

The break index type from the UW files is entered in an "index" attribute.

Because tones are supposed to associate exactly with breaks in the ToBI system, we decided to include phrase accent and boundary tone information on break elements, rather than as a separate layer in the xml. This should also assist querying.

Tones were first associated with MS-State word boundaries as described above for phrases. These were then associated with breaks on the basis of their timing information (breaks having also been associated with MS-State word boundaries). Most tones were matched with breaks in this way. However, this matching failed when the tone had not been associated with a MS-State boundary in the first place. Unmatched tones were then matched by comparing the original UW tone time with the break time (the end of the phonword). Remaining tones which could not be matched this way were matched by allowing a 500ms "window" around the MS-State word boundary end time. In all cases, tones could only be matched to 3, 4 and X breaks (and variants).

The resulting association of tones with breaks has been reasonably successful. However, there are approximately 77 tones across all UW conversations which could not be matched with a break using this algorithm, and which are hence not represented in the xml files. These are listed in UNMATCHED_TONES.txt. Sources of unmatched tones include: where the original alignment of the tone to the MS-State word was incorrect, so the match to the break is not found; where two tones appear to align with the same break and only one is captured in the xml; where a tone was marked on a "disallowed" break, e.g. 2, 1, 2p. It is possible that the number of unmatched tones could be reduced with further improvements to the algorithm used. There may also be some errors where two eligible breaks are close together and the tone is matched with the wrong break.
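The final, window-based stage of the tone matching might be sketched as follows. Several details are assumptions made for the example rather than facts from the conversion scripts: the window is taken to be 250ms either side of the break time (500ms in total), the set of eligible indices is a guess at what "3, 4 and X breaks (and variants)" covers, and the nearest eligible break is chosen when more than one falls within the window.

```python
from typing import Iterable, Optional, Set, Tuple

# Assumption: the exact set of eligible variants is not spelled out in this file.
ELIGIBLE = {"3", "3-", "4", "4-", "X"}


def match_tone(tone_time: float,
               breaks: Iterable[Tuple[str, str, float]],
               half_window: float = 0.25,
               eligible: Set[str] = ELIGIBLE) -> Optional[str]:
    """Return the id of the nearest eligible break within the window, or None.

    breaks holds (break id, break index, break time in seconds) tuples.
    """
    candidates = [
        (abs(time - tone_time), break_id)
        for break_id, index, time in breaks
        if index in eligible and abs(time - tone_time) <= half_window
    ]
    return min(candidates)[1] if candidates else None


breaks = [("b1", "1", 1.00), ("b2", "4", 1.20), ("b3", "2p", 2.00)]
print(match_tone(1.05, breaks))  # b1 is closer but has a disallowed index
print(match_tone(2.02, breaks))  # only a disallowed break is nearby
```

The second call shows how a tone marked on a "disallowed" break ends up unmatched.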

The tone information itself was included in "phraseTone" and "boundaryTone" attributes, according to the following mapping (note that some very infrequent tones have been omitted - the disfluency markers 2p and r are already captured in the separate disfluency xml layer and break index attribute):

source  ptone   btone
!H-     !H
H-      H
!H-H%   !H      H
H-H%    H       H
!H-L%   !H      L
H-L%    H       L
L-      L
L-H%    L       H
L-L%    L       L
-X?     X?
X-?     X?
X%              X
X%?             X?
X-X%    X       X
X-X%?   X       X?
<       ignore
-?      X?
2p%     ignore
%H      ignore
H-L%_X%         X
%r      ignore
X%_X%           X
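For users who want to apply the same mapping in their own scripts, the table can be written as a lookup. This is a sketch based on our reading of the column positions in the table above; the names TONE_MAP, IGNORED and tone_attributes are illustrative only.

```python
from typing import Dict, Optional, Tuple

# source label -> (phraseTone, boundaryTone); None means the attribute is absent.
TONE_MAP: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "!H-": ("!H", None),
    "H-": ("H", None),
    "!H-H%": ("!H", "H"),
    "H-H%": ("H", "H"),
    "!H-L%": ("!H", "L"),
    "H-L%": ("H", "L"),
    "L-": ("L", None),
    "L-H%": ("L", "H"),
    "L-L%": ("L", "L"),
    "-X?": ("X?", None),
    "X-?": ("X?", None),
    "X%": (None, "X"),
    "X%?": (None, "X?"),
    "X-X%": ("X", "X"),
    "X-X%?": ("X", "X?"),
    "-?": ("X?", None),
    "H-L%_X%": (None, "X"),
    "X%_X%": (None, "X"),
}

# Source labels marked "ignore" in the table.
IGNORED = {"<", "2p%", "%H", "%r"}


def tone_attributes(label: str) -> Optional[Dict[str, str]]:
    """Return the tone attributes for a source label, or None if it is ignored."""
    if label in IGNORED:
        return None
    ptone, btone = TONE_MAP[label]  # raises KeyError for labels not in the table
    attributes = {}
    if ptone is not None:
        attributes["phraseTone"] = ptone
    if btone is not None:
        attributes["boundaryTone"] = btone
    return attributes


print(tone_attributes("L-H%"))
print(tone_attributes("X%"))
print(tone_attributes("%r"))
```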

Finally, for completeness, breaks were generated for the Edinburgh Original conversations automatically from the phrase information. That is, for the final word in each phrase a break was entered. The nite:start and nite:end for that break were the same as the nite:end for the original phrase, and the break index was determined automatically from the phrase type using the table in the phrase section above. Break information for the Edinburgh Original annotations is accurate as far as we are aware. Note, however, that a much coarser break annotation was used (i.e. only 2p, 3, 4, and X), and there is no tone information.
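The generation of breaks from phrases can be sketched as follows. The input format (phrase type, nite:end of the phrase, id of the final word) and the dictionary keys used for the output are assumptions made for the example; the index values follow the coarse set of 2p, 3, 4 and X mentioned above.

```python
from typing import Dict, Iterable, List, Tuple

# Phrase type -> break index, inverting the Edinburgh phrase table.
BREAK_INDEX_FOR_PHRASE = {
    "disfluent": "2p",
    "minor": "3",
    "major": "4",
    "backchannel": "X",
}


def breaks_from_phrases(phrases: Iterable[Tuple[str, float, str]]) -> List[Dict]:
    """Create one break per phrase, pointing at the final word of the phrase."""
    breaks = []
    for phrase_type, phrase_end, last_word_id in phrases:
        breaks.append({
            "index": BREAK_INDEX_FOR_PHRASE[phrase_type],
            "start": phrase_end,  # nite:start equals the nite:end of the phrase
            "end": phrase_end,    # nite:end equals the nite:end of the phrase
            "word": last_word_id,
        })
    return breaks


phrases = [("minor", 1.25, "w7"), ("disfluent", 2.10, "w11"), ("major", 3.40, "w15")]
for b in breaks_from_phrases(phrases):
    print(b)
```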

----------
PROSNOTES
----------

These files exist for the Edinburgh original and converted files only. They contain notes made by annotators at Edinburgh/Stanford in a "Notes" tier while doing the prosody annotation. These files simply record the time the note was marked, and what the note was. Since these notes could refer to the accent tier, the phrase tier, or even the nature of the sound file itself (e.g. bad sound quality here), we have not attempted to align this layer with any of the other xml layers. This layer is included primarily for the sake of the historical record, and is anticipated to only be useful if future users of the corpus wish to revise the prosody annotation itself using these files.

--------------------------
Sasha Calhoun
20 March 2009