HCRC Map Task Editorial Conventions and Markup Structure

Henry S. Thompson
UK Economic and Social Research Council &HCRC.dist;

Electronic original

Plain ascii text, with spaces and tabs used for formatting

Considerable care has been taken to make the form of the transcripts as simple, regular and useful as possible. The invariants observed are described here, and the various sorts of markup used are explained.

The transcripts consist of a sequence of turns, separated by blank lines. Each turn is composed of markup lines and text lines. Markup lines always begin with left angle bracket (<). Text lines always begin with either an alphabetic character (a-z, A-Z) or a left brace ({). Text lines never begin with left angle bracket.

I. Text properties

Leaving aside markup for the time being, the text observes the following invariants:

Text is composed of words, separators, punctuation, discontinuities and microtags.

Words are composed of alphabetic characters (a-z, A-Z), internal hyphens (-) and internal or final apostrophes (').

The only separator characters used are space ( ) and new-line, for which linefeed, character code 10, is used.

The only punctuation characters used are comma (,), full stop (.), question mark (?) and exclamation mark (!).

A discontinuity is notated with three full stops (...).

A microtag consists of an open brace ({), a two letter tag type, a vertical bar (|), its contents (a sequence of words and/or microtags) and a close brace (}), e.g. {ab|partia}.

Normal orthographic conventions are observed, that is, words are separated by spaces or new-lines, with commas adjacent to the preceding word and followed by space or new-line. Sentences begin with an upper-case letter, and end with full stop, question mark or exclamation mark. However, not all turns end with sentence-final punctuation, either because of overlap (see below) or because in the transcriber's judgement the turn ended in mid-sentence. For similar reasons, not all turns begin with an upper case letter.

Capitalisation indicates either proper name, 1st person singular pronoun and/or sentence start.
Sentence start is a subjective matter, determined by transcribers as they saw fit. It occurs in four places:

  1) at the beginning of turns (sometimes, i.e. turns may start with a capital letter, but need not);
  2) after sentence end (marked with period, question mark or exclamation mark and two spaces) (always);
  3) after an unfilled pause (marked with three-dots and a space) (sometimes) or rarely a filled pause (marked with a microtag);
  4) after an abandoned word which also marks the abandonment of a sentence (marked with an 'ab' microtag) (sometimes).

There are no tab characters and no line-final or line-initial spaces.

Leaving aside "...", full stop, question mark and exclamation mark occur only sentence finally, that is, followed by two spaces or new-line.

Two spaces in a row occur only following sentence-final punctuation. More than two spaces in a row never occur.

The following characters do not appear at all: ; : @ # $ % ^ ( ) [ ] + \ " ` ~ /

The following characters appear only within curly braces: = |

Hyphen appears only within words, that is, surrounded by alphabetics. In particular, it is NOT ever followed by end-of-line.

Single letters (other than as the words "a" and "I") and digits occur only within {}s.

Single quote appears only for apostrophe, within or at the end of words.

II. Microtags

Microtags are used to annotate portions of the transcript which for some reason merit setting off from ordinary running text/speech. All microtag types are lowercase, as follows:

  ab  abandoned        for words broken off before their natural end. Standard orthography is used.
  br  broken           for words broken off in the middle, then completed.
  ci  cited word       for feature names printed on the maps when evidently mentioned, not used.
  fp  filled pause     see below
  gg  'grunt'          see below
  fg                   see below
  ip  initial partial  for words intentionally missing their first syllable
  le  letter name      for single letters or digits, where their names were spoken
  ph  phonetic         for non-standard non-words, or actual words with errors or extreme perturbations in pronunciation. Informal phonetic spelling is used.
  rp  repeated         for the repetition of less than a complete word.

In the case of br, ip and sometimes ph, the actual contents are followed by an equal sign (=) and the intended or full word involved, e.g. {br|dir rectly=directly}.

Capitalisation, spacing and punctuation for microtags is on the basis that they are really part of the transcription. That is, if you delete the braces, vertical bar, equal sign and EITHER what is left or right of the equal sign, you will get text which conforms to the other invariants, e.g.

  "the {le|c} in {ci|crane bay}. {gg|Uh-huh}." --> "the c in crane bay. Uh-huh."

Note that although in a few places transcribers used paired quotation marks, none remain, as all were either converted to {ci ...} as they marked feature names, or, in a very few cases, were removed altogether, as they were marking either some form of emphasis or some form of direct speech, e.g. "you said 'draw around the cottage'", but as this had not been done at all consistently, it was felt better to do without.

In two cases, namely /tIl/ for "until" and /kuz/ for "because", we have felt that some recognition of elided first syllables was necessary, so we have used the ip microtag as follows whenever these occur: {ip|til=until}, {ip|cause=because}.

In the area of filled pauses and 'non-words' such as "uh-huh", we have of necessity been somewhat arbitrary. Three micro-tags are used:

  fp  Filled pause, almost always occurring in mid-stream.
      Inventory: ehm, erm, er, och, uh

  gg  Grunts, almost always occurring as utterances on their own or as adjuncts. Really no different from e.g. "yes", "no", "okay", except in phonology and 'official' status.
      Inventory: aha, hmm, mm-mm, mmhmm, oops, phew, ugh, uh-huh, uh-uh, whoops

  fg  Items ambivalent between the above two, i.e. occurring either as independent utterances/adjuncts or as filled pauses.
      Inventory: ah, eh, mm, oh, oo, um

For example, as an utterance and adjunct: "{fg|Oh}, I see. {fg|Oh}." and as a filled pause: "it's going to be, {fg|oh} let me see"

Neither this three-way distinction, nor the apparent phonetic content of the items themselves, should be taken too seriously, as there is considerable phonetic overlap between e.g. ehm, erm and um, and for instance {fg|oh} occurs mostly as either an utterance on its own or as part of e.g. "{fg|Oh} [expletive]", only occasionally as a clear filled pause.

The microtag 'ph' has been used for quasi-phonetic spelling, nearly always one-off. These fall into three broad categories:

  1) expressive pronunciation, e.g. rrrright, you'llllll, which are usually given exact versions following the equal sign;
  2) slips of the tongue, e.g. spaceslip, springbook, again with exact glosses provided;
  3) miscellaneous noises, e.g. bddllpp, ssh, uu-uu, prrr.

III. Words

In general a wide range of phonetic variation is concealed in the use of standard orthography. In two cases, "gonna" and "kinda", where a semi-standard orthography for a common fast-speech variant exists, and many transcribers have used it, this has been allowed to stand. However, no consistency checking has been done, and there is every likelihood that the range of pronunciations covered by these transcriptions overlaps significantly with those transcribed "going to" and "kind of".
Similarly, with Scottish dialect spellings such as "gonnae", "wouldnae", "cannae", "doon", there is no guarantee that in particular instances or in general these are reliably distinct from instances transcribed with "going to", "gonna", "wouldn't", "cannot" or "down", although there is certainly a dialect-conditioned difference here to be appealed to. All of these cases show up in the non-standard spellings list (etc/oddwords) (see below).

On the other hand, more ad-hoc indications of fast speech, short of genuine mish-mash annotated with the {ph ...} microtag, have been regularised. Thus transcriptions (all quite rare) such as "gotta", "dunno", "int'it", "d'you" and "s'okay" have been replaced with "got to", "don't know", "isn't it", "do you" and "that's okay".

Also note that different transcribers used "yep", "yup" and "yeah" for what is broadly the same affirmative 'word', which has been uniformly presented here as "yeah".

The file etc/oddwords gives a complete list of the words (or in a few cases the roots of words) which appear in the transcripts which are in some way non-standard, giving in each case some gloss or explanation. In particular, it contains all words which are thrown out by UNIX(TM) spell -b.

As an aid to further processing, the file etc/maptask.spl is a compressed hashed spelling list, suitable for passing to spell with the -d switch, which includes all the necessary additions to /usr/lib/spell/hlistb to allow the text of all the transcript files to pass through spell with no errors, aside from the contents of the micro-tags.

A sequence of three periods "..." is used to indicate unfilled pauses and/or significant disjuncture. As the latter is a pre-eminently subjective phenomenon, transcribers varied in their use of "...", and it should be assumed neither that all pauses and disjunctures are marked, nor that "..." has a uniform interpretation across all transcripts.
Lest the above seem very weak, it should be emphasised that the transcriptions are intended to serve at least two distinct purposes:

On the one hand, they should allow a relatively superficial reading to give a rough impression of what was said, and of HOW it was said. It is for this goal that "..." and indeed the use of standard orthographic punctuation of the material into clauses and sentences is included.

On the other hand, the transcripts are an indication of the words uttered and an index into the sampled audio. For these purposes, and for any serious linguistic investigation, all punctuation and case-shifting in the transcripts should be removed or ignored, and even lexical identity taken with a grain of salt.

IV. Markup

Although we have used an early draft of the TEI P2 chapter on the transcription of spoken material (Fascicle 34) and the DTD included therein, we have actually made use of very few of the tags defined there, to the point where those not interested in TEI or SGML can understand all they need to on the basis of the brief descriptions provided hereafter.

The 'text' tag surrounds the entire transcript, and gives the conversation id as the value of the 'id' attribute.

There are four structural tags within the transcript: 'u', 'sfo', 'bo' and 'eo'.

The transcript is composed of a sequence of turns, delimited by 'u' tags, which indicate the talker as the value of the 'who' attribute (G for Giver, F for Follower), and the utterance number within the conversation as the value of the 'n' attribute.

Note that as each conversation has an id, and each turn has a number, to refer to an individual turn in a standard way, use MTC1:conv-id:turn-no, e.g. MTC1:q4nc3:32 is the Instruction Follower saying "No. Sorry." This is formally given as the Reference System for the corpus in the corpus header 'refsDecl' section.
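
The reference format MTC1:conv-id:turn-no can be built and taken apart mechanically. The following is a minimal Python sketch, not part of the corpus tools; the names TurnRef, parse_turn_ref and format_turn_ref are hypothetical, and the assumption that the corpus prefix and conversation ids consist only of letters and digits is ours, based on the single example above.

```python
import re
from typing import NamedTuple


class TurnRef(NamedTuple):
    """One turn reference of the form corpus:conv-id:turn-no (hypothetical helper type)."""
    corpus: str
    conv_id: str
    turn_no: int


# Assumption: corpus prefix and conversation id are alphanumeric, turn number is decimal.
_REF = re.compile(r"^([A-Za-z0-9]+):([A-Za-z0-9]+):(\d+)$")


def parse_turn_ref(ref: str) -> TurnRef:
    """Split a reference such as 'MTC1:q4nc3:32' into its three parts."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"not a turn reference: {ref!r}")
    return TurnRef(match.group(1), match.group(2), int(match.group(3)))


def format_turn_ref(conv_id: str, turn_no: int, corpus: str = "MTC1") -> str:
    """Build a reference string from a conversation id and a turn number."""
    return f"{corpus}:{conv_id}:{turn_no}"


ref = parse_turn_ref("MTC1:q4nc3:32")
print(ref.conv_id, ref.turn_no)
print(format_turn_ref("q4nc3", 32))
```

Keeping the turn number as an integer makes it straightforward to match against the 'n' attribute of a 'u' tag once that has been read.
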
Many turns also have a 'sfo' tag, for Speech File Offset, which provides a pointer into the sampled speech file for the transcript, via a sample number which is the value of the 'samp' attribute. These pointers are not exact, in that they were semi-automatically produced on the basis of a silence detector, but they are appropriate for positioning playback in an area of interest as determined by the transcript (see the description of the player utility tool in the src directory).

The 'bo' and 'eo' tags, which always occur in pairs with matching values for the 'id' attribute, delimit regions of overlapping speech as marked by the transcribers. The transcription within the delimited region cannot be relied on as regards the sequencing of the two talkers. We have not provided any more detailed transcription of overlap, and indeed only included the bo/eo marking as rough guidelines as to regions to be avoided by those interested only in clean single stream speech. Given that the overlap marking is usually confined to whole turns, it is certainly not the case that all speech within the delimited regions is overlapped, and contrarily, given the fallibility of transcription and the variation in opinion as to what constitutes overlap which exists across transcribers, not all areas outside the delimited regions are free from overlap. Here as in many other cases we provide what we have found useful, given the level of resources available, but certainly would not claim it represents a definitive statement.

We have used the non-SGML/TEI microtag notation for two reasons: to facilitate the readability of the transcripts as they stand by making markup within turns as light-weight as possible, and because the TEI approach turns what we think should remain as text into attribute values. Thus, for example, where we have {fp|hmm}, the TEI would have used the 'vocal' tag with "hmm" as the value of the 'desc' attribute.
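
The microtag notation described in section II, together with the rule that deleting the braces, vertical bar, equal sign and one side of the equal sign leaves conforming text, can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the corpus's own tooling: the function names strip_microtags and find_microtags and the "spoken"/"intended" labels for the two sides of the equal sign are hypothetical.

```python
import re

# Matches a microtag with no further braces inside it: {xx|contents}
_INNERMOST = re.compile(r"\{([a-z]{2})\|([^{}]*)\}")


def find_microtags(text: str) -> list:
    """Return (tag type, contents) pairs for the innermost microtags in text."""
    return _INNERMOST.findall(text)


def strip_microtags(text: str, keep: str = "spoken") -> str:
    """Remove microtag punctuation, keeping one side of any equal sign.

    keep="spoken" keeps what is left of '=', keep="intended" what is right of it.
    Nested microtags are handled by stripping innermost tags repeatedly.
    """
    if keep not in ("spoken", "intended"):
        raise ValueError("keep must be 'spoken' or 'intended'")

    def _replace(match):
        spoken, sep, intended = match.group(2).partition("=")
        return intended if (sep and keep == "intended") else spoken

    previous = None
    while previous != text:
        previous = text
        text = _INNERMOST.sub(_replace, text)
    return text


print(strip_microtags("the {le|c} in {ci|crane bay}"))
print(strip_microtags("wait {ip|til=until} then", keep="intended"))
```

Stripping with keep="spoken" gives a rendering close to what was uttered, while keep="intended" gives the glossed words, which may be more convenient for word-level searching.
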
We have also provided for optional expansion of the microtags into full tags using the SGML SHORTREF feature -- see lib/tei/maptaskt.dtd for details.

Unfortunately, the desire to differentiate clearly between speech and/or other noise which actually occurred on the one hand and editorial commentary on the other, means that we have used a new tag 'editorial' with attribute 'text' for such cases, since the existing 'note' tag makes the wrong choice here, requiring editorial annotation to appear as text rather than as the value of an attribute.

Completing the catalogue of non-standard tags, we have used 'unclear' to delimit regions which the transcribers indicated they could not be sure of, and in a few cases where they could not even guess, the pseudo-word "indecipherableSpeech" has been used, tagged with 'unclear'.

We have used the standard TEI tags 'foreign' for the few cases of non-English words and phrases, 'sic' for unexpected usage and 'event' for sneezes, coughs, laughs and the like.

Note that in the cases of 'unclear', 'foreign' and 'sic', we observe the invariant that the start and end tags are on separate lines, with the enclosed text on a line of its own in between, whereas for 'editorial' and 'event', the substantive material is the value of an attribute, and thus embedded within the start tag. This is in keeping with our intention that all and only what was actually said by the talkers appears on lines without any SGML markup.
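
Because markup lines always begin with a left angle bracket and turns are separated by blank lines, a transcript can be divided into turns, and each turn into markup and speech, without an SGML parser. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, and the SAMPLE text is invented for illustration (in particular the unquoted attribute syntax and the wording of turn 31 are guesses, not copied from a transcript file).

```python
import re

# Invented sample; attribute syntax and the first turn's wording are assumptions.
SAMPLE = (
    "<u who=G n=31>\n"
    "Have you got a {ci|crane bay}?\n"
    "\n"
    "<u who=F n=32>\n"
    "No.  Sorry.\n"
)

# Text lines begin with a letter or a left brace.
_TEXT_START = re.compile(r"[A-Za-z{]")


def split_turns(transcript: str) -> list:
    """Split a transcript into turns, separating markup lines from text lines."""
    turns = []
    for block in re.split(r"\n{2,}", transcript.strip("\n")):
        lines = [line for line in block.split("\n") if line]
        if not lines:
            continue
        turns.append({
            "markup": [line for line in lines if line.startswith("<")],
            "text": [line for line in lines if not line.startswith("<")],
        })
    return turns


def bad_text_lines(turns: list) -> list:
    """Return text lines that do not begin with a letter or a left brace."""
    return [line for turn in turns for line in turn["text"]
            if not _TEXT_START.match(line)]


turns = split_turns(SAMPLE)
for turn in turns:
    print(turn["markup"], turn["text"])
print(bad_text_lines(turns))
```

A check such as bad_text_lines is only a spot test of one invariant from section I; the remaining invariants (spacing, permitted characters, hyphen placement) would need their own checks.
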