English CTS Treebank with Structural Metadata
|Ann Bies, Haejoong Lee, Stephanie Strassel, Christopher Walker
|LDC Catalog No.:
|January 16, 2009
|two channel ulaw
|natural language processing
LDC User Agreement for Non-Members
|Subscription & Standard Members, and Non-Members
|Bies, Ann, et al. English CTS Treebank with Structural Metadata LDC2009T01. Web Download. Philadelphia: Linguistic Data Consortium, 2009.
English CTS Treebank with Structural Metadata, Linguistic Data Consortium (LDC) catalog number LDC2009T01 and ISBN 1-58563-476-X, consists of metadata and syntactic structure annotations for 144 English telephone conversations, or 140,000 words, from data used in the EARS (Effective, Affordable, Reusable Speech-to-Text) program. English CTS Treebank with Structural Metadata was created to support EARS work in English. It applies EARS metadata extraction annotations and Penn Treebank methods to conversations from Switchboard-1 Release 2 (LDC97S62) and from data collected for EARS under the Fisher Protocol (released in EARS as LDC2004E16, LDC2004E29 and LDC2005E73).
The purpose of the EARS program was to develop robust speech recognition technology to address a range of languages and speaking styles. LDC provided conversational and broadcast speech and transcripts, annotations, lexicons and texts for language modeling in each of the EARS languages (Arabic, Chinese, English). LDC also supported a metadata extraction (MDE) research evaluation, the goal of which was to enable technology to take raw speech-to-text (STT) output and to refine it into forms of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity, are further elements in the readable transcript. Some of the data developed by LDC for the MDE task is available in the LDC Catalog, namely RT-04 MDE Training Data Speech (LDC2005S16) and RT-04 MDE Training Data Text/Annotations (LDC2005T24).
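The readability steps described above can be illustrated with a minimal Python sketch. It is not part of the corpus or of any EARS tool. The token representation, a (word, kind, su_end) tuple, and the function name readable_transcript are assumptions made purely for illustration.

```python
def readable_transcript(tokens):
    """Drop fillers and edited material, then emit one SU per line.

    tokens: list of (word, kind, su_end) tuples (a hypothetical layout).
      kind   is "lexeme", "filler" or "edit"
      su_end is the punctuation closing an SU on this token, or None
    """
    lines, current = [], []
    for word, kind, su_end in tokens:
        if kind == "lexeme":          # keep only fluent content words
            current.append(word)
        if su_end is not None:        # an SU boundary closes the line
            if current:
                text = " ".join(current)
                lines.append(text[0].upper() + text[1:] + su_end)
            current = []
    if current:                       # trailing words with no SU boundary
        lines.append(" ".join(current))
    return lines


example = [
    ("um", "filler", None),
    ("we", "edit", None),
    ("i", "lexeme", None),
    ("like", "lexeme", None),
    ("holidays", "lexeme", "."),
    ("you", "filler", None),
    ("know", "filler", None),
    ("do", "lexeme", None),
    ("you", "lexeme", None),
    ("travel", "lexeme", "?"),
]

for line in readable_transcript(example):
    print(line)
```

With the invented tokens above, the sketch prints "I like holidays." and "Do you travel?" on separate lines.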
The telephone speech used in English CTS Treebank with Structural Metadata was drawn from Switchboard-1 Release 2 (LDC97S62) and from data collected for EARS under the Fisher Protocol (released in EARS as LDC2004E16, LDC2004E29 and LDC2005E73). The speech for all files was recorded on two channels with a sampling rate of 8000 Hz and was encoded in ulaw format.
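For readers unfamiliar with ulaw encoding, the following Python sketch decodes 8-bit mu-law bytes to 16-bit linear samples using the standard G.711 expansion and separates two channels. It assumes headerless audio with the channels interleaved sample by sample. The container format of the corpus files is not stated here, so any file header would need to be handled separately. The function names are hypothetical.

```python
def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                  # mu-law codes are stored inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def split_channels(raw):
    """Decode interleaved two-channel mu-law bytes into two sample lists.

    Assumes raw holds samples in the order A, B, A, B, ... with no header.
    """
    channel_a = [ulaw_to_linear(b) for b in raw[0::2]]
    channel_b = [ulaw_to_linear(b) for b in raw[1::2]]
    return channel_a, channel_b


SAMPLE_RATE_HZ = 8000  # sampling rate stated for this corpus


def duration_seconds(raw):
    """Duration of an interleaved two-channel buffer at 8000 Hz."""
    return (len(raw) // 2) / SAMPLE_RATE_HZ
```

At one byte per sample per channel, one second of two-channel audio in this encoding occupies 16,000 bytes.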
The Fisher data was transcribed by LDC staff; for the Switchboard data, transcripts developed at the Institute for Signal and Information Processing at Mississippi State University were used.
Structural Metadata Annotation
The transcribed data was annotated to SimpleMDE V6.2, an annotation task defined by LDC that consisted of the following elements: Edit Disfluencies (repetitions, revisions, restarts and complex disfluencies), Fillers (including, e.g., filled pauses and discourse markers) and SUs, or syntactic/semantic units. Each of these elements is described below:
- Edit Disfluencies: Edit disfluencies, or speech repairs, occur when speakers correct or alter their utterances or abandon them entirely and start over. Edit disfluencies have a more complex internal structure than fillers, consisting of the original utterance (reparandum), an interruption point, an optional editing phase and a correction. There are four types of disfluencies annotated in SimpleMDE: repetitions; revisions; restarts; and complex disfluencies, which consist of multiple or nested edits. In SimpleMDE, annotators labeled only the deletable region (DELREG) of the disfluency which corresponded to the reparandum. In cases where the reparandum contained multiple disfluent utterances, annotators identified the maximal extent of the disfluent portion, starting with the left edge of the first disfluency and continuing to the right edge (IP) of the final disfluency.
- Fillers: While the term filler has traditionally been synonymous with filled pause, SimpleMDE uses the term to encompass a broad set of vocalized space-fillers: filled pauses (FPs), discourse markers (DMs), explicit editing terms (EETs) and asides/parentheticals (A/Ps). Excepting the last category, fillers can be understood as words that do not alter the propositional content of the material into which they are inserted. For example, FPs include nonlexemes, such as um or ah, that speakers use to indicate hesitation or to maintain control of a conversation. A DM is a word or phrase that functions primarily as a structuring unit of spoken language, such as actually, now, anyway, see, basically, so, I mean, well, let's see, you know, like, you see. DMs often signal the speaker's intention to mark a boundary in discourse, like a change in speaker or the beginning of a new topic. There is no exhaustive list of DMs for a given language due to their wide range of functions, colloquial variations, and the difficulty of defining them precisely. Furthermore, words that are used as discourse markers can be used for other purposes. EETs occur during an edit disfluency and consist of an overt statement (e.g., I'm sorry) from the speaker recognizing the disfluency. Asides and parentheticals (A/Ps) are different from the other filler types in that they convey semantic information in the form of a short side comment before returning to the main topic. This may be either on a new topic (asides) or on the same topic of the larger utterance (parentheticals). Both break up the stream of discourse and are often accompanied by noticeable prosodic features.
- Syntactic Units: One of the goals of MDE annotation is the identification of all units within the discourse that function to express a complete thought or idea on the part of the speaker. Within MDE these elements are called SUs (Syntactic, Semantic or Slash Units). As with disfluency annotation, the goal of SU labeling is to improve transcript readability by presenting information in small, structured, coherent chunks. There are four sentence-level SUs. Statements are complete SUs that function as a declarative statement and are marked with /.; questions are complete SUs that function as an interrogative and are marked with /?. Backchannels are an open class of words uttered by the non-dominant speaker to indicate engagement in the conversation and are marked with /@. Incomplete SUs occur when an utterance does not constitute a grammatically complete sentence, phrase or continuer, and does not express a complete thought; these are marked with /-. To enhance inter-annotator consistency, there are also sentence-internal clausal and coordinating SUs (/, and /&).
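The SU markers listed above lend themselves to a simple lookup. The Python sketch below splits a marked-up string into units and names each unit's type. It assumes the markers appear as separate whitespace-delimited tokens, which may not match the actual annotation files, and the example sentence is invented.

```python
# Marker-to-type mapping taken from the SU description above.
SU_MARKERS = {
    "/.": "statement",
    "/?": "question",
    "/@": "backchannel",
    "/-": "incomplete",
    "/,": "clausal",
    "/&": "coordinating",
}


def split_sus(text):
    """Return (su_type, words) pairs for a marker-annotated string.

    Assumes each SU marker is its own whitespace-separated token.
    Words after the last marker are returned with type "unmarked".
    """
    units, current = [], []
    for token in text.split():
        if token in SU_MARKERS:
            units.append((SU_MARKERS[token], " ".join(current)))
            current = []
        else:
            current.append(token)
    if current:
        units.append(("unmarked", " ".join(current)))
    return units


for su_type, words in split_sus("do you travel /? uh-huh /@ we went /-"):
    print(f"{su_type}: {words}")
```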
Parsing and Treebank Annotation
The existing MDE annotations were converted from RTTM format into a format appropriate for the automatic parser, enabling the generation of accurate parses in a form that would require as little hand modification by the Treebank team as possible. RTTM is a format developed by NIST (National Institute of Standards and Technology) for the EARS program that labels each token in the reference transcript according to the properties it displays (e.g., lexeme versus non-lexeme, edit, filler, SU). The initial parse trees were produced using an entropy-based parser, which was trained on Switchboard transcripts supplemented with Wall Street Journal data (with a 4:1 ratio). These parses served as the starting point for a manual process which corrected the initial pass for each conversation.
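As a rough illustration of reading RTTM input, the Python sketch below splits one record into named fields. It assumes the nine-field, whitespace-separated layout (type, file, channel, begin time, duration, orthography, subtype, name, confidence) and ";"-prefixed comment lines of the NIST RTTM specification as commonly documented. The exact RTTM version used for this corpus is not stated here, and the sample line is invented.

```python
from collections import namedtuple

# Field names follow the commonly documented RTTM column order (an assumption).
RttmRecord = namedtuple(
    "RttmRecord", "type file chnl tbeg tdur ortho stype name conf"
)


def parse_rttm_line(line):
    """Parse one RTTM line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith(";"):
        return None
    fields = line.split()
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(fields)}")
    # Values stay as strings because times may be given as "<NA>".
    return RttmRecord(*fields[:9])


# Invented example record, not taken from the corpus.
record = parse_rttm_line("LEXEME conv_0001 1 12.34 0.21 holidays lex spkr_A <NA>")
print(record.type, record.ortho, record.tbeg)
```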
To provide high-quality parses, scripts used the MDE annotations to separate the edited material from the fluent part of each SU before parsing. The edits were then parsed separately and reinserted into the tree for presentation to the annotators. Some important issues are listed below:
- Words were tokenized in Syntactic Units using LDC's scripts.
- All of the punctuation provided in the markup was maintained in the SU for parsing because it was likely to enhance parse accuracy and was expected to appear in the final tree annotations.
- For parsing complex edits, contiguous edits were concatenated into one unit for parsing. In a few cases, edits occur across SUs in MDE annotations.
- Special treatment was required in the scripts for regions unannotated for MDE, complex edits, and SUs that consisted solely of edited material.
- The string "EDITED" was used as the non-terminal tag for edit regions inserted into the fluent parse trees. Additionally, a terminal node for the IP, (DISFL-IP +), was added at the end of the edits in an attempt to make the tree follow the conventions used in the Switchboard Treebank.
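The separation and reinsertion steps described above can be sketched in Python as follows. This is an illustration only, not LDC's scripts. The (word, is_edit) token layout, the function names and the bracketed parse in the example are hypothetical, and the parser itself is out of scope.

```python
def split_fluent_and_edits(tokens):
    """Separate edited words from fluent words within one SU.

    tokens: list of (word, is_edit) pairs (a hypothetical layout).
    Returns (fluent_words, edit_groups). Each edit group is a
    (position, words) pair, where position is the index in
    fluent_words before which the edit occurred. Contiguous edited
    words are concatenated into a single group.
    """
    fluent, groups = [], []
    for word, is_edit in tokens:
        if not is_edit:
            fluent.append(word)
        elif groups and groups[-1][0] == len(fluent):
            groups[-1][1].append(word)      # extends the current edit
        else:
            groups.append((len(fluent), [word]))
    return fluent, groups


def edited_node(parsed_edit):
    """Wrap a bracketed edit parse in EDITED and append the IP terminal."""
    return f"(EDITED {parsed_edit} (DISFL-IP +))"


tokens = [
    ("we", True), ("I", False), ("went", False),
    ("to", True), ("the", True),
    ("to", False), ("a", False), ("beach", False),
]
fluent, groups = split_fluent_and_edits(tokens)
print(fluent)
print(groups)
print(edited_node("(PP (IN to) (NP (DT the)))"))
```

Recording each edit's position in the fluent word list is one way to let the parsed edit be placed back at the right point in the fluent tree. The source does not say how the actual scripts tracked this.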
Manual treebank annotation was performed in accordance with existing treebank guidelines for conversational telephone speech as well as in accordance with revised general guidelines for treebanking.
For an example of the data in this corpus, please listen to this audio sample (wav) and view its parse tree (PDF). Note that the opening greeting of the conversation has been omitted in the parse tree. Only the discussion on holidays is present.