Annotation Information for English CTS Treebank with Structural Metadata

This readme file contains information on LDC2009T01, English CTS Treebank with Structural Metadata.

The eval, dev1 and dev2 sets have all been treebanked and QC'ed. The mde04 files have been updated to reflect several changes in tokenization, as discussed, and to reflect the EDITED/IP information in the treebank files. (This corrects V1.5, in which only one channel's EDITED annotations were actually copied into MDE.)

The eval, dev2, and dev1 directories contain the following types of files (as determined by the final extension):

The mde04 directory contains subdirectories for eval, dev1 and dev2 containing the following types of files:

Switchboard Treebank guidelines have been followed as closely as possible. An addendum to the Switchboard Treebank guidelines is in SwitchboardTB-Addendum.txt (this includes a few policy changes, such as the introduction of NML to replace and improve upon the old NX and NP-internal NAC, and it also includes a number of specific examples of policy clarification).

Note on a known discrepancy between the MDE .rttm and STT files:

  1. Filled Pauses

    The primary issue is that the RTTM format allows filled pauses to be indicated in two distinct ways:

      (a) the "fp" subtype of STT type "LEXEME"
      (b) the "filled_pause" subtype of MDE type "FILLER"

    In cases where (a) and (b) do not coincide, consider (b) to be the correct annotation. These are likely to be primarily cases of transcription error that are too costly to fix effectively for this release.
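
The comparison of the two filled-pause encodings can be sketched in Python as follows. This is a minimal illustration and not part of the release. It assumes each RTTM object line has nine whitespace-separated fields (type, file, channel, start time, duration, orthography, subtype, speaker, confidence), and it treats a LEXEME "fp" token as matched when its midpoint falls inside a FILLER "filled_pause" span on the same file and channel. The function names and the midpoint criterion are hypothetical choices made for this example.

```python
def parse_rttm(lines):
    """Yield (type, file, chan, start, dur, subtype) for each timed line.

    Assumed field order (not verified against this release):
    type file chnl tbeg tdur ortho stype name conf
    """
    for line in lines:
        fields = line.split()
        # Skip blank lines, comment lines, and short lines.
        if len(fields) < 9 or fields[0].startswith(";"):
            continue
        try:
            start, dur = float(fields[3]), float(fields[4])
        except ValueError:
            # Lines without numeric timing (e.g. "<NA>") are ignored.
            continue
        yield fields[0], fields[1], fields[2], start, dur, fields[6]


def uncovered_fp_lexemes(lines):
    """Return LEXEME/fp tokens not covered by a FILLER/filled_pause span."""
    records = list(parse_rttm(lines))
    fillers = [r for r in records
               if r[0] == "FILLER" and r[5] == "filled_pause"]
    result = []
    for rec in records:
        if rec[0] != "LEXEME" or rec[5] != "fp":
            continue
        _, fname, chan, start, dur, _ = rec
        mid = start + dur / 2.0
        covered = any(
            f[1] == fname and f[2] == chan and f[3] <= mid <= f[3] + f[4]
            for f in fillers
        )
        if not covered:
            result.append((fname, chan, start, dur))
    return result
```

Tokens returned by `uncovered_fp_lexemes` are cases where (a) is present without (b). The opposite direction, FILLER spans containing no "fp" lexeme, could be checked in the same way.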

Note on known discrepancies between the MDE .rttm and Treebank files:

  1. Capitalization

    The transcribers capitalized the word they thought started an SU. When SU annotation later placed the boundary elsewhere, so that the word no longer started an SU, the transcript was not changed and the original capitalization remained.

    As a result, RTTM files have capitalization at the start of an SU when the transcriber and SU annotator agreed and capitalization elsewhere when they didn't. This is of particular note because the RT-04F evaluation reference condition left this capitalization in the data, allowing it to be exploited for metadata detection. Besides its obvious utility for SU boundary detection, it also proved useful for identifying edits and fillers, as these examples (from the first file of dev2) show:

While this capitalization artificially improves metadata detection, it also makes it difficult to use this data set for named entity detection, since sentence-internal capitalization may reflect either a named entity or an original transcriber SU boundary. As above, changes involving the transcripts are too costly to fix effectively for this release.
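
One way to locate the affected tokens is to list capitalized LEXEME tokens that do not coincide with the start of an SU. The Python sketch below is illustrative only. It assumes nine whitespace-separated RTTM fields per line (type, file, channel, start time, duration, orthography, subtype, speaker, confidence) and matches token and SU start times within a hypothetical tolerance `tol`. The function name is likewise an assumption made for this example.

```python
def internal_capitals(lines, tol=0.01):
    """Return (start, word) for capitalized LEXEME tokens not starting an SU."""
    su_starts = {}  # (file, channel) -> list of SU start times
    tokens = []     # ((file, channel), start, word)
    for line in lines:
        fields = line.split()
        if len(fields) < 9 or fields[0].startswith(";"):
            continue
        try:
            start = float(fields[3])
        except ValueError:
            continue  # no numeric start time on this line
        key = (fields[1], fields[2])
        if fields[0] == "SU":
            su_starts.setdefault(key, []).append(start)
        elif fields[0] == "LEXEME":
            tokens.append((key, start, fields[5]))

    flagged = []
    for key, start, word in tokens:
        if not word[:1].isupper():
            continue
        # A capitalized token at an SU start is expected; skip it.
        if any(abs(start - s) <= tol for s in su_starts.get(key, [])):
            continue
        flagged.append((start, word))
    return flagged
```

The flagged list mixes named entities, the pronoun "I", and leftover transcriber SU boundaries. The sketch cannot separate these, which is the ambiguity described above.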


Contact: ldc@ldc.upenn.edu
© 2008 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.