-------------------------------------------------------------------------------
Changes in v20040405:
  * section (4): added segment feature "category"

Changes in v20040316:
  * section (4): added token category "unintelligible"

Changes in v20040218:
  * depod -> delreg

Changes in v20040209:
  * section (6): added su_tokens()
  * section (4): added "oldType" feature to the SU and "unannotated" to
    the values of the "type" feature of SU

Changes in v20040126:
  * section (4): added "responsiveDM" feature to the discourseMarker

Changes in v20040123:
  * section (2), (4): added morpheme
  * section (6): added add_morpheme()

Changes in v20040116:
  * section (4): "properName" (one of the values of the token's category
    feature) is now "properNoun"
  * section (4): added channel
  * section (6): added feature_names(), copy(),
    disable_type_feature_check() and enable_type_feature_check()

Changes in v20040109:
  * initial document
-------------------------------------------------------------------------------

The MDE AG format (v20040405)


(1) Preliminaries

* List of supported annotations

  Skeletal annotations:
    file
    channel
    speaker
    section
    backgroundNoise
    turn
    segment
    token
    morpheme

  Metadata annotations:
    SU
    aside
    delreg
    discourseMarker
    explicitEditingTerm
    filledPause
    ipLeft
    ipRight
    noRTMetadata
    questionableTranscription

* Pointers

  (Note: AGSet, AG, Annotation and Anchor, written with capital A's, refer
  to the data objects in the LDC's implementation of Annotation Graph.)

  This format relies heavily on pointers to realize the hierarchy of
  annotations and the horizontal links between them. A pointer is simply a
  feature of an Annotation whose value is the id of another annotation.
  The hierarchy is built by giving each child annotation a pointer to its
  parent. The horizontal links, which are used to create a token stream
  within a speaker, are realized by giving each token annotation a pointer
  to its right sibling.


(2) Hierarchy of skeletal annotations

                  file
                    | \
                    |  \
                    |   \
                 channel section
                  / |       |
                 /  |       |
                /   |       |
 backgroundNoise    |       |
                 speaker    .
                    |     /
                    |   /
                    | /
                  turn
                    |
                    |
                    |
                 segment
                    |
                    |
                    |
                  token
                    |
                    |
                    |
                morpheme

The "file" represents an audio recording or a transcript. It is
represented by an AGSet.

The "channel" corresponds to an audio channel and is represented by an AG.

The "section" identifies a section of a broadcast. "section" annotations
exist only in broadcast news files, and they are kept in a special AG
whose id ends with ":_S".

The "speaker" represents a speaker on an audio channel. The "turn",
"segment" and "token" annotations each have a pointer to a "speaker"
annotation. The "token" annotations of a "speaker" form a stream in
chronological order.

The "backgroundNoise" corresponds to a part of the audio signal that is
identified as background noise, which is a property of the underlying
audio signal. It is implemented as an Annotation that shares Anchors with
"token" or "segment" annotations.


(3) Hierarchy of metadata annotations

  su: SU
  as: aside
  de: delreg
  dm: discourseMarker
  ee: explicitEditingTerm
  fp: filledPause
  il: ipLeft
  ir: ipRight
  no: noRTMetadata
  qt: questionableTranscription

                 speaker
                    |
    .---.---.---.---+---.---.---.---.---.
    |   |   |   |   |   |   |   |   |   |
    |   |   |   |   |   |   |   |   |   |
    su  as  de  dm  ee  fp  il  ir  no  qt
    |   |   |   |   |   |   |   |   |   |
    |   |   |   |   |   |   |   |   |   |
    '---'---'---'---+---'---'---'---'---'
                    |
                  token

The "SU" annotation marks only the last "token" annotation of the semantic
unit. The complete list of SU tokens can be recovered by means of the
"_next" pointers of the tokens. Other annotations mark all tokens within
their range using pointers.


(4) Available features and values for annotations

* Skeletal annotations

  backgroundNoise
  --------------------
    None

  channel
  --------------------
    Feature: channelId
      value: 1, 2, 3, ...

  morpheme
  --------------------
    Feature: text
      value: (any string)

  section
  --------------------
    Feature: category
      value: nonnews
             story
             teaser

  segment
  --------------------
    Feature: type
      value: non-speech
             unintelligible
    Feature: category
      value: unannotatedOverlap

  speaker
  --------------------
    Feature: name
      value: (string)
    Feature: native
      value: native
             non-native
    Feature: speakerId
      value: (string)
    Feature: type
      value: child
             female
             male
             other

  token
  --------------------
    Feature: category
      value: acronym
             filledPause
             idiosyncratic
             mispronounced
             postFragment
             preFragment
             properNoun
             semiIntelligible
             spokenLetter
             unknownForeignWords
             unintelligible
             vocalNoise
    Feature: language
      value: (string)
    Feature: punctuation
      value: comma
             period
             question
    Feature: text
      value: (string)

  turn
  --------------------
    None

* Metadata annotations

  SU
  --------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false
             true
    Feature: type
      value: backchannel
             clausal
             coordinating
             incomplete
             question
             statement
             unannotated
    Feature: oldType
      value: backchannel
             clausal
             coordinating
             incomplete
             question
             statement
             unannotated

  aside
  --------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false
             true
    Feature: type
      value: 0
             1

  delreg
  --------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false
             true

  discourseMarker
  --------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false
             true
    Feature: responsiveDM
      value: false
             true

  explicitEditingTerm
  --------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false
             true

  filledPause
  --------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false
             true

  ipLeft
  --------------------
    None

  ipRight
  --------------------
    None

  noRTMetadata
  --------------------
    Feature: comments
      value: (any string)

  questionableTranscription
  --------------------
    Feature: comments
      value: (any string)


(5) Low level, structural features of annotations

These are features that hold structural information about annotations.
When the RT API is used, there is no need to access these features
directly, so they are hidden from the user.

  _sn      A serial number. The larger the number, the later the
           annotation is located in chronological order.

  _st      A pointer to the first token child.

  _et      A pointer to the last token child.

  _next    A pointer to the next token within the segment. Only available
           for tokens. If the value is "", the token is the last one in
           the segment.

  _type*   A parental pointer.
           Examples: _speaker*, _section*, _turn*, _SU*, _ipLeft*, etc.


(6) RT API functions by category

This API has been developed to alleviate the burden of processing the
low-level details of the file format, thus making data production more
efficient. This is the current list of functions, but it will keep
changing until things are stabilized.

* File I/O

  fileId     load_transcript_swb(filename)
  fileId     load_transcript_hub4(filename)
  fileId     load(filename)
             save(fileId, filename)
             save_ag(fileId, filename)
             save_cag(fileId, filename)
             save_cagz(fileId, filename)
  string     save_format(fileId)

* Navigation, querying

  list       channels(fileId)
  list       speakers(channelId)
  list       sections(fileId)
  list       section_turns(sectionId)
  list       section_turns(sectionId, channelId)
  list       children(id, type)
  list       tokens(id)
  list       su_tokens(id)
  id         parent(id, type)
  channelId  channel(id)
  fileId     file(id)
  tokenId    next_token(id)
  tokenId    prev_token(id)
  tokenId    first_token(id)
  tokenId    last_token(id)
  bool       has_start_offset(id)
  bool       has_end_offset(id)
  double     start_offset(id)
  double     end_offset(id)
  string     type(id)
  string     feature(id, feature)
  bool       exists_feature(id, feature)
  set        feature_names(feature)
  set        by_feature(channelId, feature, value, type)
  set        by_feature(channelId, feature, value)

* Modification

             set_feature(id, feature, value)
             set_features(id, map fv_pairs)
             delete_feature(id, feature)
             delete_file(fileId)
             delete_ann(id)
  bool       set_start_offset(id, offset)
  bool       set_end_offset(id, offset)

* Construction

  fileId     create_file(fileId)
  channelId  create_channel(fileId)
  speakerId  create_speaker(channelId)
  sectionId  add_section(fileId)
  turnId     add_turn(speakerId)
  segmentId  add_segment(speakerId)
  tokenId    add_token(speakerId)
  morphemeId add_morpheme(tokenId)
  id         create_ann(tkn1, tkn2, type)
  id         create_ann_on_channel(channelId, start, end, type)
  id         create_ann_on_channel_ss(channelId, ann1, ann2, type)
  id         create_ann_on_channel_se(channelId, ann1, ann2, type)
  id         create_ann_on_channel_es(channelId, ann1, ann2, type)
  id         create_ann_on_channel_ee(channelId, ann1, ann2, type)
  bool       copy(id, id)
             disable_type_feature_check()
             enable_type_feature_check()
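

Illustration: the pointer scheme

The pointer scheme of sections (1), (3) and (5) can be pictured with a
small, self-contained Python sketch. This is a toy in-memory model, not
the RT API and not the LDC Annotation Graph library: the dictionary
layout, the ids ("tok1", "su1", "fp1", "spk1") and the helper names
(make_tokens, token_stream, su_token_lists, children_of) are all
hypothetical. It also assumes that an SU covers every token after the
previous SU's last token up to and including its own last token, which is
one reading of the "_next" description above rather than a documented
rule.

```python
# Toy model of MDE AG pointers. Annotations are plain dicts of features.
# Everything here is hypothetical and only mirrors the ideas in the text:
#   "_next"     horizontal link from a token to its right sibling
#   "_<type>*"  parental pointer from a child to its parent annotation


def make_tokens(words, speaker_id="spk1"):
    """Create token annotations chained by "_next" under one speaker."""
    ids = ["tok%d" % (i + 1) for i in range(len(words))]
    anns = {}
    for i, tid in enumerate(ids):
        anns[tid] = {
            "type": "token",
            "text": words[i],
            "_speaker*": speaker_id,
            # "" marks the last token, as described for "_next"
            "_next": ids[i + 1] if i + 1 < len(ids) else "",
        }
    return anns, ids


def token_stream(anns, first_id):
    """Yield token ids in order by following "_next" pointers."""
    tid = first_id
    while tid != "":
        yield tid
        tid = anns[tid]["_next"]


def su_token_lists(anns, first_id, su_last_tokens):
    """Recover the tokens of each SU from its last token only.

    su_last_tokens maps an SU id to the id of the last token it marks.
    Assumption: an SU starts right after the previous SU's last token.
    """
    su_by_last = {tok: su for su, tok in su_last_tokens.items()}
    result = {}
    current = []
    for tid in token_stream(anns, first_id):
        current.append(tid)
        if tid in su_by_last:
            result[su_by_last[tid]] = current
            current = []
    return result


def children_of(anns, parent_id, pointer):
    """Return ids of annotations whose parental pointer names parent_id."""
    return [aid for aid, feats in anns.items()
            if feats.get(pointer) == parent_id]


# Usage: four tokens, a filledPause over the first one, and two SUs.
anns, ids = make_tokens(["uh", "i", "see", "thanks"])
anns["tok1"]["_filledPause*"] = "fp1"   # range annotation marks its token
su_tokens = su_token_lists(anns, ids[0], {"su1": "tok3", "su2": "tok4"})
print(su_tokens)
print(children_of(anns, "fp1", "_filledPause*"))
```

In this sketch only the tokens carry pointers. The SU itself records a
single token, which keeps SU boundaries cheap to move, while range
annotations such as filledPause are found by scanning for tokens whose
parental pointer names them.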