-------------------------------------------------------------------------------

Changes in v20040405:

	* section (4): added segment feature "category"

Changes in v20040316:

	* section (4): added token category "unintelligible"

Changes in v20040218:

	* depod -> delreg

Changes in v20040209:

        * section (6): added su_tokens()
        * section (4): added "oldType" feature to the SU and "unannotated"
        to the values of the "type" feature of SU

Changes in v20040126:

        * section (4): added "responsiveDM" feature to the discourseMarker

Changes in v20040123:

        * section (2), (4): added morpheme
        * section (6): added add_morpheme()

Changes in v20040116:

        * section (4): "properName" (one of the values of the token's category
        feature) is now "properNoun"
        * section (4): added channel
        * section (6): added feature_names(), copy(),
        disable_type_feature_check() and enable_type_feature_check()

Changes in v20040109:

        * initial document

-------------------------------------------------------------------------------


                                The MDE AG format
                                   (v20040405)


(1) Preliminaries

    * List of supported annotations

    Skeletal annotations:

    file channel speaker section backgroundNoise
    turn segment token morpheme

    Metadata annotations:

    SU aside delreg discourseMarker explicitEditingTerm
    filledPause ipLeft ipRight
    noRTMetadata questionableTranscription


    * Pointers

    (Note: AGSet, AG, Annotation and Anchor with capital A's refer to
    the data objects in the LDC's implementation of Annotation Graph.)
                
    This format relies heavily on pointers to realize the hierarchy of
    annotations and the horizontal links.  A pointer is simply a feature
    of an Annotation whose value is the id of another annotation.

    The hierarchy is built by giving each child annotation a pointer to
    its parent.

    The horizontal links, which are used to create a token stream within
    a speaker, are realized by giving each token annotation a pointer to
    its right sibling.
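
    The pointer scheme can be illustrated with a small in-memory model.
    This is only a sketch: the Ann class, the ids and the dict storage
    are assumptions made for illustration, not the LDC data objects.
    The "_segment*", "_speaker*" and "_next" feature names follow the
    conventions listed in section (5).

```python
from dataclasses import dataclass, field


@dataclass
class Ann:
    """Toy stand-in for an Annotation: a type plus a feature map."""
    type: str
    features: dict = field(default_factory=dict)


# Hypothetical ids; a pointer feature holds the id of another annotation.
anns = {
    "spk1": Ann("speaker"),
    "seg1": Ann("segment", {"_speaker*": "spk1"}),
    "tok1": Ann("token", {"text": "hello", "_segment*": "seg1", "_next": "tok2"}),
    "tok2": Ann("token", {"text": "world", "_segment*": "seg1", "_next": ""}),
}


def children(parent_id, child_type):
    """Ids of child_type annotations whose parental pointer names parent_id."""
    pointer = "_" + anns[parent_id].type + "*"
    return [i for i, a in anns.items()
            if a.type == child_type and a.features.get(pointer) == parent_id]


def stream_from(token_id):
    """Follow right-sibling pointers until the empty pointer is reached."""
    out = []
    while token_id:
        out.append(token_id)
        token_id = anns[token_id].features["_next"]
    return out
```

    Here the hierarchy is recovered by scanning for annotations that
    point up to a given parent, and the token stream by following each
    token's pointer to its right sibling.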


(2) Hierarchy of skeletal annotations


                        file
                         |  \
                         |   \
                         |    \
                     channel  section
                      /  |     |
                     /   |     |
                    /    |     |
       backgroundNoise   |     |
                      speaker  .
                         |    /
                         |   /
                         |  /
                        turn
                         |
                         |
                         |
                      segment
                         |
                         |
                         |
                       token
                         |
                         |
                         |
                      morpheme


    The "file" represents an audio recording or a transcript.
    It's represented with an AGSet.

    The "channel" corresponds to an audio channel and is represented
    with an AG.

    The "section" identifies a section of the broadcast.  The "section"
    annotations exist only in the broadcast news files, and they are
    kept in a special AG whose id ends with ":_S".

    A "speaker" annotation represents a speaker on an audio channel.
    The "turn", "segment" and "token" annotations have a pointer to a
    "speaker" annotation.

    The "token" annotations in a "speaker" form a stream in
    chronological order.

    The "backgroundNoise" corresponds to a part of an audio signal
    that is identified as background noise, which is a property of the
    underlying audio signal.  It is implemented as an Annotation that
    shares Anchors with "token" or "segment" annotations.


(3) Hierarchy of metadata annotations

    su: SU
    as: aside
    de: delreg
    dm: discourseMarker
    ee: explicitEditingTerm
    fp: filledPause
    il: ipLeft
    ir: ipRight
    no: noRTMetadata
    qt: questionableTranscription


                   speaker
                      |
      .---.---.---.---+---.---.---.---.---.
      |   |   |   |   |   |   |   |   |   |
      |   |   |   |   |   |   |   |   |   |

      su  as  de  dm  ee  fp  il  ir  no  qt

      |   |   |   |   |   |   |   |   |   |
      |   |   |   |   |   |   |   |   |   |
      '---'---'---'---+---'---'---'---'---'
                      |
                    token


    The "SU" annotation marks only the last "token" annotation of the
    semantic unit.  The complete list of SU tokens can be recovered by
    following the "_next" pointers of the tokens.

    Other annotations mark all tokens within their range using pointers.
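
    One way to recover the tokens of each SU from the last-token marks
    is sketched below.  It rests on an assumption this document does
    not state: that an SU covers every token after the previous SU's
    last token, up to and including its own last token.  The token
    table and ids are hypothetical; "_SU*" and "_next" are the pointer
    features from section (5).

```python
# Toy token table: id -> features.  Only the SU-final tokens carry "_SU*".
tokens = {
    "t1": {"text": "i", "_next": "t2"},
    "t2": {"text": "see", "_next": "t3", "_SU*": "su1"},
    "t3": {"text": "is", "_next": "t4"},
    "t4": {"text": "that", "_next": "t5"},
    "t5": {"text": "so", "_next": "", "_SU*": "su2"},
}


def su_token_groups(first_id):
    """Walk the stream and close a group at each SU-final token.

    Assumes an SU starts right after the previous SU ends (illustrative
    rule, not confirmed by the format description).
    """
    groups = {}
    pending = []
    cur = first_id
    while cur:
        pending.append(cur)
        su_id = tokens[cur].get("_SU*")
        if su_id:
            groups[su_id] = pending
            pending = []
        cur = tokens[cur]["_next"]
    # pending holds trailing tokens that no SU closed
    return groups, pending
```

    The second return value keeps any trailing tokens that are not
    closed by an SU, so they are not silently attached to one.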
    

(4) Available features and values for annotations

    * Skeletal annotations
    
    backgroundNoise
    ---------------------
    None


    channel
    --------------------
    Feature: channelId
      value: 1, 2, 3, ...


    morpheme
    --------------------
    Feature: text
      value: (any string)


    section
    --------------------
    Feature: category
      value: nonnews story teaser


    segment
    --------------------
    Feature: type
      value: non-speech unintelligible
    Feature: category
      value: unannotatedOverlap 


    speaker
    --------------------
    Feature: name
      value: (string)
    Feature: native
      value: native non-native
    Feature: speakerId
      value: (string)
    Feature: type
      value: child female male other


    token
    --------------------
    Feature: category
      value: acronym filledPause idiosyncratic mispronounced
             postFragment preFragment properNoun
             semiIntelligible spokenLetter unknownForeignWords
             unintelligible vocalNoise
    Feature: language
      value: (string)
    Feature: punctuation
      value: comma period question
    Feature: text
      value: (string)


    turn
    --------------------
    None
    
    
    * Metadata annotations

    SU
    ---------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false true
    Feature: type
      value: backchannel clausal coordinating incomplete question statement
             unannotated
    Feature: oldType
      value: backchannel clausal coordinating incomplete question statement
             unannotated


    aside
    ---------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false true
    Feature: type
      value: 0 1


    delreg
    ---------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false true


    discourseMarker
    ---------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false true
    Feature: responsiveDM
      value: false true


    explicitEditingTerm
    ---------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false true


    filledPause
    ---------------------
    Feature: comments
      value: (any string)
    Feature: difficultToAnnotate
      value: false true


    ipLeft
    ---------------------
    None


    ipRight
    ---------------------
    None


    noRTMetadata
    ---------------------
    Feature: comments
      value: (any string)


    questionableTranscription
    ---------------------
    Feature: comments
      value: (any string)
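
    The tables above lend themselves to a simple lookup when checking
    feature/value pairs.  The sketch below covers only a few annotation
    types and is an illustration of such a check; it is not a
    description of what enable_type_feature_check() does internally.
    The ALLOWED table and check_feature() are hypothetical names.

```python
BOOL = {"false", "true"}
SU_TYPES = {"backchannel", "clausal", "coordinating", "incomplete",
            "question", "statement", "unannotated"}

# annotation type -> feature -> allowed values (None means any string).
# Values are copied from the tables in this section; only a subset of
# the annotation types is included.
ALLOWED = {
    "section": {"category": {"nonnews", "story", "teaser"}},
    "segment": {"type": {"non-speech", "unintelligible"},
                "category": {"unannotatedOverlap"}},
    "SU": {"comments": None, "difficultToAnnotate": BOOL,
           "type": SU_TYPES, "oldType": SU_TYPES},
    "discourseMarker": {"comments": None, "difficultToAnnotate": BOOL,
                        "responsiveDM": BOOL},
    "ipLeft": {},  # no features
}


def check_feature(ann_type, feature, value):
    """True if the feature exists for ann_type and the value is allowed."""
    features = ALLOWED.get(ann_type)
    if features is None or feature not in features:
        return False
    allowed = features[feature]
    return allowed is None or value in allowed
```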

    
(5) Low level, structural features of annotations
    
    These features hold structural information about annotations.
    When the RT API is used, there is no need to access them directly,
    so they are hidden from the user.
    
    _sn         A serial number.  The larger the number, the later the
                annotation is located in chronological order.
    
    _st         A pointer to the first token child.
    
    _et         A pointer to the last token child.
    
    _next       A pointer to the next token within the segment.  Only
                available for tokens.  If the value is "", then the token
                is the last one in the segment.

    _type*      A parental pointer.
                Examples: _speaker*, _section*, _turn*, _SU*, _ipLeft*, etc.
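
    As an illustration of how these features fit together, the sketch
    below lists the tokens covered by a segment by walking "_next" from
    "_st" to "_et", and orders annotations by "_sn".  The feature table
    and ids are hypothetical, and storing "_sn" as a string is an
    assumption.

```python
# Toy feature table: annotation id -> structural features.
features = {
    "seg1": {"_sn": "1", "_st": "tok1", "_et": "tok3"},
    "tok1": {"_sn": "2", "_segment*": "seg1", "_next": "tok2"},
    "tok2": {"_sn": "3", "_segment*": "seg1", "_next": "tok3"},
    "tok3": {"_sn": "4", "_segment*": "seg1", "_next": ""},
}


def covered_tokens(ann_id):
    """Tokens from the first token child (_st) through the last (_et)."""
    end = features[ann_id]["_et"]
    out = []
    cur = features[ann_id]["_st"]
    while cur:
        out.append(cur)
        if cur == end:
            break
        cur = features[cur]["_next"]
    return out


def chronological(ids):
    """Sort annotation ids by serial number, compared numerically."""
    return sorted(ids, key=lambda i: int(features[i]["_sn"]))
```

    Comparing "_sn" numerically avoids the string ordering in which
    "10" would sort before "2".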

    
(6) RT API functions by category

    This API has been developed to alleviate the burden of processing
    the low level details of the file format, thus making the data
    production more efficient.

    This is the current list of functions; it will keep changing
    until things are stabilized.

    * File I/O
    
           fileId load_transcript_swb(filename)
           fileId load_transcript_hub4(filename)
           fileId load(filename)
                  save(fileId, filename)
                  save_ag(fileId, filename)
                  save_cag(fileId, filename)
                  save_cagz(fileId, filename)
           string save_format(fileId)

    * Navigation, querying
    
  list<channelId> channels(fileId)
  list<speakerId> speakers(channelId)
  list<sectionId> sections(fileId)
     list<turnId> section_turns(sectionId)
     list<turnId> section_turns(sectionId, channelId)
         list<id> children(id, type)
    list<tokenId> tokens(id)
    list<tokenId> su_tokens(id)
               id parent(id, type)
        channelId channel(id)
           fileId file(id)
          tokenId next_token(id)
          tokenId prev_token(id)
          tokenId first_token(id)
          tokenId last_token(id)
             bool has_start_offset(id)
             bool has_end_offset(id)
           double start_offset(id)
           double end_offset(id)
           string type(id)
           string feature(id, feature)
             bool exists_feature(id, feature)
      set<string> feature_names(feature)
          set<id> by_feature(channelId, feature, value, type)
          set<id> by_feature(channelId, feature, value)
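
    A typical navigation pattern is to descend from a file to its
    channels, speakers and tokens.  The sketch below assumes a Python
    binding in which the functions above are methods of an object; that
    binding, the FakeRT stub and its data are hypothetical and exist
    only to make the example runnable.  Only the function names and
    argument order are taken from the listing, and passing a speaker id
    to tokens() is an assumption.

```python
class FakeRT:
    """Hard-coded stand-in so the sketch runs; NOT the real RT API."""
    _channels = {"f1": ["c1"]}
    _speakers = {"c1": ["s1", "s2"]}
    _tokens = {"s1": ["t1", "t2"], "s2": ["t3"]}
    _text = {"t1": "good", "t2": "morning", "t3": "hi"}

    def channels(self, file_id):
        return list(self._channels[file_id])

    def speakers(self, channel_id):
        return list(self._speakers[channel_id])

    def tokens(self, ann_id):
        return list(self._tokens[ann_id])

    def exists_feature(self, ann_id, feature):
        return feature == "text" and ann_id in self._text

    def feature(self, ann_id, feature):
        return self._text[ann_id]


def speaker_texts(rt, file_id):
    """Map each speaker id to the space-joined text of its tokens."""
    out = {}
    for channel_id in rt.channels(file_id):
        for speaker_id in rt.speakers(channel_id):
            words = [rt.feature(t, "text")
                     for t in rt.tokens(speaker_id)
                     if rt.exists_feature(t, "text")]
            out[speaker_id] = " ".join(words)
    return out
```

    Checking exists_feature() before feature() keeps the loop from
    depending on how a missing feature is reported, which the listing
    does not specify.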
    
    * Modification
    
                  set_feature(id, feature, value)
                  set_features(id, map<feature,value> fv_pairs)
                  delete_feature(id, feature)
                  delete_file(fileId)
                  delete_ann(id)
             bool set_start_offset(id, offset)
             bool set_end_offset(id, offset)

    * Construction
    
           fileId create_file(fileId)
        channelId create_channel(fileId)
        speakerId create_speaker(channelId)
        sectionId add_section(fileId)
           turnId add_turn(speakerId)
        segmentId add_segment(speakerId)
          tokenId add_token(speakerId)
       morphemeId add_morpheme(tokenId)
               id create_ann(tkn1,tkn2, type)
               id create_ann_on_channel(channelId, start, end, type)
               id create_ann_on_channel_ss(channelId, ann1, ann2, type)
               id create_ann_on_channel_se(channelId, ann1, ann2, type)
               id create_ann_on_channel_es(channelId, ann1, ann2, type)
               id create_ann_on_channel_ee(channelId, ann1, ann2, type)
             bool copy(id, id)
                  disable_type_feature_check()
                  enable_type_feature_check()