TDF Format for LDC Transcripts

Tab Delimited Format (TDF) is a simple file format in which data is represented as a set of "records", each of which is a set of "fields" separated by tab characters. Records may have different types, as in the RTTM format [1]. This means two things: 1) two different records may have different numbers of fields, and 2) two fields, each belonging to a different record, may have different meanings, e.g. date vs. name.

The TDF format for LDC transcripts is a set of 13-field records plus some meta-information. A 13-field record is also called a segment, and its fields are listed below.

     1  file            file name or id        string
     2  channel         audio channel          number
     3  start           start time             number
     4  end             end time               number
     5  speaker         speaker name or id     string
     6  speakerType     speaker type           string
     7  speakerDialect  speaker dialect        string
     8  transcript      transcript             string
     9  section         section id             number
    10  turn            turn id                number
    11  segment         segment id             number
    12  sectionType     section type           string
    13  suType          SU type                string

In addition to the body of segments, there are a few lines of meta-information. The first line of the file declares the above field specification for segments in the following form.

    file;unicode channel;int start;float ...

The second and third lines specify where the "real" section boundaries are and what types the sections have. For example,

    ;;MM sectionTypes [u'report', u'nontrans', None]
    ;;MM sectionBoundaries [0.0, 425.3, 586.3, 9999999.0]

means that the first section starts at 0.0 seconds and its type is "report", and that the second section starts at 425.3 seconds and its type is "nontrans"; this is the last section (the last entries are ignored). These two lines are optional. In fact, some transcripts, such as telephone speech transcripts, don't include this section information.

Besides these, comments are allowed anywhere in the file except for the first line.
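
The layout described here can be read with a short parser. The following is a minimal Python sketch, not a reference implementation. It assumes that the items on the first line are separated by tabs, that the only type names are "unicode", "int" and "float", and that lines beginning with ";;" other than ";;MM" lines can be skipped. The names parse_tdf, section_starts, CONVERTERS, HEADER and SAMPLE are hypothetical. The sample values, and the types given in HEADER for the fields after "start", are invented for illustration.

```python
import ast

# Assumed mapping from header type names to Python converters. Only
# "unicode", "int" and "float" appear in the text; others fall back to str.
CONVERTERS = {"unicode": str, "int": int, "float": float}


def parse_tdf(text):
    """Return (field_names, segments, meta) parsed from TDF transcript text."""
    lines = text.splitlines()
    # Line 1 is the field specification: "name;type" items, assumed tab-separated.
    spec = [item.split(";", 1) for item in lines[0].split("\t")]
    names = [name for name, _ in spec]
    converters = [CONVERTERS.get(type_name, str) for _, type_name in spec]

    segments, meta = [], {}
    for line in lines[1:]:
        if line.startswith(";;MM"):
            # Meta line: ";;MM", a key, then a Python-style list literal.
            _, key, value = line.split(None, 2)
            meta[key] = ast.literal_eval(value)
        elif line.startswith(";;") or not line.strip():
            continue  # plain comment (blank lines are skipped as a precaution)
        else:
            values = line.split("\t")
            segments.append({
                name: (convert(value) if value != "" else None)
                for name, convert, value in zip(names, converters, values)
            })
    return names, segments, meta


def section_starts(meta):
    """Pair each section start time with its type, skipping None-typed entries."""
    types = meta.get("sectionTypes", [])
    boundaries = meta.get("sectionBoundaries", [])
    return [(start, kind) for start, kind in zip(boundaries, types)
            if kind is not None]


# Hypothetical sample: the header follows the 13-field list, the values are invented.
HEADER = "\t".join([
    "file;unicode", "channel;int", "start;float", "end;float",
    "speaker;unicode", "speakerType;unicode", "speakerDialect;unicode",
    "transcript;unicode", "section;int", "turn;int", "segment;int",
    "sectionType;unicode", "suType;unicode",
])
SAMPLE = "\n".join([
    HEADER,
    ";;MM sectionTypes [u'report', u'nontrans', None]",
    ";;MM sectionBoundaries [0.0, 425.3, 586.3, 9999999.0]",
    ";; an ordinary comment",
    "\t".join(["demo.sph", "0", "1.25", "3.5", "spk1", "male", "native",
               "hello world", "0", "0", "0", "report", "statement"]),
])

names, segments, meta = parse_tdf(SAMPLE)
print(segments[0]["start"], segments[0]["transcript"])
print(section_starts(meta))
```

Reading the converters from the first line, rather than hard-coding the 13 fields, keeps the sketch usable if a file declares its fields differently. Pairing start times with types and dropping None-typed entries is one reading of how the trailing entries in the two ";;MM" lists are ignored.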
A comment line starts with ";;" and can be ignored without affecting data integrity, with one exception: lines starting with ";;MM" may carry meaningful information, as shown in the example above.

References:

[1] RTTM Format, http://www.nist.gov/speech/tests/rt/rt2003/fall/docs/RTTM-format-v13.pdf