TDF Format for LDC Transcripts

The TDF (tab delimited format) used by LDC is a simple file format in which data is represented as a set of records, each of which is a set of fields separated by tab characters. Records may have different types, as in the RTTM format [1]. This means two things: 1) two records may have a different number of fields, and 2) fields at the same position in two different records may have different meanings, e.g., date vs. name.

The TDF format for LDC transcripts is a set of 13-field records plus some meta-information. A 13-field record is also called a segment, and its fields are listed below.

    No.  Field           Description          Type
    1    file            file name or ID      unicode
    2    channel         audio channel        int
    3    start           start time           float
    4    end             end time             float
    5    speaker         speaker name or ID   unicode
    6    speakerType     speaker type         unicode
    7    speakerDialect  speaker dialect      unicode
    8    transcript      transcript           unicode
    9    section         section ID           int
    10   turn            turn ID              int
    11   segment         segment ID           int
    12   sectionType     section type         unicode
    13   suType          SU type              unicode

Here SU stands for "sentence unit". Depending on the genre of the data, some fields may not contain useful information and may even be blank.

In addition to the body of segments, there are a few lines of meta-information. The first line of the file declares the field specification above, in the following form.

    file;unicode channel;int start;float ...

The second and third lines specify where the "real" section boundaries are and what types the sections have. For example,

    ;;MM sectionTypes [u'report', u'nontrans', None]
    ;;MM sectionBoundaries [0.0, 425.3, 586.3, 9999999.0]

means that the first section starts at 0.0 seconds and has type "report", and that the second section starts at 425.3 seconds and has type "nontrans". The second section is the last one; the trailing entries of the lists are ignored. These two lines are optional. In fact, some transcripts, such as telephone speech transcripts, do not include this information.
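As a rough illustration, the layout described above could be read with a short Python sketch like the one below. It is not an LDC tool: the names parse_header, parse_meta, parse_tdf and CASTS are hypothetical. The sketch assumes that the items on the first line and the fields of each segment are separated by tabs, that a ";;MM" line holds a key followed by a Python-style list literal, that "unicode" maps to Python's str, and that blank fields should become None. Lines starting with ";;" that are not recognised as metadata are skipped as comments.

```python
import ast

# Assumed mapping from TDF type names to Python converters.
CASTS = {"int": int, "float": float, "unicode": str}


def parse_header(line):
    """Split 'file;unicode<TAB>channel;int<TAB>...' into (name, type) pairs."""
    spec = []
    for item in line.rstrip("\r\n").split("\t"):
        name, _, type_name = item.partition(";")
        spec.append((name, type_name))
    return spec


def parse_meta(line):
    """Return (key, value) for a ';;MM key [list]' line, or None if it does not fit."""
    parts = line[len(";;MM"):].split(None, 1)
    if len(parts) != 2:
        return None
    try:
        # literal_eval accepts the u'...' prefix and None without running code.
        return parts[0], ast.literal_eval(parts[1])
    except (ValueError, SyntaxError):
        return None


def parse_tdf(lines):
    """Parse an iterable of text lines into (spec, meta, segments)."""
    lines = iter(lines)
    spec = parse_header(next(lines))  # the first line is the field specification
    meta, segments = {}, []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(";;"):
            # ';;MM' lines may carry metadata; any other ';;' line is a comment.
            parsed = parse_meta(line) if line.startswith(";;MM") else None
            if parsed is not None:
                meta[parsed[0]] = parsed[1]
            continue
        segment = {}
        for (name, type_name), raw in zip(spec, line.split("\t")):
            # Blank fields are allowed, so map them to None instead of casting.
            segment[name] = CASTS.get(type_name, str)(raw) if raw != "" else None
        segments.append(segment)
    return spec, meta, segments
```

Under these assumptions, passing an open text file to parse_tdf would yield the field specification as a list of (name, type) pairs, a dictionary holding any sectionTypes and sectionBoundaries lists, and one dictionary per segment. The file encoding is not stated here, so it would have to be chosen by the caller.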
Besides these, comments are allowed anywhere in the file except on the first line. A comment line starts with ";;" and does not affect the file's data, although lines starting with ";;MM" may carry meaningful information, as shown in the example above.

References:

[1] RTTM Format, http://www.nist.gov/speech/tests/rt/rt2003/fall/docs/RTTM-format-v13.pdf