TDF Format for LDC Transcripts

Revision: 1.4 (2010-09-29)

TDF Format (general description)
--------------------------------

Tab Delimited Format (TDF) is a simple file format in which data is
represented as a set of "records", which are in turn a set of "fields"
separated by tab characters.  Records may have different types as in the
RTTM format [1].  This implies two things: 1) two differnt records may
have different number of fields, and 2) two fields with same column index,
each belonging to a different record, may have different meanings, e.g
date vs. name.


TDF Format for LDC Transcripts
------------------------------

The TDF format for LDC transcripts is a set of 13-field records plus
some meta-information. This format was originally designed for use with LDC's
new transcription tool XTrans. The 13-field record is also called segment, and
all segments in the file are identical.  The 13 fields are listed below.

	 1 file			file name or id		string
	 2 channel		audio channel		number
	 3 start		start time		number
	 4 end			end time		number
	 5 speaker		speaker name or id	string
	 6 speakerType		speaker type		string
	 7 speakerDialect	speaker dialect		string
	 8 transcript		transcript		string
	 9 section		section id		number
	10 turn			turn id			number
	11 segment		segment id		number
	12 sectionType		section type		string
	13 suType		SU type			string

Note that the section ID, turn ID and segment ID fields can be used in an
application specific way. For instance, XTrans outputs some values for these
fields, but no assumption should be made on them.

In addition to the body of segments, there are a few lines of
meta-information.  The first line of the file declares the field
specification for segments (shown above) in the following format.

	file;unicode	channel;int	start;float	...

The second and third lines specifies where the "real" section boundaries
are and what types they are.  For example,

	;;MM sectionTypes	[u'report', u'nontrans', None]
	;;MM sectionBoundaries	[0.0, 425.3, 9999999.0]

means that the first section starts at 0.0 second and its type is
"report" and the second section starts at 425.3 seconds and its type is
"nontrans" and this is the last section (last ones are ignored).

These two lines are optional.  In fact, some transcripts such as
telephone speech transcripts may not include this section information.

Besides these, comments are allowed anywhere in the file except for the
first line.  A comment line starts with ";;" and ignoring this line
doesn't cause any problem to the data integrity with one exception that
lines starting with ";;MM" might have meaningful information as shown in
the section boundary example above.


BNF Style Definition of TDF for LDC Transcripts
-----------------------------------------------

LDCTDF ::= HEADER META BODY

HEADER ::= "file;unicode" TAB "channel;int" TAB "start;float" TAB
           "end;float" TAB "speaker;unicode" TAB "speakerType;unicode" TAB
           "speakerDialect;unicode" TAB "transcript;unicode" TAB
           "section;int" TAB "turn;int" TAB "segment;int" TAB
           "sectionType;unicode" TAB "suType;unicode" NL

META ::= META-LINE | META META-LINE
META-LINE ::= ";;MM" SPC META-NAME TAB META-VALUE NL
META-NAME ::= STR
META-VALUE ::= STR-QUOT | STR-U | INT | FLOAT | LIST

BODY ::= SEGMENT | COMMENT | BODY SEGMENT | BODY COMMENT
SEGMENT ::= CELL-FILE TAB CELL-CH TAB CELL-START TAB CELL-END TAB
            CELL-SPKR TAB CELL-TYP TAB CELL-DIAL TAB CELL-TRANS TAB
            CELL-SEC TAB CELL-TRN TAB CELL-SEG TAB
            CELL-SECTYP TAB CELL-SUTYP NL
COMMENT ::= ";;" NON-MM-STRING NL
CELL-FILE   ::= STR
CELL-CH     ::= INT
CELL-START  ::= FLOAT
CELL-END    ::= FLOAT
CELL-SPKR   ::= STR
CELL-TYP    ::= STR
CELL-DIAL   ::= STR
CELL-TRANS  ::= STR
CELL-SEC    ::= INT
CELL-TRN    ::= INT
CELL-SEG    ::= INT
CELL-SECTYP ::= STR
CELL-SUTYP  ::= STR

LIST ::= "[" LIST-BODY "]"
LIST-BODY ::= STR-QUOT | STR-U | INT | FLOAT |
              LIST-BODY "," STR-QUOT |
              LIST-BODY "," STR-U |
              LIST-BODY "," INT |
              LIST-BODY "," FLOAT

STR ::= (a string that doesn't contain a tab character nor a newline
        character)
STR-QUOT ::= (same as STR except that it is double-quoted (single-quoted)
        and any double (single) quotation mark within the beginning and
        ending double-quotes (single-quotes) should be escaped with '\')
STR-U ::= (same as STR-QUOT except that it begins with 'u')
INT ::= (a string representing an integer)
FLOAT ::= (a string representing a float point number)
NON-MM-STRING ::= (any string that doesn't start with "MM " and doesn't
        contain a newline character)
SPC ::= (space character)
TAB ::= (tab character)
NL ::= (newline character which may be OS-dependent)


References
----------

RTTM Format, http://www.nist.gov/speech/tests/rt/rt2003/fall/docs/RTTM-format-v13.pdf