Fisher English Training Transcript Data, Part 1 -- LDC2004T19
=============================================================

This corpus represents the first half of a collection of
conversational telephone speech (CTS) that was created at the LDC
during 2003.  It contains transcript data for 5850 complete
conversations, each lasting up to 10 minutes.  In addition to the
transcriptions, which are found under the "trans" directory, there is
a complete set of tables describing the speakers, the properties of
the telephone calls, and the set of topics that were used to initiate
the conversations.

The second half of the collection (Fisher English Training Transcript
Data, Part 2) will be released by the LDC in 2005, and will be
organized as an extension of the Part 1 corpus.  Taken as a whole, the
two parts comprise 11,699 recorded telephone conversations.

Data collection and transcription were sponsored by DARPA and the
U.S. Department of Defense, as part of the EARS project for research
and development in automatic speech recognition.


Properties and types of transcript files
----------------------------------------

Overall, about 12% of the conversations were transcribed at the LDC,
and the rest were done by BBN and WordWave, using a significantly
different approach to the task.  A central goal in both sets was to
maximize the speed and economy of the transcription process, and this
in turn involved forgoing certain aspects of mark-up detail and
quality control that may have been common in previous, smaller corpora.

The LDC transcripts were based on automatic segmentation of the audio
data, to identify the utterance end-points on both channels of each
conversation.  Given these time stamps, manual transcription was
simply a matter of typing in the words for each segment and doing a
rudimentary spell-check.  No attempt was made to modify the
segmentation boundaries manually, or to locate utterances that the
segmenter might have missed.  Portions of speech where the transcriber
could not be sure exactly what was said were marked with double
parentheses -- " (( ... )) " -- and the transcriber could hazard a
guess as to what was said, or leave the region between parentheses
blank.  The LDC transcription process yields one plain-text transcript
file per conversation, in which the first two lines show the call-ID
and the fact that the transcript was done at the LDC; the remainder of
the file contains one utterance per line (with blank lines separating
the utterances), with the start-time, end-time, speaker/channel-ID and
utterance text.  For example, here are the first few lines of an LDC
transcript file:


# fe_03_00001.sph
# Transcribed at the LDC

3.76 5.54 A: and i generally prefer 

5.82 6.48 A: eating at home 

7.92 9.52 B: hi my name is andy 


The time stamps are expressed in seconds from the beginning of the
audio file; the speaker/channel-ID is either "A:" (for channel 1) or
"B:" (for channel 2); the remaining text is mono-case with no
syntactic punctuation.
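
The layout described above can be read with a few lines of code.  The
following is a minimal sketch in Python, not a tool distributed with
the corpus; the names "Utterance" and "parse_transcript" are
hypothetical, and the pattern assumes every utterance line follows the
start/end/speaker/text layout shown in the example.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Utterance(NamedTuple):
    start: float   # seconds from the beginning of the audio file
    end: float     # seconds from the beginning of the audio file
    speaker: str   # "A" (channel 1) or "B" (channel 2)
    text: str      # mono-case utterance text


# start-time, end-time, "A:" or "B:", then the utterance text
_LINE = re.compile(r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+([AB]):\s*(.*)$")


def parse_transcript(lines: Iterable[str]) -> Iterator[Utterance]:
    """Yield one Utterance per utterance line, skipping headers and blanks."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank separator or "# ..." header line
        match = _LINE.match(line)
        if match is None:
            continue  # not in the documented layout; ignored in this sketch
        start, end, speaker, text = match.groups()
        yield Utterance(float(start), float(end), speaker, text.strip())


sample = """\
# fe_03_00001.sph
# Transcribed at the LDC

3.76 5.54 A: and i generally prefer

5.82 6.48 A: eating at home

7.92 9.52 B: hi my name is andy
"""

utterances = list(parse_transcript(sample.splitlines()))
for utt in utterances:
    print(utt.speaker, utt.start, utt.end, utt.text)
```

Lines that do not match the pattern are silently skipped here; a
stricter reader might prefer to raise an error instead.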

The BBN/WordWave approach involved producing a complete manual
transcription of the two-channel conversation first, without assigning
any time stamps to the utterances.  Then the transcription text and
audio data were processed through an automatic speech recognition
system, to do forced alignment of the text with the audio and assign
time stamps to utterances.  (The process is explained in more detail
in the file "bbn_trans_readme.txt", in the "doc" directory.)  The end
result was a set of five text files for each conversation: the
original manual transcript (with no time stamps), and four separate
outputs from the forced-alignment process.  Taken together, the latter
four files contain roughly the same extent of information as the LDC
transcript: time stamps and text for most if not all utterances in the
conversation.

For the current publication, the BBN transcription files are being
provided in two forms: the original structure of five text files per
conversation, and the single-file LDC format.  The "bbn_orig"
directory contains just the transcript data created by BBN, while the
"trans" directory contains the union of BBN and LDC transcripts
(i.e. the entire corpus), rendered entirely in the LDC style.  (Calls
that were transcribed by the LDC will not be found under the
"bbn_orig" directory.)

A major division of transcript data in the BBN structure was between
"auto-segmented" and "rejected" segments (cf. "bbn_trans_readme.txt").
The former showed a good match between manual transcription and forced
alignment recognition, whereas the latter did not.  In coercing these
files into the LDC transcript format, utterances from the "rejected"
set were marked with double parentheses around the utterance text, to
reflect the uncertainty suggested by low alignment scores.  To
illustrate, the following excerpt shows a couple of utterances that
come from the "auto-segmented" set, and one utterance that comes from
the "rejected" set:


# fe_03_00092.sph
# Transcribed by BBN/WordWave

0.00 1.21 B: hello 

0.73 1.93 A: hello 

...

12.39 14.07 B: (( oh yeah )) 


The original files created by the BBN process did not use double
parentheses at all, hence all double-parens in this set of derived
transcript files (under the "trans" directory) have been introduced by
the transformation into LDC format.
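
Because double parentheses mark uncertain text in both the LDC and the
converted BBN transcripts, users may want to isolate or drop those
regions.  The sketch below is one possible way to do that in Python;
the function names are hypothetical, and it assumes that "((" and "))"
always occur as matched pairs within a single utterance line.

```python
import re

# A "(( ... ))" region; the enclosed guess may be empty.
_UNCERTAIN = re.compile(r"\(\(\s*(.*?)\s*\)\)")


def uncertain_spans(text):
    """Return the guesses (possibly empty strings) found inside (( ... ))."""
    return _UNCERTAIN.findall(text)


def strip_uncertain(text):
    """Drop every (( ... )) region and collapse the leftover whitespace."""
    return " ".join(_UNCERTAIN.sub(" ", text).split())


def is_fully_uncertain(text):
    """True when the utterance has no text outside (( ... )) regions."""
    return bool(_UNCERTAIN.search(text)) and strip_uncertain(text) == ""


print(uncertain_spans("(( oh yeah ))"))
print(is_fully_uncertain("(( oh yeah ))"))
print(is_fully_uncertain("hello"))
```

In the "trans" files, an utterance taken from the BBN "rejected" set
would be fully uncertain in this sense, since the whole utterance text
was wrapped in double parentheses during conversion.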


File names and directory structure
----------------------------------

Each file is identified by a conversation-ID in the following form:

  fe_03_nnnnn

where "nnnnn" is a sequential number starting at 00001; the numeric
sequence simply represents the relative order in which the calls were
recorded; "fe" refers to "Fisher English" (similar CTS collections
were done in Arabic and Mandarin Chinese); the "_03_" refers to the
2003 collection phase (a follow-on collection, to be published later,
was conducted in 2004).

In the "trans" directory, all file names have a final ".txt"
extension, indicating the plain-text LDC transcript format (in the
companion corpus of audio data, each speech file uses the same file-ID
as the corresponding transcript file, and has a ".sph" extension).

In order to keep directory sizes manageable, the calls have been
divided into a series of directories containing up to 100 calls each.
Each directory name is simply the first three digits of the five-digit
conversation-ID number (e.g. "000/fe_03_00001.txt");
these subset directories are found directly under both "trans" and
"bbn_orig" (under "bbn_orig", each "nnn" directory contains the three
file-type directories, "auto-segmented", "rejected" and "originals").
In the companion audio corpus, the equivalent series of subset
directories are located under a single "audio" directory.
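
The naming scheme makes it straightforward to compute the location of
a transcript from its sequence number.  A minimal Python sketch
follows; the function name "transcript_path" is hypothetical, and the
path is built relative to the top-level directory of the corpus.

```python
from pathlib import PurePosixPath


def transcript_path(call_number: int, root: str = "trans") -> PurePosixPath:
    """Build "<root>/nnn/fe_03_nnnnn.txt" for a sequential call number."""
    if not 1 <= call_number <= 99999:
        raise ValueError("call number must fit in five digits, starting at 1")
    nnnnn = f"{call_number:05d}"          # zero-padded, five digits
    subset = nnnnn[:3]                    # first three digits name the subset
    return PurePosixPath(root) / subset / f"fe_03_{nnnnn}.txt"


print(transcript_path(1))
print(transcript_path(92))
```

The same three-digit subset rule applies under "bbn_orig", but the
file names and extensions used there are not covered by this sketch.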

In addition to the "trans" and "bbn_orig" directories,
the top-level directory also contains a "doc" directory (for
documentation and tables) and an "index.htm" file.


Overview of database tables
---------------------------

The Fisher telephone collection was driven by a relational database
that kept track of speakers, telephone information, and details on
each successful call.  The "doc" directory provides three tables drawn
from this database, together with a text file describing the contents
of each table; these are described briefly below:

1.  fe_03_p1_filelist.tbl (cf. doc_filelist_tbl.txt)

This is a tab-delimited plain-text table, with one row for each call
in the Fisher English Part 1 corpus.  The columns indicate the call-ID,
the volume-ID for the DVD in the corresponding Speech corpus that
contains the audio for the call, the gender of the two speakers
involved in the call, and where the call was transcribed.

2.  fe_03_p1_calldata.tbl (cf. doc_calldata_tbl.txt)

This is a comma-separated-value (CSV) plain-text table, with one row
for each call in the Part 1 corpus.  (The first row provides column
labels.)  The columns indicate call-ID, date and time of the call,
topic-ID used in the call, and details about the speakers and phones
used.

3.  fe_03_pindata.tbl (cf. doc_pindata_tbl.txt)

This is a CSV plain-text table, with one row for each speaker who
participated in a Fisher English Training call.  (The first row
provides column labels.)  The columns indicate speaker-ID, demographic
information as provided by the speaker, and a list of call-sides in
which the speaker is represented.  (For speakers who occur in more
than one call, the call-sides in the last field are separated by
semicolons.)

For reasons discussed in the next section, the speaker information
in the "pindata" table might not reflect the properties of the voice
that was actually recorded in a given call.  For example, a particular
speaker-ID might have been assigned to a man, but when the Fisher
collection system dialed out to the phone number for that speaker, a
woman answered and completed the process of recording a conversation.

In order to reduce the amount of uncertainty in the database tables
being presented with the corpus, the "calldata" table presents only
the information derived from manual audits of each call.  (The audit
process is described in the next section, and details about how the
audit information is presented in the "calldata" table are described
in the associated "doc" file.)  In the "pindata" table, you'll find
the demographic data provided by each speaker during the recruiting
process (including gender); combined with this, in the final field of
"pindata", you'll find the list of calls that involved the PIN
assigned to the speaker, including call-ID, channel (A or B) and the
gender/dialect information from manual audits.  Here is an example of
a row taken from "pindata" that demonstrates a discrepancy between
stated demographics and audit results:

2637,F,54,16,English,WA,2,06972_A/m.a;08175_A/f.a

This row describes speaker PIN 2637, who registered as a 54-year-old
female with 16 years of education, a native speaker of American
English raised in Washington state; two calls were recorded involving
this PIN, one of which (call-ID 06972, channel A) was found to contain
a male voice (also a native speaker of American English).  We can't
infer any other demographic information about the male speaker (except
his approximate age as determined by manual audit, to be found in the
"calldata" table for call-ID 06972); conceivably, the female speaker
in call-ID 08175 (channel A) might also be different from the
54-year-old woman who registered to participate in the collection, but
it's probably safe to assume that most of the demographic data is
applicable, given the matching gender.
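
Discrepancies of this kind can be found mechanically.  The Python
sketch below is illustrative only: it assumes that each item in the
final field has the form "callid_channel/sex.dialect", as inferred
from the example row above, and that the registered sex is the second
field of the row.  The names "CallSide", "parse_call_sides" and
"mismatched_sides" are hypothetical; "doc_pindata_tbl.txt" remains the
authoritative description of the table.

```python
import csv
from typing import List, NamedTuple


class CallSide(NamedTuple):
    call_id: str          # e.g. "06972"
    channel: str          # "A" or "B"
    audited_sex: str      # as recorded by the manual audit, e.g. "m" or "f"
    audited_dialect: str  # audit dialect code, left uninterpreted here


def parse_call_sides(field: str) -> List[CallSide]:
    """Split the final pindata field into one CallSide per call-side."""
    sides = []
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue
        side, _, audit = item.partition("/")       # "06972_A", "m.a"
        call_id, _, channel = side.partition("_")  # "06972", "A"
        sex, _, dialect = audit.partition(".")     # "m", "a"
        sides.append(CallSide(call_id, channel, sex, dialect))
    return sides


def mismatched_sides(row: str) -> List[CallSide]:
    """Return call-sides whose audited sex differs from the registered sex."""
    fields = next(csv.reader([row]))
    registered_sex = fields[1].lower()  # assumption: second field holds sex
    return [side for side in parse_call_sides(fields[-1])
            if side.audited_sex.lower() != registered_sex]


row = "2637,F,54,16,English,WA,2,06972_A/m.a;08175_A/f.a"
print(mismatched_sides(row))
```

For the example row, only the call-side from call-ID 06972 is
reported, matching the discrepancy described above.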

Note that the "pindata" table contains information on all the speakers
in the entire Fisher English collection, drawn from all 11,699 calls
comprising both Part 1 (the 2004 release) and Part 2 (the 2005
release).  As a result, roughly half of the call-IDs referenced in
this table will not be found in the Part 1 corpus.  The "calldata" and
"filelist" tables reflect only the contents of the Part 1 corpus.


Method of data creation
-----------------------

The telephone calls were recorded digitally from a T-1 trunk line that
terminates at a host computer at the LDC (the "robot-operator").  Over
12,000 speakers had been recruited from around the United States, both
native and non-native speakers of English.  The vast majority of
recruits submitted their demographic and contact information to the
LDC via a specialized Fisher enrollment page on the LDC's web server,
in response to a wide range of advertising and announcements, both on
commercial media and through various Internet channels.

The enrollment form requested age, sex, and geographic background, as
well as information about the telephone(s) to be used by each recruit;
a scheduling grid was provided to allow recruits to indicate hours and
days of the week when they would be able to accept calls from the
robot operator.  A staff of recruiters at the LDC reviewed each of the
enrollment forms submitted via the web, and also fielded phone calls
and email from other prospective recruits.  Enrollment information
went into a relational database for tracking subjects, phones, and all
call activity on the robot operator.

The T-1 telephone circuit dedicated to Fisher English collection was
configured so that some lines would service people who dialed in to
the system while other lines would be used for dialing out to people
according to their hours of availability, as provided in the
enrollment process.

During the hours when people had indicated availability to receive
calls, the robot operator queried the database continuously for
available callees and dialed out to multiple people simultaneously,
while also accepting dial-ins.  Whenever any two active lines (dial-in
or dial-out) reached a point where the participants were ready to
proceed with a conversation, the robot operator bridged the two lines,
announced the topic of the day to both parties, and began recording by
copying the digital mu-law sample data from each line directly to disk
files.

At enrollment, each recruit was assigned a unique PIN; subjects who
dialed in to the robot operator were required to supply this PIN
before being accepted for a recording.  However, each time the system
dialed out to a specific PIN selected from the database, the person
answering the phone could accept the call, and proceed with recording a
conversation, without supplying a PIN to verify his or her identity.

Avoiding PIN verification on dial-outs was viewed as a worthwhile
trade-off.  It would introduce some amount of uncertainty regarding
speaker characteristics in the recorded calls: the person to whom a
given PIN was assigned might not be the one who was actually recorded
in a given call when that PIN was selected for dial-out; but the
number of calls where the actual speaker was not the "expected"
speaker would be relatively small.  Meanwhile, the likelihood would be
relatively high that a large proportion of subjects would not be able
to find or recall their assigned PINs when they received a call from
the robot operator, and this would severely limit the success of the
Fisher collection strategy.  The barriers and complications that would
result from PIN verification problems on dial-outs would have been
unmanageable, owing to the large number of people involved (over
14,000 recruits were enrolled over the course of the collection), the
necessarily sparse communications between these people and the LDC
recruiting staff, and the fact that a maximum of three successful
calls would be collected for each PIN.

On each day of call collection a single topic of discussion was chosen
sequentially from a set of 40 topics prepared in advance, and in all
calls conducted on a given day, all callees would be presented with
the same topic.  After each cycle of 40 days, the same topic list was
repeated.  For the most part, people tended to adhere to the suggested
topic in their conversations.
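
The topic rotation amounts to simple modular arithmetic over days of
collection.  The Python sketch below illustrates the idea only; it
counts collection days from zero and returns a zero-based position in
the topic list, both of which are assumptions made for illustration
rather than details of the Fisher database.

```python
TOPIC_COUNT = 40  # size of the prepared topic list


def topic_position(collection_day: int) -> int:
    """Zero-based position in the topic list for a zero-based collection day."""
    if collection_day < 0:
        raise ValueError("collection_day must be non-negative")
    return collection_day % TOPIC_COUNT


print(topic_position(0), topic_position(39), topic_position(40))
```

The topic actually used in a given call should always be taken from
the "calldata" table rather than reconstructed in this way.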

After calls were recorded, automatic utterance segmentation was
applied to both channels of every file.  If the segmentation did not
yield at least 5 minutes of detected speech, the call was rejected
from further consideration.  Otherwise, up to four segments of
approximately 30 seconds each were automatically selected at intervals
throughout each call, and these segments were used for manual audit.
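
The rejection rule can be sketched as a sum over segment durations.
The Python example below is an assumption-laden illustration: it
treats the 5-minute minimum as a total over both channels, which the
description above does not specify, and the function names are
hypothetical.  It does not attempt to reproduce how the audit segments
were selected.

```python
MIN_SPEECH_SECONDS = 5 * 60  # "at least 5 minutes of detected speech"


def detected_speech_seconds(segments):
    """Sum the durations of (start, end) segments, in seconds."""
    return sum(max(0.0, end - start) for start, end in segments)


def passes_speech_threshold(channel_a, channel_b, minimum=MIN_SPEECH_SECONDS):
    """True if the detected speech on both channels reaches the minimum.

    Assumption: the minimum applies to the two channels combined.
    """
    total = detected_speech_seconds(channel_a) + detected_speech_seconds(channel_b)
    return total >= minimum


print(passes_speech_threshold([(0.0, 200.0)], [(10.0, 110.0)]))
print(passes_speech_threshold([(0.0, 100.0)], [(10.0, 110.0)]))
```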

LDC staff listened to at least two segments from every call, and the
following audit judgments were recorded in the central Fisher
database:

 For each channel/speaker:
    - sex
    - approximate age (young adult, adult or senior)
    - native speaker of American English (or not)

 For each 30-second segment:
    - relative signal quality (poor, fair, good)
    - relative conversation quality (poor, fair, good)

In assessing conversation quality, a segment was marked "poor" if
people were talking about the Fisher project (as opposed to some other
topic), or if people were having a hard time coming up with things to
say.  Auditors did not have access to information about what the
assigned topic was for each call, and so could not judge whether
speakers were adhering to the assigned topic, but if a segment showed
an engaged discussion about anything other than the Fisher project, it
was marked "good" for conversation quality.

The file "doc_calldata_tbl.txt" describes how the audit information is
presented in the associated database table file.

David Graff
Linguistic Data Consortium
December, 2004