Switchboard 1, Release 2

		      Relational Database Tables
			     Version 2.0


This is a revised version of the original relational database tables
that accompany the Switchboard 1 corpus.  The original version of
these tables had been produced on CD-ROM in 1993 by NIST as part of
the original distribution of speech and transcripts, and had been
redistributed by LDC as part of its "Intermediate Release" of
Switchboard transcripts between 1994 and 1997.

The LDC produced a "Release 2" version of the complete speech data
collection on CD-ROM in 1997, which did not include transcripts or
database tables.  This collection is now available from the LDC corpus
catalog (http://www.ldc.upenn.edu/Catalog/, catalog number LDC97S62).

Also, the Institute for Signal and Information Processing (ISIP) at
Mississippi State University undertook a complete review and revision
of the transcript collection, using the "Release 2" CD-ROM set of
speech data as the point of reference.  Numerous corrections and
improvements were made in the content and segmentation of the speaker
turns.  The complete set of revised transcripts is now available for
free via ftp from the ISIP web site:

    http://www.isip.msstate.edu/projects/switchboard/

But this "re-transcription" project did not extend to checking or
correcting the database tables, and the tables are not included in the
ISIP distribution of transcripts.

It has long been known that there were problems in the original
version of the tables.  First, the tables referred to numerous calls
that were not part of the published speech and transcript releases.
Second, a total of seven calls for which speech and transcripts were
published in "Release 2" were not mentioned in the tables.

Third, and most serious, a number of entries in the "conv" and
"call_con" tables were found to be wrong in terms of the identities
that were given for the speakers.  Nearly 200 of the published calls
were affected by this problem (nearly 10% of the 2438 calls published
in the "Release 2" version of speech and transcripts).  In most cases,
the errors in the tables involved a reversal of the "Speaker A" and
"Speaker B" identities; in a few cases, the speaker-id number was
simply wrong (e.g. caused by transposition or substitution of digits).

In order to correct the speaker identification errors, the LDC
undertook a project in December 2000 to re-audit the complete set of
calls in the "Release 2" collection.  Auditors reviewed the data one
speaker at a time; for each speaker in the corpus, all conversation
sides attributed to that speaker were presented to the auditor, who
then listened to segments from each side to determine whether the same
voice was heard in each one.  (There are relatively few speakers in
the collection who completed only one call, and these were not
reviewed.)  If the auditor heard a different voice in one of the
conversation sides, s/he had the option of checking the opposite side
of the affected call, to determine if the target voice was present
there; if so, the auditing table was updated to reflect the correct
channel assignment for the target speaker; if the target voice was not
present in either channel of the call, the original speaker-id entry
was revised to "UNKNOWN", and the call was reviewed more carefully at
a later stage.

This auditing process repaired all the speaker-id errors due to
channel inversion.  With help from George Doddington and Alvin Martin
at NIST, the remaining cases of "UNKNOWN" speaker-ids were also
resolved, and the current release of the tables provides correct
speaker identification data for all 2438 calls in the "Release 2"
corpus.

In the seven cases where calls in Release 2 were not mentioned in the
original tables, some ancillary information was not recoverable from
available sources; in particular, some or all of the following pieces
are missing:

    - the topic ("IVI") number assigned during the call
    - the phone numbers used by the callers
    - judgments of the original TI transcribers regarding the overall
		quality of the call

The affected calls are listed here:

3178
3199
3217
3243
3248
3321
3564

In the "call_con" and "conv" table entries for these calls, the string
"UNK" is inserted as a place holder for each field that is missing.
In the "rating" table, which contains the transcriberjudgements, there
are no entries for these calls.

Another side effect of the re-audit project was the (re)discovery of a
few problems affecting the "Release 2" CD-ROM speech publication:


(1) Three speech files were inadvertently omitted from the CD-ROM set;
    these are now available via anonymous ftp:

    ftp://ftp.ldc.upenn.edu/pub/ldc/data_samples/swb1_r2_sph_patch.tar

    The three files involved are 2289, 4361, 4379


(2) In one speech file, 3243, the "B" channel was found to be an
    identical copy of the "A" channel, and this problem dates back to
    the original delivery from TI -- i.e. the original sample stream
    for the "B" channel has never been available.  (However, the
    utterances of the "B" speaker are marginally audible as echo in
    the two copies provided of the "A" channel, and speaker B's
    utterances have been transcribed as fully as possible.)


David Graff
LDC
Feb. 26, 2001