Switchboard 1, Release 2
Relational Database Tables
Version 2.0

This is a revised version of the original relational database tables that accompany the Switchboard 1 corpus. The original version of these tables had been produced on CD-ROM in 1993 by NIST as part of the original distribution of speech and transcripts, and had been redistributed by LDC as part of its "Intermediate Release" of Switchboard transcripts between 1994 and 1997.

The LDC produced a "Release 2" version of the complete speech data collection on CD-ROM in 1997, which did not include transcripts or database tables. This collection is now available from the LDC corpus catalog (http://www.ldc.upenn.edu/Catalog/, catalog number LDC97S62).

Also, the Institute for Signal and Information Processing (ISIP) at Mississippi State University undertook a complete review and revision of the transcript collection, using the "Release 2" CD-ROM set of speech data as the point of reference. Numerous corrections and improvements were made in the content and segmentation of the speaker turns. The complete set of revised transcripts is now available for free via ftp from the ISIP web site:

  http://www.isip.msstate.edu/projects/switchboard/

But this "re-transcription" project did not extend to checking or correcting the database tables, and the tables are not included in the ISIP distribution of transcripts.

It has long been known that there were problems in the original version of the tables. First, the tables referred to numerous calls that were not part of the published speech and transcript releases. Second, a total of seven calls for which speech and transcripts were published in "Release 2" were not mentioned in the tables. Third, and most serious, a number of entries in the "conv" and "call_con" tables were found to be wrong in terms of the identities that were given for the speakers.
Nearly 200 of the published calls were affected by this problem (nearly 10% of the 2438 calls published in the "Release 2" version of speech and transcripts). In most cases, the errors in the tables involved a reversal of the "Speaker A" and "Speaker B" identities; in a few cases, the speaker-id number was simply wrong (e.g. caused by transposition or substitution of digits).

In order to correct the speaker identification errors, the LDC undertook a project in December 2000 to re-audit the complete set of calls in the "Release 2" collection. Auditors reviewed the data one speaker at a time: for each speaker in the corpus, all conversation sides attributed to that speaker were presented to the auditor, who then listened to segments from each side to determine whether the same voice was heard in each one. (There are relatively few speakers in the collection who completed only one call, and these were not reviewed.)

If the auditor heard a different voice in one of the conversation sides, s/he had the option of checking the opposite side of the affected call to determine whether the target voice was present there. If so, the auditing table was updated to reflect the correct channel assignment for the target speaker. If the target voice was not present in either channel of the call, the original speaker-id entry was revised to "UNKNOWN", and the call was reviewed more carefully at a later stage.

This auditing process repaired all the speaker-id errors due to channel inversion. With help from George Doddington and Alvin Martin at NIST, the remaining cases of "UNKNOWN" speaker-ids were also resolved, and the current release of the tables provides correct speaker identification data for all 2438 calls in the "Release 2" corpus.
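
The per-side decision made during the audit can be summarized in a short Python sketch. This is only an illustration of the rule described above, not the tooling used at the LDC: the SideEntry record, the audit_side function and the three verdict strings are hypothetical names, and the sketch assumes the auditor always checks the opposite channel when the listed one does not match.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SideEntry:
    """One conversation side as listed in a (hypothetical) auditing table."""
    call_id: int
    channel: str      # "A" or "B"
    speaker_id: str   # speaker-id as listed, or "UNKNOWN"


def audit_side(entry: SideEntry, verdict: str) -> SideEntry:
    """Apply the auditor's verdict to one conversation side.

    verdict is an assumed encoding of what the auditor heard:
      "listed"   - the target voice is on the listed channel
      "opposite" - the target voice is on the other channel of the call
      "neither"  - the target voice is on neither channel
    """
    if verdict == "listed":
        return entry  # the entry was already correct
    if verdict == "opposite":
        # Channel inversion: the speaker stays, the channel flips.
        flipped = "B" if entry.channel == "A" else "A"
        return replace(entry, channel=flipped)
    if verdict == "neither":
        # Wrong speaker-id: mark it for closer review at a later stage.
        return replace(entry, speaker_id="UNKNOWN")
    raise ValueError(f"unexpected verdict: {verdict!r}")


# Made-up example values, not taken from the corpus tables.
example = SideEntry(call_id=9999, channel="A", speaker_id="0000")
corrected = audit_side(example, "opposite")
```

In this sketch a channel inversion is fixed immediately, while a wrong speaker-id is only flagged; resolving the flagged entries was a separate step, as noted above.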
In the seven cases where calls in Release 2 were not mentioned in the original tables, some ancillary information was not recoverable from available sources; in particular, some or all of the following pieces are missing:

  - the topic ("IVI") number assigned during the call
  - the phone numbers used by the callers
  - judgments of the original TI transcribers regarding the overall quality of the call

The affected calls are listed here:

  3178  3199  3217  3243  3248  3321  3564

In the "call_con" and "conv" table entries for these calls, the string "UNK" is inserted as a placeholder for each field that is missing. In the "rating" table, which contains the transcriber judgments, there are no entries for these calls.

Another side effect of the re-audit project was the (re)discovery of a few problems affecting the "Release 2" CD-ROM speech publication:

(1) Three speech files were inadvertently omitted from the CD-ROM set; these are now available via anonymous ftp:

      ftp://ftp.ldc.upenn.edu/pub/ldc/data_samples/swb1_r2_sph_patch.tar

    The three files involved are 2289, 4361 and 4379.

(2) In one speech file, 3243, the "B" channel was found to be an identical copy of the "A" channel, and this problem dates back to the original delivery from TI -- i.e. the original sample stream for the "B" channel has never been available. (However, the utterances of the "B" speaker are marginally audible as echo in the two copies provided of the "A" channel, and speaker B's utterances have been transcribed as fully as possible.)

David Graff
LDC
Feb. 26, 2001