Fisher English Training Speech Data, Part 2 -- LDC2005S13
=========================================================

This corpus represents the second half of a collection of conversational telephone speech (CTS) that was created at the LDC during 2003. It contains 5849 audio files, each containing a full conversation of up to 10 minutes. Additional information about the speakers involved and the types of telephones used can be found in the companion text corpus of transcripts (Fisher English Training Text Data, Part 2 -- LDC2005T19).

The first half of the collection (Fisher English Training Speech Data, Part 1) was released by the LDC in 2004 (LDC2004S13 for speech data, LDC2004T19 for transcripts). Taken as a whole, the two parts comprise 11,699 recorded telephone conversations.

The individual audio files are presented in NIST SPHERE format and contain two-channel mu-law sample data; "shorten" compression has been applied to all files. Data collection and transcription were sponsored by DARPA and the U.S. Department of Defense as part of the EARS project for research and development in automatic speech recognition.

File names and directory structure
----------------------------------

Each audio file is identified by a conversation-ID of the following form:

    fe_03_nnnnn

where "nnnnn" is a sequential number starting at 05851 (the Part 1 corpus had ID numbers from 00001 to 05850); the numeric sequence simply represents the relative order in which the calls were recorded. "fe" refers to "Fisher English" (similar CTS collections were done in Arabic and Mandarin Chinese), and "_03_" refers to the 2003 collection phase (a follow-on collection, to be published later, was conducted in 2004).

All audio file names have a final ".sph" extension, indicating the NIST SPHERE file format. (In the companion corpus of text data, each transcript file uses the same file-ID as the corresponding speech file and has a ".txt" extension.)
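As an illustration of the SPHERE format mentioned above, the sketch below parses the plain-text header that begins each ".sph" file. This is a minimal example, not an LDC-supplied tool: the field names used in the example header (sample_rate, channel_count) are typical SPHERE fields and are shown here as assumptions; "shorten"-compressed sample data still requires separate decompression tools.

```python
# Hypothetical sketch: parse the plain-text NIST SPHERE header of a .sph
# file into a dict. SPHERE headers consist of "name -type value" lines
# terminated by "end_head"; the specific field names present may vary.
def parse_sphere_header(header_bytes):
    """Parse a NIST SPHERE header into a dict of field name -> value."""
    fields = {}
    for line in header_bytes.decode("ascii", errors="replace").splitlines():
        if line.strip() == "end_head":
            break                        # end of the header block
        parts = line.split(None, 2)      # name, -type flag, value
        if len(parts) == 3 and parts[1].startswith("-"):
            name, ftype, value = parts
            # "-i" marks an integer field; everything else kept as a string
            fields[name] = int(value) if ftype == "-i" else value
    return fields
```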
In order to keep directory sizes manageable, the calls have been divided into a series of directories containing subsets of 100 files each. Each directory name is simply the first three digits of the five-digit conversation-ID number (e.g. "058/fe_03_05851.sph", etc.), and these subset directories are located under a single "audio" directory (in the companion text corpus, the equivalent series of subset directories is located under a single "trans" directory).

In addition to the "audio" directory containing the speech data files, the top-level directory of each DVD in this set also contains the following:

  - this same README file,
  - a "volume_id.txt" file, containing the current DVD volume-ID,
  - a copy of "volume_id.txt" whose name is the current DVD volume-ID,
  - a complete listing of the Part 2 file inventory: "filetable2.txt"

The two files that provide the disc volume-ID are included for convenience, to make it easier to determine which disc in the set is being viewed.

The "filetable2.txt" file is a space-delimited plain-text table that provides the following information for all calls in the Part 2 corpus (one text line per call):

  Col.#  Contents
  ---------------------------------------------------------
    1    DVD volume-ID (fe_03_p2_sphN, for N = 1 .. 7)
    2    file name (fe_03_nnnnn.sph, for nnnnn = 05851 .. 11699)
    3    A/B spkr sex (mm, mf, fm, ff)

This is a slightly modified version of the "filelist2.tbl" file provided with the companion text corpus of transcript data; it covers only the 5849 files that make up Part 2. The speaker sex information is based on manual audits of the audio files.

Method of data creation
-----------------------

The telephone calls were recorded digitally from a T-1 trunk line that terminates at a host computer at the LDC (the "robot operator"). Over 12,000 speakers had been recruited from around the United States, both native and non-native speakers of English.
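The directory layout and the "filetable2.txt" columns described above can be sketched as follows. The helper functions are my own illustration, not part of the corpus distribution; they assume exactly the naming scheme and three-column row format documented in this README.

```python
# Hypothetical helpers: resolve a Part 2 conversation number to its
# relative path under "audio/", and split one row of filetable2.txt.
def audio_path(conv_num):
    """Map a conversation number (5851..11699) to its relative audio path."""
    file_id = "fe_03_%05d" % conv_num
    subset = file_id.split("_")[-1][:3]  # first three digits of the ID number
    return "audio/%s/%s.sph" % (subset, file_id)

def parse_filetable_row(line):
    """Split one space-delimited row into (volume_id, file_name, spkr_sex)."""
    volume_id, file_name, spkr_sex = line.split()
    return volume_id, file_name, spkr_sex
```

Applied per line of "filetable2.txt", this would yield the disc, file name, and A/B speaker-sex code for each call.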
The vast majority of recruits submitted their demographic and contact information to the LDC via a specialized Fisher enrollment page on the LDC's web server, in response to a wide range of advertising and announcements, both on commercial media and through various internet channels.

The enrollment form requested age, sex, and geographic background, as well as information about the telephone(s) to be used by each recruit; a scheduling grid allowed recruits to indicate the hours and days of the week when they would be able to accept calls from the robot operator. A staff of recruiters at the LDC reviewed each enrollment form submitted via the web, and also fielded phone calls and email from other prospective recruits. Enrollment information went into a relational database for tracking subjects, phones, and all call activity on the robot operator.

The T-1 telephone circuit dedicated to Fisher English collection was configured so that some lines would serve people dialing in to the system, while other lines would be used for dialing out to people according to the hours of availability they provided during enrollment. During the hours when people had indicated availability to receive calls, the robot operator queried the database continuously for available callees and dialed out to multiple people simultaneously, while also accepting dial-ins.

Whenever any two active lines (dial-in or dial-out) reached a point where the callees were ready to proceed with a conversation, the robot operator bridged the two lines, announced the topic of the day to both parties, and began recording by copying the digital mu-law sample data from each line directly to disk files.

At enrollment, each recruit was assigned a unique PIN; subjects who dialed in to the robot operator were required to supply this PIN before being accepted for a recording.
However, each time the system dialed out to a specific PIN selected from the database, the person answering the phone could accept the call and proceed with recording a conversation without supplying a PIN to verify his or her identity.

Avoiding PIN verification on dial-outs was viewed as a worthwhile trade-off. It would introduce some uncertainty regarding speaker characteristics in the recorded calls: the person to whom a given PIN was assigned might not be the one actually recorded in a given call when that PIN was selected for dial-out; but the number of calls where the actual speaker was not the "expected" speaker would be relatively small. Meanwhile, it was judged likely that a large proportion of subjects would not be able to find or recall their assigned PINs when they received a call from the robot operator, which would have severely limited the success of the Fisher collection strategy. The barriers and complications that would have resulted from PIN verification problems on dial-outs would have been unmanageable, owing to the large number of people involved (over 14,000 recruits were enrolled over the course of the collection), the necessarily sparse communications between these people and the LDC recruiting staff, and the fact that a maximum of three successful calls would be collected for each PIN.

On each day of call collection, a single topic of discussion was chosen sequentially from a set of 40 topics prepared in advance; in all calls conducted on a given day, all callees were presented with the same topic. After each cycle of 40 days, the same topic list was repeated. For the most part, people tended to adhere to the suggested topic in their conversations.

After calls were recorded, automatic utterance segmentation was applied to both channels of every file. If the segmentation did not yield at least 5 minutes of detected speech, the call was rejected from further consideration.
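The 40-day topic rotation described above amounts to simple modular selection. The sketch below illustrates that scheduling logic under stated assumptions: the placeholder topic strings are not the actual Fisher prompts, and the function name is my own.

```python
# Hypothetical sketch of the 40-day topic cycle: one topic per collection
# day, chosen sequentially, wrapping back to the first topic after day 40.
# The topic strings are placeholders, not the real Fisher topic list.
TOPICS = ["topic %02d" % i for i in range(1, 41)]  # 40 topics prepared in advance

def topic_of_day(day_index):
    """Return the topic assigned to the given collection day (0-based)."""
    return TOPICS[day_index % len(TOPICS)]
```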
Otherwise, up to four segments of approximately 30 seconds each were automatically selected at intervals throughout each call, and these segments were used for manual audit. LDC staff listened to at least two segments from every call, and the following audit judgments were recorded in the central Fisher database:

  For each channel/speaker:
  - sex
  - approximate age (young adult, adult, or senior)
  - native speaker of American English (or not)

  For each 30-second segment:
  - relative signal quality (poor, fair, good)
  - relative conversation quality (poor, fair, good)

In assessing conversation quality, a segment was marked "poor" if people were talking about the Fisher project itself (as opposed to some other topic), or if people were having a hard time coming up with things to say. Auditors did not have access to the assigned topic for each call, and so could not judge whether speakers were adhering to it; but if a segment showed an engaged discussion about anything other than the Fisher project, it was marked "good" for conversation quality.

Complete tabulations of audit results, together with the speaker demographics and the PIN assignments for each call-side, are provided in the companion text corpus of transcript data.

David Graff
Linguistic Data Consortium
March 2005