-----------------------------------------------------------
Description of the CallFriend telephone speech corpus
for American English
-----------------------------------------------------------
July, 1997

CONTENTS

1. Summary abstract
2. Data acquisition
3. Data verification
4. Speaker demographics
5. Dialect Audit

-----------------------------------------------------------------------
1. Summary abstract

The CallFriend American English corpus of telephone speech was
collected by the Linguistic Data Consortium primarily in support of
the project on Language Identification (LID), sponsored by the U.S.
Department of Defense.

This release of the CallFriend American English corpus consists of 60
unscripted telephone conversations between native speakers of English
for each dialect group. The recorded conversations last up to 30
minutes.

All speakers were aware that they were being recorded. They were
given no guidelines concerning what they should talk about. Once a
caller was recruited to participate, he/she was given a free choice of
whom to call. Most participants called family members or close
friends. All calls originated in the United States.

-----------------------------------------------------------------------
2. Data acquisition

Speakers were solicited by the LDC to participate in this telephone
speech collection effort via the internet, publications
(advertisements), and personal contacts. A total of 100 call
originators were found per dialect, each of whom placed a telephone
call via a toll-free robot operator maintained by the LDC. Access to
the robot operator was possible via a unique Personal Identification
Number (PIN) issued by the recruiting staff at the LDC when the caller
enrolled in the project. The participants were made aware that their
telephone call would be recorded, as were the call recipients. The
call was allowed only if both parties agreed to being recorded. Each
caller was allowed to talk up to 30 minutes.
Upon successful completion of the call, the caller was paid $20 (in
addition to making a free long-distance telephone call). Each caller
was allowed to place only one telephone call.

-----------------------------------------------------------------------
3. Data verification

After a successful call was completed, a human audit of each telephone
call was conducted to verify that the proper language was spoken, and
to check the quality of the recording. The information from this
audit may be found in the file "callinfo.tbl", and its contents are
described in greater detail in "callinfo.txt".

-----------------------------------------------------------------------
4. Speaker demographics

Information on speaker demographics can be found in the file
"spkrinfo.tbl", whose contents are described in the file
"spkrinfo.txt".

-----------------------------------------------------------------------
5. Dialect Audit

A second audit was conducted by a native speaker familiar with dialect
variation in American English. Conversations were labeled as either
"southern" or "non-southern" based on particular attributes in the
speech of the participants. Except as noted below, the files
presented in each dialect category are those in which the participants
on both sides of the call used the same dialect.

Callers in the "southern" collection of CallFriend American English
were identified primarily on the basis of vowel quality patterns that
are common among native speakers raised in the southeastern United
States (from Texas eastward to the Atlantic coast, and from Virginia
and Kentucky southward to the Gulf of Mexico).

This category also includes a small number of African-American
speakers, whose geographic origins may be more dispersed, but who
share some of the vowel quality patterns distinctive of southern white
speakers. (Of course, other dialect features, involving phonology,
syntax and prosody, serve to differentiate these two subgroups within
the "southern" collection.)
In terms of their distribution in the corpus, the African-American
speakers happen to occur in only 8 of the calls, and all of these are
in the "training" partition. In one of these 8 calls, en_4843, only
one speaker is African-American, and the other is a native New Yorker
of Hispanic descent.

Callers in the "non-southern" (or "general") collection of CallFriend
American English appear to come from a wide geographic range, based on
their own reports of where they were raised. (Some identified their
origins as being in the southeastern U.S.) Regardless of their
geographic or ethnic backgrounds, the feature they share is the clear
absence of a vowel quality pattern that would distinguish them as
speakers of a "southern" dialect.

-------------