Foreign Accented English Corpus
Release 1.2
Center for Spoken Language Understanding
UPDATED: 3 June 2002

Overview
--------
The Foreign Accented English (FAE) corpus consists of American English
utterances by non-native speakers. The corpus contains 4925 telephone
quality utterances from native speakers of 22 languages. Three
independent judgements of accent were made on each utterance by native
American English speakers.

Recording Conditions
--------------------
The data were collected with the CSLU T1 digital data collection system.
The sampling rate was 8 kHz and the files were stored in 8-bit mu-law
format. (These files have since been converted to the RIFF standard file
format, which is 16-bit linearly encoded.)

File Naming Conventions
-----------------------
Each utterance is stored in an individual file, whose name indicates the
language and session number of the caller. For example:

    FAR00100.wav

The leading "F" specifies that the file is part of the FAE corpus. The
next two letters, "AR" in this case, indicate the native language of the
speaker. The final 5 digits represent the session number that was
assigned during recording. The "wav" extension indicates that this is a
speech file. If the file has a corresponding information file (see the
Verification section below), that file has the same name but with an
"inf" extension instead of "wav".

Table of Languages
------------------
Native speakers of the following languages are represented in this
corpus:

    AR  Arabic
    BP  Brazilian Portuguese
    CA  Cantonese
    CZ  Czech
    FA  Farsi
    FR  French
    GE  German
    HI  Hindi
    HU  Hungarian
    IN  Indonesian
    IT  Italian
    JA  Japanese
    KO  Korean
    MA  Mandarin
    MY  Malay
    PO  Polish
    PP  Iberian Portuguese
    RU  Russian
    SD  Swedish
    SP  Spanish
    SW  Swahili
    TA  Tamil
    VI  Vietnamese

Speech File Formats
-------------------
The speech files in this corpus are stored in the RIFF standard file
format. This file format is 16-bit linearly encoded.
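
As an illustration of the file naming convention described above, the
following Python sketch splits a file name into its language code and
session number. It is not part of the corpus distribution. The names
parse_fae_name, FaeName and LANGUAGES are hypothetical, and treating the
leading "F" as optional is an assumption, made because the examples in
this document show names both with and without it.

```python
import re
from typing import NamedTuple, Optional

# Two-letter codes copied from the Table of Languages in this README.
LANGUAGES = {
    "AR": "Arabic", "BP": "Brazilian Portuguese", "CA": "Cantonese",
    "CZ": "Czech", "FA": "Farsi", "FR": "French", "GE": "German",
    "HI": "Hindi", "HU": "Hungarian", "IN": "Indonesian", "IT": "Italian",
    "JA": "Japanese", "KO": "Korean", "MA": "Mandarin", "MY": "Malay",
    "PO": "Polish", "PP": "Iberian Portuguese", "RU": "Russian",
    "SD": "Swedish", "SP": "Spanish", "SW": "Swahili", "TA": "Tamil",
    "VI": "Vietnamese",
}

# Optional corpus prefix "F", two-letter language code, five-digit
# session number, optional "wav" or "inf" extension.
# Assumption: the "F" prefix may be absent, as in some examples here.
_NAME_RE = re.compile(r"F?([A-Z]{2})(\d{5})(?:\.(wav|inf))?")


class FaeName(NamedTuple):
    code: str                 # two-letter language code, e.g. "AR"
    language: Optional[str]   # language name, or None if code is unknown
    session: int              # session number assigned during recording
    extension: Optional[str]  # "wav", "inf", or None if no extension


def parse_fae_name(name: str) -> FaeName:
    """Split an FAE file name into language code and session number."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an FAE file name: {name!r}")
    code, session, extension = match.groups()
    return FaeName(code, LANGUAGES.get(code), int(session), extension)


if __name__ == "__main__":
    print(parse_fae_name("FAR00100.wav"))
```

Because the code and session fields have fixed widths, the optional
prefix does not make codes that begin with "F" (FA, FR) ambiguous.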

Verification
------------
Some of the files in this corpus are also included in the CSLU 22
Language Speech corpus. Those files have been verified by a native
speaker of the language. A variety of information about the speaker was
collected into an "info" file. Info files exist for only 1785 of the
calls, because native speakers have not yet screened all of the calls.
As an example, these are the contents of AR00145.inf:

    145 general dialect bahrain
    145 general gender male
    145 general age adult
    145 general connection good
    145 general intelligibility good

The first field is the call number, the second is the comment category
(all are "general"), the third field names the item being reported, and
the final field is the value of that item. Thus this file tells us that
the speaker is an adult male who speaks the Bahrain dialect of Arabic.
We can also see that the connection (line) quality and the speaker
intelligibility were both good.

Accent Judgements
-----------------
Three native speakers of American English independently listened to each
utterance. They judged the accent on a 4-point scale, according to the
following guidelines.

    1. Negligible/No Accent: Not accented at all, or difficult to
       determine if there is even an accent present.
    2. Mild Accent: Accent can be heard through most of the speech, but
       does not hinder understanding.
    3. Strong Accent: The accent is strong in all speech, and makes
       understanding difficult.
    4. Very Strong Accent: Intelligibility is hindered, and multiple
       listenings were necessary to understand the speaker.

The accent judgements were based solely on the phonetic variation caused
by the foreign language influence. They were not based on improper
grammar or word choice.

Error Checking
--------------
A list of all calls that were judged "1" by one judge and "4" by another
was generated, and these conflicts were checked by one of the judges.
During this phase, judges could only change their own incorrect
judgements. If a judge was not available to check their side of a "1/4"
conflict, then the utterance was excluded from the corpus. A total of 29
utterances were excluded from the corpus for this reason. A "-" in place
of an accent judgement means that the utterance was not heard by that
judge.

The judgement information is located in the file called judge.db in the
doc directory. The file contains one line for each utterance in the
corpus, with the name of the file followed by the three accent
judgements. The file format is:

    AR00145 3 2 3

This example tells us that judges one and three felt that the speaker
had a strong (3) accent, while judge two felt that the accent was
mild (2).

Confusion Matrices
------------------
We generated the following confusion matrices to show the agreement
between the three judges based on language.

    Judge 1 vs. Judge 2 (1vs2.txt)
    Judge 1 vs. Judge 3 (1vs3.txt)
    Judge 2 vs. Judge 3 (2vs3.txt)
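
As an illustration of the judge.db format and the "1/4" conflict check
described under Error Checking, the following Python sketch parses
judgement lines and flags utterances that one judge rated 1 and another
rated 4. It is not part of the corpus distribution. The function names
are hypothetical. It assumes that fields are separated by whitespace and
that every line holds a file name followed by exactly three judgement
columns, each a digit from 1 to 4 or "-".

```python
from typing import Dict, Iterable, Optional, Tuple

Scores = Tuple[Optional[int], ...]


def parse_judge_line(line: str) -> Tuple[str, Scores]:
    """Parse one line such as 'AR00145 3 2 3' into (name, scores).

    A "-" (utterance not heard by that judge) is returned as None.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected a name and 3 judgements: {line!r}")
    name, raw_scores = fields[0], fields[1:]
    scores = []
    for value in raw_scores:
        if value == "-":
            scores.append(None)
        elif value in ("1", "2", "3", "4"):
            scores.append(int(value))
        else:
            raise ValueError(f"bad judgement {value!r} in line: {line!r}")
    return name, tuple(scores)


def parse_judgements(lines: Iterable[str]) -> Dict[str, Scores]:
    """Parse many lines, skipping blank ones, into {name: scores}."""
    table = {}
    for line in lines:
        if line.strip():
            name, scores = parse_judge_line(line)
            table[name] = scores
    return table


def is_extreme_conflict(scores: Scores) -> bool:
    """True when one judge gave 1 and another gave 4."""
    heard = {s for s in scores if s is not None}
    return 1 in heard and 4 in heard


if __name__ == "__main__":
    # The first line is the example from this README; the second is a
    # made-up line used only to demonstrate the conflict check.
    sample = ["AR00145 3 2 3", "XX00000 1 - 4"]
    for name, scores in parse_judgements(sample).items():
        print(name, scores, is_extreme_conflict(scores))
```

To process the real file, the lines of doc/judge.db could be passed to
parse_judgements in place of the sample list.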