Articulation Index Corpus

(Please see readme.txt for a rough sketch of the DVD contents.)

Introduction

The Articulation Index Corpus was partly inspired by the work of Harvey Fletcher, who conducted a number of perceptual experiments involving English syllables during the first half of the 20th century. His term "articulation index" meant something like "perceptual index of syllables", where those syllables were not necessarily words, and reflected how well listeners could correctly identify syllables in the presence of noise. This corpus was created to facilitate similar experiments, and potentially to support new methods in speech recognition research.

The basic concept behind the corpus is to record speakers pronouncing syllables of English, some of which may be real words, but most of which are nonsense syllables. The goal was to have each speaker say a set of 2000 syllables common to all speakers, as well as a set of 20 syllables unique to that speaker. This goal was nearly met, but not precisely; see the Syllable Inventories section below for details.

Syllable Selection

The darpabet was chosen as the representation for syllables; doc/darpabet.txt describes the subset of the darpabet used in this corpus. The syllables were selected in two ways. First, all diphone (CV, VC) syllables that were considered valid English syllables were included in the common set (see below regarding "validity"). These syllables account for over 600 of the 2000. Second, the remaining syllables in the common set, as well as those in the speaker-unique sets, were chosen based on a word frequency table of Switchboard 1. Phonetic representations for the words were found by lookup in PRONLEX, and each word's frequency was added to the overall syllable frequency of each of the word's component syllables. The syllables for the common set and the unique sets were then chosen from the top of this list (the common set coming from higher in the list than the unique sets). (A sketch of this computation appears at the end of this section.) This means that, while English syllable frequency may be difficult to define or calculate, the selection of syllables for this corpus correlates to a significant extent with the observed frequency of syllables in casual conversation (as represented in Switchboard 1). The syllables chosen by frequency are all triphones (CVC, CCV, VCC).

Obviously "valid syllable" is a debatable concept. I won't go into full detail about what I considered valid, but two choices in particular should be noted: syllables with schwa as their nucleus were not selected, while syllables with syllabic "r" as their nucleus were. The exclusion of schwa was based on its degenerate phonological status; that is, it is not a distinctive vowel in English, and its distribution is governed by stress, which was not varied in this corpus. Syllabic "r", on the other hand, is arguably a true vowel, based on its distribution and phonetics.
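As a rough illustration of the frequency-based ranking described above, here is a minimal Python sketch. It is illustrative only: the actual Switchboard 1 frequency table and PRONLEX lookup are not reproduced, so the word-to-frequency and word-to-syllables mappings are assumed to be already in hand.

    # A minimal sketch of the frequency-based syllable ranking,
    # assuming the word frequencies and syllabified pronunciations
    # have already been loaded.
    from collections import defaultdict

    def rank_syllables(word_freqs, word_syllables):
        """word_freqs: {word: Switchboard 1 frequency};
        word_syllables: {word: list of component darpabet syllables}."""
        syl_freq = defaultdict(int)
        for word, freq in word_freqs.items():
            for syl in word_syllables.get(word, []):
                # Each word's frequency is added to the count of
                # each of its component syllables.
                syl_freq[syl] += freq
        # Most frequent first; the common set came from higher in
        # this ranking than the speaker-unique sets.
        return sorted(syl_freq, key=syl_freq.get, reverse=True)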
Recording Procedure

The recordings were made in a small, sound-treated, anechoic room at the LDC. The speakers wore two microphones during the recordings: the first a Sennheiser HMD 410 headset, the second a Nortel Liberator wireless phone headset. The former's signal passed through a Symetrix 302 Dual Microphone Preamp, a Sony PCM-R300 DAT deck, and a Townshend Datlink, then to a Sun Sparcserver 20, where it was written to (network) disk as 16 kHz, 16-bit PCM data. The latter's signal was transmitted to a wireless base station attached to a telephone, which connected via the telephone network to LDC's telephone recording platforms, where the digital data was captured to disk as 8 kHz, 8-bit u-law data.

The speakers were prompted via a computer interface that displayed one prompt at a time and allowed the speaker to step through the prompts by pressing a "next" button. The task for each speaker was to read and say all of their prompts; this task was divided into multiple recording sessions, as the whole task would take 2-4 hours, depending on the speaker's speech rate, error rate, and so on. More specifically, the task was to pronounce all of their prompts correctly (where "correctly" was a matter of judgment); prompt recordings deemed incorrect, or otherwise problematic, were re-recorded. These "redo" sessions were a combination of normal sessions and "assisted" sessions, in which a facilitator sat in the sound booth with the speaker to guide their pronunciation.

Most recording sessions were 15 minutes long, with the prompting program timing out after that point. Speakers sometimes did more than one session in a day, or more than one session in a row, depending on their availability and how tedious they found the task. Initially some sessions ran for 30 or even 60 minutes, but the session length was quickly reduced to 15 minutes, since speakers generally became tired after even 15 minutes.

Presentation of Syllables via Prompts

It was deemed sufficient to collect a single token of each syllable a speaker was to say. However, given the differing research goals that end users might have, a "single token" was defined to be a phrase containing the syllable, plus the same syllable spoken in isolation. Each prompt was therefore of the form "I say blah now, blah", where blah represents the nonsense syllable. The prompting program created these prompts on the fly: it chose the syllable from a predetermined randomized list, unique to each speaker, and chose three carrier words, inserting them all into the template above. The template corresponded, more or less, to a "subject verb SYL adverb, SYL" sequence, where the first syllable position can be considered an "object" position. The subject, verb, and adverb words were chosen randomly from sets of words appropriate for each position. This method created prompts with enough syntactic and semantic coherence to allow the speaker to pronounce them fluently. The speakers were instructed to say the phrase fluently, but to pause at the comma so that the second occurrence was truly isolated. Generally, the latter rule was enforced, but the former was not (see below).

The file carrier.txt contains the sets of carrier words that were used. These words don't represent the actual prompts precisely, due to changes made along the way, but they are very close. The directory prompts/ contains one file per speaker, each file containing the actual prompts used for that speaker. These lists are over 99% accurate; there are occasional cases where the prompt doesn't represent the actual words spoken for that syllable. In these files, each line contains the syllable, followed by the "|" character, followed by the actual text of the prompt. (A sketch of reading these files appears at the end of this section.) The syllables in the prompts were represented with real English words when they happened to be words; when they weren't, they were often represented with words containing parenthesized letters, which were to be interpreted as silent. Various other special representations were used to help elicit the correct syllable; their meaning should be obvious to the user of the corpus, given the intended syllable, and they are not worth making explicit here.
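Since the prompts/ files use the simple "syllable|prompt" format just described, reading one is straightforward. The following sketch is illustrative only; it assumes one line per syllable, and the per-speaker filenames are assumptions rather than documented names.

    # Sketch of reading a file from the prompts/ directory: each line
    # is the darpabet syllable, a "|", and the prompt text actually
    # used (one line per syllable is assumed here).
    def read_prompts(path):
        prompts = {}
        with open(path) as f:
            for line in f:
                syllable, _, text = line.rstrip('\n').partition('|')
                prompts[syllable] = text
        return prompts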
Auditing and Segmentation

Auditing and manual segmentation were performed on the recordings. In what follows, I use the word "prompt" to mean the recorded data elicited by a prompt, not the prompt itself. Timestamps were used to mark the beginning and end of each prompt, as well as to separate the phrase from the isolated syllable. These timestamps were then used to divide the audio data into individual files, each containing either a single phrase or a single isolated syllable.

A decision was also made as to the validity of each prompt, with invalid ones being added to the end of a speaker's list for re-recording (that is, the syllable from an invalid prompt was added, not the entire prompt). Mispronunciation of either instance of the syllable made the prompt invalid. However, exceptions were made if the auditor considered the "mispronunciation" to be simply the result of co-articulation or phonological variation, for example a change in voicing due to an adjacent segment. Dialect variation was also permitted; please see the discussion of this below. There was no strict rule here, only a rule of thumb that it should sound as though the intended syllable was pronounced. Generally, not pausing at the comma made the prompt invalid, since the second instance of the syllable was then not truly isolated. Non-fluent pronunciation of the phrase, however, was generally allowed, meaning that pauses within the phrase were often permitted. No stance was taken on co-articulation and its possible effect on the first instance of the syllable; this sort of effect (or its absence due to pausing) was not controlled for. However, to avoid any particular bias in the data in this respect, the contextual word sets contained a variety of words, providing a variety of phonetic contexts, which of course were chosen at random.

The auditing, and therefore the segmentation, was performed on the wide-band recordings. The narrow-band recordings were not synchronized with the wide-band recordings, so in order to segment the narrow-band recordings using the timestamps created for the wide-band recordings, the two recordings were aligned via a cross-correlation program. (An illustrative sketch of such an alignment follows.)
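The actual cross-correlation program is not included on this disc; the following Python sketch only illustrates the general idea, under simplifying assumptions (both signals already decoded to sample arrays, and a crude every-other-sample decimation standing in for proper resampling).

    # Illustrative only: find the offset (in seconds) at which the
    # narrow-band signal best lines up with the wide-band signal.
    # Assumes the u-law data has already been decoded to linear
    # samples; a real implementation should low-pass filter before
    # decimating, and search a bounded window rather than the full
    # correlation.
    import numpy as np

    def find_offset_seconds(wb_16k, nb_8k):
        wb = np.asarray(wb_16k, dtype=float)[::2]  # crude 16 kHz -> 8 kHz
        nb = np.asarray(nb_8k, dtype=float)
        xc = np.correlate(wb, nb, mode='full')
        lag = int(np.argmax(xc)) - (len(nb) - 1)   # lag of nb within wb, in samples
        return lag / 8000.0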
Syllable Inventories

The files in inv/ represent the distribution (inventory) of syllables in the corpus by speaker and data type (wide-band vs. narrow-band); doc/inv_doc.txt documents the formats of these inventory files. There are 400 so-called "unique" syllables, 20 per speaker, and 2005 so-called "common" syllables, although in actuality only 1845 syllables are present for all 20 speakers. Each of the remaining 160 syllables is missing from 1 to 3 speakers; that is, every common syllable is present for at least 17 speakers. These statements specifically describe the wide-band data. About 90% of the syllables recorded in the wide-band data were also successfully recorded in the narrow-band data, but the other 10% were lost. The reasons for this lie in the particulars of the project design (i.e., particular flaws) and, to some extent, the time constraints of the project. All of the above information, including the gaps in the narrow-band data, is represented in the files in inv/.

It should be noted that the list each speaker began with was a combination of the common list and the speaker's unique list, randomized for that speaker. I say "began with" because, as mistakes were made, those syllables were added to the end of the list. The randomization of the lists should have minimized any session effects that may have been present, in the sense that the session effects would not correlate with any other characteristic of the data. If one wanted to test for such session effects, however, one could not, since the ordering information is not reconstructable from the data on this disc. The ordering is, however, reconstructable from LDC internal data, should anyone be particularly interested in this information.

Filenames

doc/darpabet.txt contains the subset of the darpabet used to represent the syllables in data files. Syllables are represented in a modified fashion in filenames, however: first, "@" was replaced with "Q", then every uppercase letter was replaced with a two-character sequence, where the first character is "x" and the second character is the lowercase equivalent of the original character. (This mapping is spelled out in code at the end of this section.)

The data is separated into two directories, wb and nb, for wide-band and narrow-band recordings. These directories are subdivided into "phrasal" and "syllable" (isolated) cases, which are in turn subdivided by speaker. The phrasal vs. syllable distinction, shortened to "p" vs. "s", and the speaker distinction, indicated by four-character speaker IDs, are both represented in the filename as well as the path. For example,

  syls/wb/p/f101/p_f101_hxqt.sph
  syls/nb/s/f101/s_f101_xcxrxc.sph

represent two files by the same speaker, f101: the first is a wide-band recording of a phrase containing the syllable "h@t" (hat), and the second is a narrow-band recording of an isolated occurrence of the syllable "CRC" (church). The gender of the speaker, "f" or "m", is encoded in the first character of the ID; the following digits uniquely identify the speaker, with the first digit always being 1, the idea being that a second version of this corpus would use 2.

The purpose of the syllable representation scheme was to keep the syllable as part of the filename while keeping the filename file-system safe, with the syllable still relatively readable by both human and machine. I think (and hope) the user will find this representation convenient once familiar with the darpabet.
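The filename mapping just described is simple enough to state as code. The following Python sketch implements it; the decoder additionally assumes that lowercase "x" does not itself occur as a symbol in the darpabet subset.

    # The syllable-to-filename mapping described above: "@" becomes
    # "Q", then each uppercase letter becomes "x" plus its lowercase
    # equivalent.
    def encode_syllable(syl):
        out = []
        for ch in syl.replace('@', 'Q'):
            out.append('x' + ch.lower() if ch.isupper() else ch)
        return ''.join(out)

    # Inverse mapping; assumes a literal lowercase "x" never occurs
    # as a darpabet symbol in these syllables.
    def decode_syllable(name):
        chars, i = [], 0
        while i < len(name):
            if name[i] == 'x' and i + 1 < len(name):
                chars.append(name[i + 1].upper())
                i += 2
            else:
                chars.append(name[i])
                i += 1
        return ''.join(chars).replace('Q', '@')

    assert encode_syllable('h@t') == 'hxqt'    # as in p_f101_hxqt.sph
    assert encode_syllable('CRC') == 'xcxrxc'  # as in s_f101_xcxrxc.sph
    assert decode_syllable('hxqt') == 'h@t'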
Issues of Dialect Variation

When possible, if it was recognized that a speaker had a particular dialect variant, their recordings were accepted if they pronounced the syllable faithfully for their dialect. For example, one dialect variant that appears in this corpus is the merger of the two high back vowels before /l/, such that "pull" and "pool" are homophonous (both sounding like the standard version of "pool", in my experience). For such a speaker, their pronunciation of "pull", which is [pUl] in standard speech but [pul] in theirs, won't be useful to those interested in syllables for their specific phonetic qualities. In other words, you can't use such a speaker's "pull" if you want the phonetic sequence [pUl]; arguably, the speaker is not capable of producing such a sequence. However, for more abstract purposes, like speech recognition of the word "pull", you _can_ use such a speaker's token, since it is a correct pronunciation of that word. This was one of the reasons such phonetic "unfaithfulness" was allowed: so that speech recognition research might capitalize on this instance of variation.

However, one serious problem did emerge due to dialect variation, specifically relating to the so-called cot/caught merger. This merger, widespread in the US and Canada, involves the two low back vowels [a] and [c], represented respectively by "cot" and "caught". Because some people on the project do not natively have this distinction, and because for those of us who do have it, like myself, the two vowels are still somewhat confusable, error entered the design of the project in at least two ways. First, many prompts for syllables with these vowels actually represented the opposite vowel, because the wrong word was chosen for the prompt. Second, many recordings were probably audited incorrectly. For example, if the desired syllable was /kat/ but the speaker was prompted with the word "caught", the wrong syllable would have been elicited; in such a scenario, there would also have been a fair chance of the syllable being marked correct even though it was not. This of course only affected speakers who could make the distinction; speakers with the merger would always produce the same vowel in these cases, as described above for pool/pull. In most cases the correct vowel probably was elicited, but there are certainly errors due to incorrect prompt choices, more so with the vowels [a] and [c] than with other mergers. This problem became evident late enough in the project that it was not practical to remedy it. For those interested, the prompts (in the prompts/ directory) will aid in determining which vowel was actually elicited.

Note that demographic information is given in doc/speaker_inf.txt, which is documented in doc/speaker_doc.txt.

Clipping

The file doc/clipping.txt lists the 113 files in which clipping occurred, i.e., files containing samples at either the maximum or minimum possible value (based on the wide-band recordings). The file lists the audio files containing clipping, along with the number of clipped samples in each. The worst case is a file containing 9 clipped samples, and the average across these 113 files is about 2.2 clipped samples. (A sketch of this check follows.)
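For reference, the clipping check amounts to counting samples at the 16-bit extremes. A minimal sketch, assuming the samples have already been read out of the .sph file into an int16 array (decoding the file and skipping its header is assumed to be handled elsewhere):

    # Count samples at the maximum or minimum possible 16-bit value.
    import numpy as np

    def count_clipped(samples):
        s = np.asarray(samples, dtype=np.int16)
        return int(np.sum((s == 32767) | (s == -32768)))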
Conversational Data

A small amount of conversational data was collected in addition to the syllable data. That data is described here, but it is not included on this disc (for space reasons); it is available on a separate CD. About 10 minutes of conversational data was collected per speaker, although not for every speaker. The speakers were seated in the sound booth two at a time, one speaker wearing the same microphone setup used for the syllable recordings, and the other wearing a third, lavalier mic. Each *_wb.sph file is a two-channel file, containing the wide-band recording of the first participant and the sole (wide-band) recording of the second participant, at 16 kHz, 16-bit PCM. The corresponding *_nb.sph file contains the narrow-band recording of the first participant, at 8 kHz, 8-bit u-law. This is the same scenario as the syllable recordings, except that the second participant was added as a second channel to the wide-band recording.

The speakers were given a list of possible topics, but were not required to pick one of them, or any particular topic at all. The file topics.txt gives a rough idea of the topics chosen. Generally, the two participants were recorded for five minutes, then asked to switch mics and pick a new topic, then recorded for another five minutes.

The filenames for the conversations encode the two speakers involved, as well as wide-band vs. narrow-band recording, i.e., "wb" vs. "nb". Here I consider the "first" speaker to be the one wearing the microphones used for the syllable recordings; this speaker always corresponds to the first ID in the filename. I point this out because, for some reason, the first speaker does not always appear in the first channel of the audio data. In the case of the conversations, the narrow-band recordings were aligned with the wide-band recordings visually, using xwaves, rather than with the cross-correlation program, and are accordingly not as precisely aligned.

Acknowledgments and Contact Info

Many people deserve credit for the creation of this corpus. Mark Liberman, Jont Allen, Nelson Morgan, George Doddington, and others in the Novel Approaches group provided important conceptual advice in the design of the project. Chris Cieri provided continual guidance on many aspects of the project, both conceptual and practical. Dave Graff and Kevin Walker provided invaluable technical support. Special thanks to Dave Graff for assistance with the corpus documentation; much of its readability and accessibility is due to him. Many of my fellow Penn Linguistics students provided crucial assistance with the work of the project, especially James Mesbur. As the implementor of this corpus, I take sole credit for all mistakes and shortcomings.

Please don't hesitate to contact me with questions or problems, especially since we hope this corpus will prove to be the pilot for a subsequent, larger corpus. Your use of this corpus will allow us to determine the usefulness of data like this, and how such data might be improved. For all those who have been waiting for this corpus, thanks so much for your patience.

Jonathan Wright
jdwright@ldc.upenn.edu
11/20/03