Exactly 1000 calls from the Macrophone Corpus have been set aside for use as development test and evaluation test materials. This represents roughly 20% of the 5005 calls provided in the corpus. The development test set consists of 500 calls, grouped arbitrarily into five subdirectories of 100 calls each. The evaluation test data consists of five independent sets of 100 calls each.
The development set of 500 calls and each evaluation set of 100 calls are balanced for age and gender of speakers, to reflect the distribution of age and gender in the entire 5005-call collection. While the complete collection includes 66 calls in which the gender of the speaker was not identified, none of these 66 calls were included in any of the test sets. An additional 44 calls were found to have no information on the age of the speaker; these were also excluded from the test selection.
The selection of development and evaluation test calls was done as follows. Two tables were derived from the data base table "demogrph.tbl" -- one derived table contained only male callers, the other contained only female callers; in both tables, entries having "??" in the field for caller's age were discarded. Each of these derived tables was then sorted according to the age of the speakers. Based on the assumption that the sequences in these age-sorted tables were random with respect to other factors (e.g. speaker's education, income, geographic origin), every fifth entry from each table was extracted and assigned to a test set-aside list. As each entry was extracted, it was removed from the age-sorted table. Smaller resamplings from each age-sorted table were then done, always using a uniform sampling interval across the entire table, until exactly 1000 entries were present in the set-aside list. The set-aside list was then divided into six groups (one devtest and five evaltest) using simple modulo arithmetic -- i.e. all even-numbered entries were assigned to the devtest group, and each odd-numbered entry was assigned to one of the five evaltest groups based on the final digit (1, 3, 5, 7 or 9) of its sequence number in the list.
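The two mechanical steps of this procedure (uniform-interval extraction and the modulo split) can be illustrated with a short Python sketch. This is not the script used in production; the function names are hypothetical, sequence numbers are assumed to start at 1, and the smaller resampling passes are not reproduced because their intervals are not stated here.

```python
def take_every_nth(table, n=5):
    """Split an age-sorted table into (extracted, remaining).

    Every n-th entry (positions n, 2n, 3n, ... counting from 1) is
    extracted; all other entries stay in the table, in order.
    """
    extracted = table[n - 1::n]
    remaining = [entry for pos, entry in enumerate(table, start=1)
                 if pos % n != 0]
    return extracted, remaining


def partition_set_aside(set_aside):
    """Divide the set-aside list into one devtest and five evaltest groups.

    Assumption: sequence numbers are 1-based. Even numbers go to devtest;
    odd numbers go to the evaltest group keyed by their final digit.
    How each digit maps to an "eval_*" directory is not specified.
    """
    devtest = []
    evaltest = {digit: [] for digit in (1, 3, 5, 7, 9)}
    for seq, entry in enumerate(set_aside, start=1):
        if seq % 2 == 0:
            devtest.append(entry)
        else:
            evaltest[seq % 10].append(entry)
    return devtest, evaltest
```

Applied to a list of 1000 entries, partition_set_aside yields 500 devtest entries and 100 entries in each of the five evaltest groups, which matches the set sizes given above.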
During the CD-ROM production process, the set-aside list was used to move the designated calls into "devtst" and "eval_*" directories; calls not on this list were moved into the "train" directory. At a later stage in the process, the designated test-set calls were checked for the number of utterances present in each call. It was found that in five of the set-aside calls, more than half of the intended 44 utterances were missing. For each of these five calls, we found a matching call in the "train" segment, having similar gender and age traits but more utterances, and switched the positions of each matched pair.
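One way such a replacement could be chosen is sketched below in Python. The matching criteria actually used are not documented beyond "similar gender and age traits but more utterances", so this rule is an assumption: same gender, smallest age difference, ties broken by the larger utterance count. The record fields ("id", "gender", "age", "utterances") and the function name are hypothetical, and ages are assumed to be integers.

```python
def find_replacement(test_call, train_calls):
    """Pick a train call to swap with a deficient test call.

    Assumed rule (not the documented procedure): the candidate must have
    the same gender and strictly more utterances; among candidates, the
    closest age wins, and ties go to the call with more utterances.
    Returns None when no candidate qualifies.
    """
    candidates = [
        call for call in train_calls
        if call["gender"] == test_call["gender"]
        and call["utterances"] > test_call["utterances"]
    ]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda call: (abs(call["age"] - test_call["age"]),
                          -call["utterances"]),
    )
```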
All test-set calls have at least 23 utterances per call; the majority of these calls have at least 42 utterances. Fewer than 20% of the 1000 test calls are missing at least one of the 15 distinct response types (e.g. 19 of the 1000 test calls contain no WSJ sentences; another 4 calls contain no place-names).
After the five evaluation test sets were complete, each was encrypted using the DES algorithm. A different encryption key was assigned to each set of 100 calls. The development test set was left unencrypted.
The data base tables that accompany the training and development test data contain entries for only those calls that are included in these two data categories. Data base table entries for the evaluation test set calls have been segregated to a separate set of tables, which will be published with the evaluation data, in encrypted form.