SELECTION OF MACROPHONE TEST SETS


Exactly 1000 calls from the Macrophone Corpus have been set aside for
use as development test and evaluation test materials.  This
represents about 20% of the 5005 calls provided in the corpus.  The
development test set consists of 500 calls, grouped arbitrarily into
five subdirectories of 100 calls each.  The evaluation test data
consists of five independent sets of 100 calls each.

The development set of 500 calls, and each evaluation set of 100
calls, is balanced for age and gender of speakers, to reflect the
distribution of age and gender in the entire 5005-call collection.
While the complete collection includes 66 calls in which the gender of
the speaker was not identified, none of these 66 calls were included
in any of the test sets.  An additional 44 calls were found to have no
information on the age of the speaker; these were also excluded from
the test selection.

The selection of development and evaluation test calls was done as
follows.  Two tables were derived from the data base table
"demogrph.tbl" -- one derived table contained only male callers, the
other contained only female callers; in both tables, entries having
"??" in the field for caller's age were discarded.  Each of these
derived tables was then sorted according to the age of the speakers.
Based on the assumption that the sequences in these age-sorted tables
were random with respect to other factors (e.g. speaker's education,
income, or geographic origin), every fifth entry from each table was
extracted and assigned to a test set-aside list.  As each entry was
extracted, it was removed from the age-sorted table.  Smaller
resamplings from each age-sorted table were done, always using a
uniform sampling interval across the entire table, until exactly 1000
entries were present in the set-aside list.  The set-aside list was
then divided into six groups (one devtest and five evaltest) using
simple modulo arithmetic -- i.e. all even-numbered entries were
assigned to the devtest group, and each odd-numbered entry was
assigned to one of the five evaltest groups based on the final digit
(1, 3, 5, 7 or 9) of its sequence number in the list.
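
The selection steps described above can be sketched in Python as
follows.  This is only an illustration, not the procedure's original
code: the layout of "demogrph.tbl" is not reproduced here, so the
row fields ("call", "gender", "age"), the gender codes "M" and "F",
the numeric form of the age field, the function names, and the choice
to number the set-aside list from 1 are all assumptions.

```python
from collections import defaultdict


def derive_age_sorted(rows, gender):
    """Keep rows of one gender whose age is known, sorted by age.

    Assumes each row is a dict with hypothetical keys "gender" and
    "age", and that a known age is a string of digits.
    """
    kept = [r for r in rows if r["gender"] == gender and r["age"] != "??"]
    return sorted(kept, key=lambda r: int(r["age"]))


def take_every_nth(table, n=5):
    """Extract every n-th entry and return (taken, remaining)."""
    taken = table[n - 1::n]
    remaining = [r for i, r in enumerate(table, start=1) if i % n != 0]
    return taken, remaining


def split_set_aside(set_aside):
    """Split the set-aside list by sequence number (assumed to start at 1).

    Even-numbered entries go to devtest; each odd-numbered entry goes
    to the evaltest group keyed by the final digit of its number.
    """
    devtest, evaltest = [], defaultdict(list)
    for seq, row in enumerate(set_aside, start=1):
        if seq % 2 == 0:
            devtest.append(row)
        else:
            evaltest[seq % 10].append(row)
    return devtest, dict(evaltest)


if __name__ == "__main__":
    # A placeholder list of 1000 entries stands in for the real set-aside list.
    devtest, evaltest = split_set_aside(list(range(1, 1001)))
    print(len(devtest), {digit: len(g) for digit, g in sorted(evaltest.items())})
```

The smaller resampling passes are not shown, because the sampling
intervals used for them are not recorded in this document.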

During the CD-ROM production process, the set-aside list was used to
move the designated calls into "devtst" and "eval_*" directories;
calls not on this list were moved into the "train" directory.  At a
later stage in the process, the designated test-set calls were checked
for the number of utterances present in each call.  It was found that
in five of the set-aside calls, more than half of the intended 44
utterances were missing.  For these five calls, we found a set of
matching calls in the "train" segment, having similar gender and age
traits but more utterances, and switched the positions of each matched
pair.
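
The check and the matching step described above might look like the
following Python sketch.  It is a hypothetical illustration only: the
text does not say how "similar" age was judged, so the age tolerance
(max_age_gap), the rule for choosing among several candidates, the
requirement of an identical gender code, and the field and function
names are all assumptions.

```python
EXPECTED_UTTERANCES = 44  # intended number of utterances per call


def needs_swap(call):
    """True if more than half of the intended utterances are missing."""
    missing = EXPECTED_UTTERANCES - call["utterances"]
    return missing > EXPECTED_UTTERANCES / 2


def find_replacement(bad, train, max_age_gap=2):
    """Pick a train call of the same gender and similar age with more utterances.

    max_age_gap is an assumed tolerance in years.  Among candidates,
    the closest age wins, then the larger utterance count.
    Returns None if no candidate qualifies.
    """
    candidates = [
        c for c in train
        if c["gender"] == bad["gender"]
        and abs(c["age"] - bad["age"]) <= max_age_gap
        and c["utterances"] > bad["utterances"]
    ]
    if not candidates:
        return None
    return min(candidates,
               key=lambda c: (abs(c["age"] - bad["age"]), -c["utterances"]))


if __name__ == "__main__":
    # Made-up example rows, not taken from the corpus tables.
    bad = {"call": "t1", "gender": "F", "age": 30, "utterances": 10}
    train = [
        {"call": "a", "gender": "M", "age": 30, "utterances": 44},
        {"call": "b", "gender": "F", "age": 31, "utterances": 44},
        {"call": "c", "gender": "F", "age": 30, "utterances": 40},
    ]
    if needs_swap(bad):
        print(find_replacement(bad, train)["call"])
```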

All test-set calls have at least 23 utterances per call; the majority
of these calls have at least 42 utterances.  Fewer than 20% of the 1000
test calls are missing at least one of the 15 distinct response types
(e.g. 19 of the 1000 test calls contain no WSJ sentences; another 4
calls contain no place-names).

After the five evaluation test sets were complete, each was encrypted
using the DES algorithm.  A different encryption key was assigned to
each set of 100 calls.  The development test set was left unencrypted.

The data base tables that accompany the training and development test
data contain entries for only those calls that are included in these
two data categories.  Data base table entries for the evaluation test
set calls have been segregated into a separate set of tables, which
will be published with the evaluation data, in encrypted form.