The 1997 speaker recognition evaluation is part of an ongoing series of yearly evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end the evaluation was designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible.
The 1997 evaluation will be conducted in May. A follow-up workshop for evaluation participants will be held during June to discuss research findings. Participation in the evaluation is solicited from all sites that find the task and the evaluation of interest. For more information, and to register intent to participate in the evaluation, please contact Dr. Alvin Martin at NIST.1
The current speaker recognition evaluation focuses on the task of speaker detection. That is, the task is to determine whether a specified target speaker is speaking during a given speech segment.2 This task is posed in the context of conversational telephone speech and for limited training data. The evaluation is designed to foster research progress, with the goals of:
Speaker detection performance will be evaluated by measuring the correctness of detection decisions for an ensemble of speech segments. These segments will represent a statistical sampling of conditions of evaluation interest. For each of these segments a set of target speaker identities will be assigned as test hypotheses. Each of these hypotheses must then be judged true or false, and the correctness of these decisions will be tallied.3
The formal evaluation measure will be a detection cost function, defined as a weighted sum of the miss and false alarm error probabilities:
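One standard way to write out such a function, using the parameter names defined just below (this explicit form is a reconstruction consistent with that description, not a quotation of the official plan), is:

    CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)

where PMiss|Target and PFalseAlarm|NonTarget are the miss and false alarm probabilities measured over the target and non-target trials, respectively.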
The parameters of this cost function are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the target, PTarget. The primary evaluation will use the following parameter values:
In addition to the (binary) detection decision, a decision score will also be required for each test hypothesis.4 This decision score will be used to produce detection error tradeoff curves, in order to see how misses may be traded off against false alarms.
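For illustration only, the following sketch (in Python, using made-up scores rather than actual evaluation data, and not part of any official scoring software) shows one way such a tradeoff curve may be computed: a decision threshold is swept over the pooled scores, and the miss and false alarm probabilities are tallied at each threshold.

    import numpy as np

    def det_points(target_scores, nontarget_scores):
        # Compute (P_Miss, P_FalseAlarm) pairs by sweeping a decision threshold
        # over the pooled scores. Higher scores are assumed to be more target-like.
        thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
        p_miss = np.array([(target_scores < t).mean() for t in thresholds])    # targets rejected
        p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # impostors accepted
        return p_miss, p_fa

    # Hypothetical pooled scores, for illustration only.
    rng = np.random.default_rng(0)
    p_miss, p_fa = det_points(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 5000))

An actual detection error tradeoff plot would then display these two error probabilities against each other, conventionally on normal-deviate scales.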
There will be 3 training conditions for each target speaker. All 3 of these conditions will use 2 minutes of training speech data from the target speaker. The 3 conditions are:
The actual duration of the training files will vary from the nominal value of 1 minute, so that whole turns may be included whenever possible. Actual durations will be constrained to be within the range of 55-65 seconds.
Performance will be computed and evaluated separately for female and male target speakers and for the 3 training conditions. For each of these training conditions, there are 2 different test conditions of interest. These are:
The development data for this evaluation will comprise the DevSet and EvalSet for last year's evaluation. The 1996 DevSet is one CD-ROM labeled sid96d1 and the 1996 EvalSet is two CD-ROM's labeled sid96e1f and sid96e1m. Sites intending to perform the evaluation and to submit results to NIST may acquire these development data and associated documentation from NIST free of charge by contacting Dr. Martin.
The evaluation data will be drawn from the SwitchBoard-2 phase 1 corpus.6 Both training and test segments will be constructed by concatenating consecutive turns for the desired speaker, similar to what was done last year. Each segment will be stored as a continuous speech signal in a separate SPHERE file. The speech data will be stored in 8-bit mu-law format. The SPHERE headers will include auxiliary information to document the source file, start time, and duration of all excerpts that were used to construct the segment.7
NIST will manually audit all segments to verify that the selected speech is from the identified speaker and does not include any significant extraneous speech from other speakers. Between 400 and 500 speakers will serve both as target speakers and as non-target (impostor) speakers.8 Additional speakers will serve only as impostors.
The evaluation corpus will be supplied on 6 CD-ROM's. For convenience, data will be grouped according to sex and stored separately - three discs for female data and three discs for male data. Knowledge of the sex of the target speaker is admissible side information and may be used if desired.
The evaluation data will include both training data and test data. The number of test segments from each target speaker will vary, with an average of about 10 test segments per target speaker and test duration. (For each speaker, each of the test segments of a given duration will be from a unique conversation for that speaker.) This will make a total of about 2500 test segments for each sex and for each of the three test durations.9
A total of nine tests constitute the evaluation, namely one test for each of the three test durations under each of the three training conditions. Every evaluation participant is required to submit all of the results for each test performed.10 In the event that a participating site does not submit a complete set of results, NIST will not report any results for that site. For all nine tests in this evaluation, there will be a grand total of about 50,000 target speaker trials and 500,000 non-target speaker trials (see Evaluation Data Set Organization below).
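As a rough consistency check on these totals (an informal calculation, assuming about 2,500 test segments per sex per test as stated above and roughly one true-target plus ten impostor hypotheses per segment, per footnote 13):

    9 tests × 2 sexes × ~2,500 segments   ≈ 45,000 test segment presentations
    ~1 true-target hypothesis per segment ≈ 45,000 target trials
    ~10 impostor hypotheses per segment   ≈ 450,000 non-target trials

both of the last two figures being in line with the stated totals of about 50,000 and 500,000.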
The following evaluation rules and restrictions on system development must be observed by all participants:
All six discs in the EvalSet will have the same organization. Each disc's directory structure will organize the data according to information admissible to the speaker recognition system. This directory structure will be as follows:
Sites participating in the evaluation must report test results for all of the tests. These results must be provided to NIST in results files using a standard ASCII record format, with one record for each decision. Each record must document its decision with target identification, test segment identification, and decision information. Each record must thus contain seven fields, separated by white space and in the following order:
A brief description of the system (the algorithms) used to produce the results must be submitted along with the results, for each system evaluated. (It is permissible for a single site to submit multiple systems for evaluation. In this case, however, the submitting site must identify one system as the "primary" system prior to performing the evaluation.)
Sites must report the CPU execution time that was required to process the test data, as if the test were run on a single CPU. Sites must also describe the CPU and the amount of memory used.
1 To contact Dr. Martin, you may send him email at alvin@jaguar.ncsl.nist.gov, or you may call him at (301) 975-3169.
2 Speaker detection is chosen as the task in order to focus research on core technical issues and thus improve research efficiency and maximize progress. Although important application-level issues suggest more complex tasks, such as simultaneous recognition of multiple speakers, these issues are purposely avoided. These application-level challenges are believed to be readily solvable once the performance of the underlying core technology is adequate, so the R&D effort is better spent trying to solve the basic but daunting core problems in speaker recognition.
3 Note that explicit speaker detection decisions are required. Explicit decisions are required because the task of determining appropriate decision thresholds is a necessary part of any speaker detection system and is a challenging research problem in and of itself.
4 Note that decision scores from the various target speakers will be pooled before plotting detection error tradeoff curves. Thus it is important to normalize scores across speakers to achieve satisfactory detection performance.
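One simple way to do this (a minimal sketch in Python of one possible normalization, not a prescribed method; the choice of score normalization is left to each site) is to standardize each target model's scores using score statistics gathered from impostor trials against that model:

    import numpy as np

    def normalize_score(raw_score, impostor_scores):
        # Standardize a raw score for one target model using the mean and standard
        # deviation of that model's scores on held-out impostor speech.
        mu = np.mean(impostor_scores)
        sigma = np.std(impostor_scores)
        return (raw_score - mu) / max(sigma, 1e-6)  # guard against a degenerate sigma

    # Hypothetical values, for illustration only.
    print(normalize_score(-0.5, np.array([-2.1, -1.7, -2.4, -1.9, -2.0])))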
5 The "same handset" condition in this evaluation is not really a fair test. This is because all of the impostor data is collected from handsets that are different from those used for the target speakers' training data. Thus it is only the target speakers who use the same handset. The impostors use different handsets and thus are more easily discriminated against.
6 The SwitchBoard-2 phase 1 corpus was created by the University of Pennsylvania's Linguistic Data Consortium (LDC) for the purpose of supporting research in speaker recognition. Information about this corpus and other related research resources may be obtained by contacting the LDC (by telephone at 215/898-0464 or via email at ldc@upenn.edu).
7 For information about NIST's SPHERE utilities (including instructions for downloading them), visit the NIST Spoken Natural Language Processing Group's website (http://www.nist.gov/speech). The source time-marks are documented in each test segment's SPHERE header. The field segment_origin lists the information used in constructing the test segment. A segment_origin record is of the type: segment_origin=[conversation_id,channel,start_time,end_time]...
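For illustration only, a minimal sketch (in Python) of how such a record might be split into its fields once the segment_origin value has been read from the header; the helper below is hypothetical and simply assumes the bracketed fields appear exactly as shown above:

    import re

    def parse_segment_origin(value):
        # Split a value of the form [conversation_id,channel,start_time,end_time]...
        # into a list of (conversation_id, channel, start_time, end_time) tuples.
        excerpts = []
        for conv, chan, start, end in re.findall(
                r'\[([^,\]]+),([^,\]]+),([^,\]]+),([^,\]]+)\]', value):
            excerpts.append((conv, chan, float(start), float(end)))
        return excerpts

    # Hypothetical header value, for illustration only.
    print(parse_segment_origin("[sw_40023,A,12.50,43.25][sw_40023,A,80.00,107.75]"))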
8 In the 1996 evaluation, speakers were identified as either "target" or "non-target", so the distinction was associated with the speaker's identity. This year, the appellation of "target" or "non-target" is associated with the speaker's role rather than the speaker's identity. The reason for the change is that the distinction made no difference: the results from last year's evaluation demonstrated that performance was insensitive to whether an impostor was a target speaker or a non-target speaker.
9 For the 1997 evaluation the number of test segments will be limited to 2500, but there will be more than 2500 segments on the evaluation CDs. These extra segments may be useful in future development work.
10 Participants are encouraged to do as many tests as possible. However, it is absolutely imperative that results for all of the test segments and target speakers in a test be submitted in order for that test to be considered valid and for the results to be accepted. If a participant anticipates being unable to complete all the tests, NIST should be consulted for preferences about which tests to perform. Each participant must negotiate its test commitments with NIST before NIST ships the evaluation CD-ROM's to that site.
11 The reason for this rule is that the technology is viewed as being "application-ready". This means that the technology must be ready to perform speaker detection simply by being trained on a specific target speaker and then performing the detection task on whatever speech segments are presented, without the (artificial) knowledge of the speech of other speakers and other segments.
12 This is a nominal requirement, because the LDC has not yet made the SwitchBoard-2 phase 1 corpus publicly available.
13 Ten target ID's per test segment were chosen to maximize the efficiency of the evaluation for a given level of statistical significance. This results from the performance design goal: given a false alarm probability 10 times lower than the miss probability, it takes ten times more impostor trials to produce equal numbers of miss and false alarm errors.
14 The Maritime Institute is in the Baltimore-Washington area, not far from Baltimore-Washington International (BWI) airport.