The reference transcription will be transformed prior to comparing it with the output from a recognizer. It is important that these transformations are properly comprehended in the design of a recognition system, so that the system will perform well according to the scoring measure. Here are the transformations that will be applied to the reference:
Word fragments are represented in the transcription by appending a "-" to the (partial) spelling of the fragmented word. Fragments are included in the total word count and scored as follows:
The reference transcripts may describe some speech as unintelligible (indicated by "(( ))"), and then may or may not also provide a "best guess" as to what words it consists of. Such "best guess" doubtful words will be included in the total word count, with scoring as follows:
The reference transcripts may describe words as foreign, as words not in the language under test. This description will not be applied to words of foreign origin that have been widely incorporated into speech of the given language. Such foreign words will be included in the total word count, with scoring as follows:
For scoring purposes, all hesitation sounds, referred to as "non-lexemes", will be considered to be equivalent, and will be scored the same way as fragments, doubtful words, and foreign words. Although these sounds are transcribed in a variety of ways due to highly variable phonetic quality, they are all considered to be functionally equivalent from a linguistic perspective. Thus, all reference transcription words and hypothesized words in the conventional set of hesitation sounds will be mapped to "%hesitation". When a hesitation is hypothesized, the system output transcription should either use the "%hesitation" token or any of the hesitation sounds, or omit it altogether. Again:
The set of Spanish hesitation sounds is:
The set of Mandarin hesitation sounds is:
Homophones will not be treated as equivalent. Homophones must be correctly spelled in order to be counted as correct.
Words in the acoustic training data which appear with multiple spellings (including misspellings) will be treated as equivalent via csrfilt.sh and the above word-mapping files. Since the training data is also known to suffer from certain homophone errors ("its" and "it's"), these will be treated as equivalent as well. Note that in the past, we have not allowed homophone errors when a language model was present, but since this year's language models could be confused by the acoustic training data, we will score homophone errors which also occur in the acoustic training data as correct. However, please note that we do not plan to continue this practice in the future.
*** NOTE: Not Applicable to the Spanish & Mandarin Hub-4NE test ***
Periods of overlapping speech will not be scored. Any words hypothesized by the recognizer during these periods will not be counted as errors.
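As a rough illustration of the rule above, the following sketch drops hypothesized words that fall inside overlap periods before they can be counted. The function name `drop_overlap_words`, the tuple layouts, and the midpoint test are all assumptions made for this example; the scoring software may decide membership in an overlap period differently.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]   # (start_sec, duration_sec, text): hypothetical layout
Interval = Tuple[float, float]    # (begin_sec, end_sec) of one overlap period


def drop_overlap_words(words: List[Word], overlaps: List[Interval]) -> List[Word]:
    """Return only the words whose midpoint lies outside every overlap period.

    Assumption: a word belongs to an overlap period when its temporal
    midpoint falls inside that period. This is an illustrative choice only.
    """
    kept = []
    for start, dur, text in words:
        mid = start + dur / 2.0
        if not any(begin <= mid <= end for begin, end in overlaps):
            kept.append((start, dur, text))
    return kept


if __name__ == "__main__":
    hyp = [(11.34, 0.2, "NO"), (12.00, 0.34, "HABLO"), (13.35, 0.5, "ENGLISH")]
    # The overlap period below is invented for the example.
    print(drop_overlap_words(hyp, [(11.9, 12.5)]))
```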
Compound words which appear as single and multiple words in the acoustic training data will be treated as equivalent. New compound words will be looked up in the American Heritage Dictionary, third edition, and on the World Wide Web. If the compound word exists in these sources only as a single compound word, it will be scored as one word. If, however, it is also listed as two separate words or as a hyphenated word, it will be scored as separate words. The equivalences will be handled by csrfilt.sh and the above word-mapping files.
*** NOTE: Not Applicable to the Spanish & Mandarin Hub-4NE test ***
For languages where contracted words are commonly used (e.g., English), contractions will be expanded to their underlying forms in the reference transcriptions. Manual auditing will be used to ensure correct expansion. Contractions in the recognizer output will be expanded based on default expansions for standard contractions in the language. Thus the recognizer need not expand contractions, but it may be preferable for it to do so.
*** NOTE: Not Applicable to the Spanish & Mandarin Hub-4NE test ***
The file must be sorted according to the first three fields in each record: the first and the second in ASCII order, and the third in numeric order. This can be accomplished using the UNIX sort command: "sort +0 -1 +1 -2 +2nb -3".
See the manual page for ctm(5) supplied in the sctk distribution for a complete description of the file format.
*** NOTE: The bn_filt.pl script MAY NOT be used for this evaluation. Whatever tool we do use to prepare the reference transcripts will be released by December 2. ***
In order for the scoring software to be used, the reference transcripts must be processed as follows:
The command must end with the transcript filename followed by the basename to be used for the output files.
The rule files for the evaluation test data sets are:
For this evaluation, a scoring wrapper script, described below, has been supplied which will pre-filter both the reference and hypothesis transcripts before scoring. If you are scoring the default Hub-4NE evaluation, no manual prefiltering is required.
Both packages require compilation; see the README files in their respective directories for installation instructions.
The pre-filtering, alignment and scoring process can be performed with a single execution of hubscr02.pl as follows:
<HYP>.sys: | A summary of speaker performance in terms of Percent: Correct, Substitutions, Deletions, Insertions, Word Errors and Sentence (or Utterance) errors. Speaker averages, means, medians and standard deviations are computed for each percentage. |
<HYP>.raw: | A summary similar to '<HYP>.sys' except the output is word counts instead of percentages. |
<HYP>.pra: | A text copy of all the string alignments. |
For the Hub-4NE evaluation, an additional report will be produced via the "-o lur" option.
<HYP>.lur: | A report containing a scoring summary of the system broken down into sub-categories for each speaker. |
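To make the relationship between the count-based and percentage-based reports concrete, here is a minimal sketch that converts word counts into percentages. It assumes the conventional definition in which every percentage is taken relative to the number of reference words (correct + substitutions + deletions), and word errors are the sum of substitutions, deletions, and insertions. The `Counts` class and `percentages` function are hypothetical names, not part of the scoring package.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Word-level alignment counts for one speaker (illustrative container)."""
    correct: int
    substitutions: int
    deletions: int
    insertions: int

    @property
    def reference_words(self) -> int:
        # Every reference word is either correct, substituted, or deleted.
        return self.correct + self.substitutions + self.deletions


def percentages(c: Counts) -> dict:
    """Convert raw counts into percentages of the reference word count."""
    n = c.reference_words
    if n == 0:
        raise ValueError("no reference words to score")
    errors = c.substitutions + c.deletions + c.insertions
    return {
        "correct": 100.0 * c.correct / n,
        "substitutions": 100.0 * c.substitutions / n,
        "deletions": 100.0 * c.deletions / n,
        "insertions": 100.0 * c.insertions / n,
        "word_errors": 100.0 * errors / n,
    }


if __name__ == "__main__":
    # Invented counts, used only to exercise the arithmetic.
    print(percentages(Counts(correct=90, substitutions=5, deletions=5, insertions=2)))
```

Because insertions are counted against the reference length, the word error percentage can exceed 100 while the correct percentage cannot.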
SITE/SYSTEM NAME
HUB-4NE {CORE/CONTRAST} TEST
The submission process consists of 3 steps:
For each test system, create a sub-directory under your 'SITE' directory identifying the system's name or key attribute. The sub-directory name is to consist of a free form system identification string 'SYSID' chosen by you. Place all files pertaining to test runs using a particular system in the same SYSID directory.
The following file must be generated for each system used in the tests. Only one copy of the file need be generated if the system is used for multiple tests/conditions:
Place your system description in the file, 'sys-desc.txt'.
Create a system output file, '<TEST_SET>.ctm', for each primary or contrastive test, where <TEST_SET> corresponds to the root portion of the index file name. The list of <TEST_SET> names is included below.
E-mail method:
    tar -cvf - ./<SITE> | compress | \
    uuencode <SITE>-<SUBM_ID>.tar.Z | \
    mail -s "Nov97 Hub-4NE test results <SITE>-<SUBM_ID>" \
    jonathan.fiscus@nist.gov
where,
where,
This command creates a single file containing all of your results. Next, ftp to jaguar.ncsl.nist.gov, giving the username 'anonymous' and your e-mail address as the password. After you are logged in, issue the following set of commands (the prompt will be 'ftp>'):
You've now submitted your recognition results to NIST. The last thing you need to do is send an e-mail message to Jon Fiscus at jonathan.fiscus@nist.gov notifying NIST of your submission. Please include the name of your submission file in the message.
Note: If you choose to submit your results in multiple shipments, please submit ONLY one set of results for a given test system/condition unless you've made other arrangements with NIST. Otherwise, NIST will programmatically ignore duplicate files.
<SITE>/<SYSID>/<FILES>
SYSID ::= (short system description ID, preferably <= 8 characters)
FILES ::= sys-desc.txt | <TEST_SET>.ctm
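The directory grammar above can be checked mechanically before packaging a submission. The sketch below is one possible check; the function name `check_submission_path` and the site and system names in the usage lines are hypothetical. Since the list of valid <TEST_SET> names is not reproduced here, the check accepts any file name ending in ".ctm".

```python
from pathlib import PurePosixPath
from typing import List


def check_submission_path(path: str) -> List[str]:
    """Return a list of problems found in a '<SITE>/<SYSID>/<FILES>' path.

    An empty list means the path matches the layout described in the text.
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 3:
        return ["expected exactly three components: <SITE>/<SYSID>/<FILES>"]
    _site, sysid, filename = parts
    problems = []
    if len(sysid) > 8:
        # The text only states a preference, so this is reported as a warning.
        problems.append("warning: SYSID longer than 8 characters")
    if filename != "sys-desc.txt" and not filename.endswith(".ctm"):
        problems.append("file is neither sys-desc.txt nor <TEST_SET>.ctm")
    return problems


if __name__ == "__main__":
    # 'mysite', 'sys1', and 'eval' are placeholders, not real identifiers.
    print(check_submission_path("mysite/sys1/sys-desc.txt"))
    print(check_submission_path("mysite/sys1/eval.txt"))
```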
The time-marked hypothesis words for the Hub-4NE test will be placed in a single file called "<TEST_SET>.ctm". The CTM file format is a concatenation of time marks for each word in each broadcast. Each word token must have a broadcast id, channel identifier (1 in the case of Hub-4NE), start time, duration, and case-insensitive word text. Optionally, a confidence score can be appended for each word. The start time must be in seconds and relative to the beginning of the waveform file. The broadcast ids for the Hub-4NE tests are "h4ne97sp" and "h4ne97ma".
The file must be sorted by the first three columns: the first and the second in ASCII order, and the third in numeric order. The UNIX sort command "sort +0 -1 +1 -2 +2nb -3" will sort the words into the appropriate order.
Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.
Included below is an example:
;;
;; Comments follow ';;'
;;
;; The Blank lines are ignored
;;
h4ne97sp 1 11.34 0.2 NO 0.763
h4ne97sp 1 12.00 0.34 HABLO 0.384530
h4ne97sp 1 13.35 0.5 ENGLISH 0.806418
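The record layout and ordering rule described above can be exercised with a small reader. This is a minimal sketch, not the parser used by the scoring software: the names `CtmWord`, `parse_ctm`, and `sort_ctm` are hypothetical, and the reader assumes whitespace-separated fields with an optional sixth confidence field. See the ctm(5) manual page for the authoritative format.

```python
from typing import List, NamedTuple, Optional


class CtmWord(NamedTuple):
    broadcast: str
    channel: str
    start: float                 # seconds from the beginning of the waveform file
    duration: float
    word: str
    confidence: Optional[float]  # None when no confidence score is given


def parse_ctm(text: str) -> List[CtmWord]:
    """Parse CTM records, skipping blank lines and ';;' comment lines."""
    words = []
    for line in text.splitlines():
        stripped = line.strip()
        # Leading whitespace before ';;' is tolerated here (an assumption).
        if not stripped or stripped.startswith(";;"):
            continue
        fields = stripped.split()
        if len(fields) not in (5, 6):
            raise ValueError(f"malformed CTM record: {line!r}")
        conf = float(fields[5]) if len(fields) == 6 else None
        words.append(CtmWord(fields[0], fields[1], float(fields[2]),
                             float(fields[3]), fields[4], conf))
    return words


def sort_ctm(words: List[CtmWord]) -> List[CtmWord]:
    """Order by broadcast id and channel (string order), then start time (numeric)."""
    return sorted(words, key=lambda w: (w.broadcast, w.channel, w.start))


if __name__ == "__main__":
    sample = """;; Comments follow ';;'

h4ne97sp 1 13.35 0.5 ENGLISH 0.806418
h4ne97sp 1 11.34 0.2 NO 0.763
h4ne97sp 1 12.00 0.34 HABLO 0.384530
"""
    for w in sort_ctm(parse_ctm(sample)):
        print(w.broadcast, w.channel, w.start, w.duration, w.word, w.confidence)
```

Python compares strings by code point, which coincides with ASCII order for ASCII identifiers such as the broadcast ids used here.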