*************************** A T T E N T I O N ********************************
*   This document provided the output preparation, self-scoring, and         *
*   submission instructions for the November 1996 DARPA/NIST CSR Hub-4       *
*   Evaluation.  You may use this document to implement the same protocols   *
*   used in the 1996 Evaluation, but please do NOT submit your results to    *
*   NIST.                                                                     *
*************************** A T T E N T I O N ********************************


        DARPA CSR 1996 Broadcast News Hub-4 Benchmark Test Evaluation
   System Output Preparation, Scoring Protocols, and Submission Instructions

                               Revised 11/27/96


Contents
--------

   1.0  Scoring the H4 Broadcast News Benchmark Tests
        1.1  Preparation of hypothesized transcripts
        1.2  Preparation of reference transcripts
        1.3  Transcription pre-filtering
             1.3.1  Word fragments
             1.3.2  Compound words
             1.3.3  Multiple representations
             1.3.4  Contractions
             1.3.5  Pause fillers
             1.3.6  Overlapping speech
        1.4  Running the NIST scoring software
   2.0  Scoring software output
   3.0  System descriptions
   4.0  Submission of test results to NIST
        4.1  Due dates
        4.2  Test results format
        4.3  File and directory formats

Sections 1.0 and 2.0 describe the process to be used in preparing Hub-4
system output for scoring and for implementing the NIST scoring software.
Section 3.0 describes the format for submission of system information.
Section 4.0 describes the protocol for submitting your recognition output to
NIST for official scoring.


1.0 Scoring the H4 Broadcast News Benchmark Test
-------------------------------------------------

This section describes the process for preparing system-generated hypothesis
and reference transcriptions for scoring and for implementing the NIST
scoring software on these files.

1.1 Preparation of Hypothesized Transcripts
--------------------------------------------

The hypothesis transcripts are to be formatted in the CTM format.  This
format is a concatenation of time mark records for each word in each channel
of a waveform.  Each record, separated by a newline, must have a waveform id,
channel id (1 for the Hub-4 data), start time, duration, word text, and,
optionally, a confidence score.  The waveform id for each CTM record will be
the file ID given in the PEM or UEM map file record used.

The file must be sorted according to the first three fields in each record:
the first and the second in ASCII order, and the third in numeric order.
This can be accomplished using the UNIX sort command:
"sort +0 -1 +1 -2 +2nb -3".  See the manual page for ctm(5) supplied in the
sclite distribution for a complete description of the file format.

1.2 Preparation of Reference Transcripts
-----------------------------------------

Prior to scoring, the Hub-4 reference transcripts must be converted to the
segment time marked (STM) format used by the sclite scoring software.  The
filter, 'bn_filt.pl' (version 1.6), located under the 'bn_filt.pl' directory
at the top level of this disc, produces a derivation of the original Hub-4
transcript suitable for scoring.

In order for the scoring software to be used, the reference transcripts must
be processed as follows:

   1. Only the excerpts used in the test may be input to the scoring
      software.

   2. Each of the reference excerpts must be run through bn_filt.pl
      individually.

   3. The filtered excerpts must then be concatenated into a single file in
      the same order as in the concatenated hypothesis transcript.

Example execution of bn_filt.pl using the Hub-4 devtest data:

   % bn_filt.pl -s h496_spkrdb_960917 -f stm,uem,pem -b 127 -e 1869 \
        i96071p.txt i96071p

      -s  speaker database   (filename of speaker database)
      -f  stm,uem,pem        (indicates output types (files) to be produced)
      -b  beginning time
      -e  ending time

The command must end with the transcript filename followed by the basename to
be used for the output files.
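Building on the example execution above, the following is a minimal sketch of
steps 2 and 3 for a test consisting of two excerpts.  The second excerpt name
and its begin/end times are hypothetical, and it is assumed here that the STM
output of bn_filt.pl is written to '<basename>.stm':

   # Step 2: filter each reference excerpt individually.
   bn_filt.pl -s h496_spkrdb_960917 -f stm,uem,pem -b 127 -e 1869 \
        i96071p.txt i96071p
   bn_filt.pl -s h496_spkrdb_960917 -f stm,uem,pem -b  96 -e 1750 \
        i96072p.txt i96072p

   # Step 3: concatenate the filtered .stm excerpts in the same order as the
   # concatenated hypothesis transcript.
   cat i96071p.stm i96072p.stm > ref.stm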
1.3 Transcription Pre-filtering
--------------------------------

The reference and hypothesis (system-generated) transcriptions will be
"pre-filtered" prior to scoring to remove certain ambiguities according to a
set of rules located in a pair of rule files.  It is known that variant and
erroneous spellings of words exist in the acoustic training data.  These
variants will be mapped to a single "canonical" representation using this
pre-filter.  The two rule files have been developed to cover the 1996 H4
acoustic training, development test, and evaluation test material.

The rule files and the utility to perform the pre-filtering operation are
located in the "/tranfilt" directory.  "Tranfilt" version 1.4 is required for
this evaluation.  The rule files for the evaluation test data are
"et96_1.glm" and "et96_1.utm".  The file, "et96_1.glm", contains rules for
global substitutions of lexical equivalents, and the file, "et96_1.utm",
contains rules for utterance-specific lexical equivalents.  Note that
"et96_1.utm" is empty, but is still required by the program.

*** NOTE: The word-mapping files will not be released until December 12. ***

The Bourne shell script, "csrfilt.sh", is used to apply the mapping rules in
the above files.  The "/tranfilt" directory contains compilation and
installation instructions for the script.  The script operates as a simple
UNIX filter that reads the reference and hypothesis transcriptions from
"stdin" and writes the filtered transcriptions to "stdout".  The format for
using the utility is as follows:

   csrfilt.sh -dh global-map-file utterance-map-file < filein > fileout

Once an STM file has been filtered using bn_filt.pl, it must be filtered
using csrfilt.sh with the above command.  The hypothesis file, in CTM format,
must be filtered slightly differently, via the additional option "-i ctm", so
that it is parsed correctly.

Note: The flag "-dh" replaces all hyphens with spaces so that hyphenated
words are scored as separate lexemes.  This option must be used on both the
reference and hypothesis transcripts for correct scoring.

Example use of csrfilt.sh:

   (reference transcripts)
   csrfilt.sh -i stm -dh et96_1.glm et96_1.utm < ref.stm > ref.stm-filt

   (hypothesis transcripts)
   csrfilt.sh -i ctm -dh et96_1.glm et96_1.utm < hyp.ctm > hyp.ctm-filt

The .stm-filt and .ctm-filt files will be used as input to the scoring
software.
1.3.1 Word Fragments
--------------------

Word fragments, i.e., partially-pronounced words, will be scored using the
protocol developed for the LVCSR tests.  When fragments occur in a reference
transcript, the following two additional rules are applied during scoring:

   1. fragment deletions are forgiven (i.e., treated as if they never
      occurred)

   2. substitutions where a transcribed fragment is included as a substring
      of the aligned word in the recognized string are counted as correct

All other insertions and substitutions will be scored as errors.

Examples:

   Ref:  the dollar rose shar- today

   Hyp:  the dollar rose today          (deletion ignored - no error)
   Hyp:  the dollar rose sharp today    (superstring substitution scored as
                                         correct)
   Hyp:  the dollar rose shape today    (substitution error)

Note that the above procedures have been used in recent LVCSR tests.

1.3.2 Compound Words
--------------------

Compound words which appear as single and multiple words in the acoustic
training data will be treated as equivalent.  New compound words will be
looked up in the American Heritage Dictionary, third edition, and on the
World Wide Web.  If the compound word exists in these sources only as a
single compound word, it will be scored as one word.  If, however, it is also
listed as two separate words or as a hyphenated word, it will be scored as
separate words.  The equivalences will be handled by csrfilt.sh and the above
word-mapping files.

*** NOTE: The word-mapping files will not be released until December 12. ***

1.3.3 Multiple Representations
------------------------------

Words in the acoustic training data which appear with multiple spellings
(including misspellings) will be treated as equivalent via csrfilt.sh and the
above word-mapping files.  Since the training data is also known to suffer
from certain homophone errors ("its" and "it's"), these will be treated as
equivalent as well.  Note that in the past, we have not allowed homophone
errors when a language model was present, but since this year's language
models could be confused by the acoustic training data, we will score
homophone errors which also occur in the acoustic training data as correct.
However, please note that we do not plan to continue this practice in the
future.

*** NOTE: The word-mapping files will not be released until December 12. ***

1.3.4 Contractions
------------------

Contractions in the recognition output will be expanded to an alternation
containing all possible expansions relative to context via csrfilt.sh and the
global word-mapping file.  E.g.,

   she's  ->  she {has/is}

The transcript will use a new SGML tag to indicate the proper expansion of
each contraction relative to context.  The syntax for the tag is described in
the revised annotation document (Ver. 3.8).  The transformation to the
expanded form will be accomplished by bn_filt.pl.  The alternated/expanded
hypothesis file will then be scored against the expanded reference file.

*** NOTE: The word-mapping files will not be released until December 12. ***
Additional versions of the word-mapping file may be released after the above
date to accommodate later submissions.

1.3.5 Pause Fillers
-------------------

Non-word pause fillers, such as um, uh, hmm, err, etc., will be filtered from
the reference transcripts prior to scoring.  Each site MUST remove such pause
fillers from their system output before submission for scoring.  These pause
fillers are removed from the STM file via bn_filt.pl.
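As a minimal sketch of one way to strip pause fillers from a CTM hypothesis
before submission (the filler inventory and the output file name here are
hypothetical; each site should filter whatever filler tokens its own system
produces):

   # Drop CTM records whose word field (field 5) is a pause filler;
   # all other records, comments, and blank lines pass through unchanged.
   awk 'toupper($5) != "UM" && toupper($5) != "UH" &&
        toupper($5) != "HMM" && toupper($5) != "ERR"' hyp.ctm > hyp-nofill.ctm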
1.3.6 Overlapping Speech
-------------------------

Areas of overlapping speech will not be scored in this evaluation.  The
transcript will contain a new SGML tag to indicate overlapping speech.  The
syntax for the tag is described in the revised annotation document to be
released next week.  Words recognized during the tagged overlap times will
not be scored.  This is accomplished by separating the overlapping text into
a specially tagged STM record indicating that scoring should not be
performed.

1.4 Running the NIST scoring software on Hub-4 data
----------------------------------------------------

In scoring the Hub-4 tests, NIST will use word alignments produced by the
NIST SCLITE version 1.4 scoring package, which is included on this disc.  The
scoring package has been included in the top-level directory, "/sclite1.4",
of this release.  The directory contains a "readme" file with compilation and
installation instructions.

The alignment and scoring process can be performed with a single command.  Be
sure to use the pre-filtered, concatenated reference and hypothesis
transcriptions as described in Sections 1.2 and 1.3 above.  To score a
reference transcription against a corresponding system-generated hypothesis
transcription, use the "sclite" program as follows:

   sclite -F -r ref.stm-filt stm -h hyp.ctm-filt ctm -o all lur

More detailed documentation for using sclite is located in the man page,
"/sclite1.4/doc/sclite.1", or the HTML file, "/sclite1.4/doc/sclite.htm".  On
UNIX, after installation, the man page may be accessed via "man sclite".
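Putting Sections 1.1 through 1.4 together, a complete scoring run can be
sketched as follows.  The commands simply restate those given in the sections
above; 'ref.stm' denotes the concatenated output of bn_filt.pl from Section
1.2, 'hyp.ctm' the system's concatenated hypothesis transcript, and the
intermediate file name 'hyp.ctm.sort' is hypothetical:

   # Sort the hypothesis CTM file (Section 1.1).
   sort +0 -1 +1 -2 +2nb -3 hyp.ctm > hyp.ctm.sort

   # Apply the word-mapping rules to the reference and the hypothesis
   # (Section 1.3).
   csrfilt.sh -i stm -dh et96_1.glm et96_1.utm < ref.stm      > ref.stm-filt
   csrfilt.sh -i ctm -dh et96_1.glm et96_1.utm < hyp.ctm.sort > hyp.ctm-filt

   # Align and score (Section 1.4).
   sclite -F -r ref.stm-filt stm -h hyp.ctm-filt ctm -o all lur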
2.0 Scoring Software Output
----------------------------

The sclite program not only aligns the reference and hypothesis texts, but
also generates scoring reports for each hypothesis input file.  The scoring
report file names are created by appending various extensions to the
hypothesis file name.  The following set of output files is generated via the
"-o all" command line option:

   .sys:  A summary of speaker performance in terms of Percent Correct,
          Substitutions, Deletions, Insertions, Word Errors, and Sentence (or
          Utterance) Errors.  Speaker averages, means, medians, and standard
          deviations are computed for each percentage.

   .raw:  A summary similar to the '.sys' report, except that the output is
          word counts instead of percentages.

   .pra:  A text copy of all the string alignments.

For the Hub-4 evaluation, an additional report will be produced via the
"-o lur" option:

   .lur:  A report containing a scoring summary of the system broken down
          into sub-categories for each speaker.

3.0 System Descriptions
------------------------

As part of the November 1996 CSR tests, each test site is required to
generate a description of the systems used in each Hub-4 test configuration
according to a prescribed format.  The format for the system description is
as follows:

   SITE/SYSTEM NAME
   HUB-4 {CORE/CONTRAST} TEST

   1) PRIMARY TEST SYSTEM DESCRIPTION:

   2) ACOUSTIC TRAINING:

   3) GRAMMAR TRAINING:

   4) RECOGNITION LEXICON DESCRIPTION:

   5) DIFFERENCES FOR EACH CONTRASTIVE TEST:

   6) NEW CONDITIONS FOR THIS EVALUATION:

   7) REFERENCES:

4.0 Submission of Test Results to NIST
---------------------------------------

The following describes the formats and protocols for submitting your results
to NIST for scoring.

4.1 Due Dates
--------------

ALL results for the Hub 4 Core Tests MUST be received at NIST by 0700 (EST)
Thursday, December 12 to be scored as "official".

ALL results for the Hub 4 Contrast Tests MUST be received at NIST by 0700
(EST) Thursday, December 19 to be scored as "official".

RESULTS RECEIVED AFTER THE ABOVE DEADLINES WILL BE SCORED AND INCLUDED IN THE
SUMMARY TABULATION TO BE PREPARED BY NIST FOR THE FEBRUARY WORKSHOP.
HOWEVER, THESE RESULTS WILL BE MARKED WITH THE LABEL, "LATE - (DATE OF
RECEIPT)", AND THEY WILL NOT BE CONSIDERED "OFFICIAL".

Full CSR Test Schedule/Deadlines:

   October 28               Last day to "enter or withdraw"
   November 8               Deadline for optional submission of devtest
                            results
   November 11              Distribution of evaluation test data
   December 12 (0700 EST)   Deadline for submission of core evaluation
                            results
   December 16 (0500 EST)   Post scored core test results
   December 19 (0700 EST)   Deadline for submission of contrast results
   December 23 (0500 EST)   Post scored contrast results
   February 2-5             DARPA Speech Recognition Workshop, Westfields
                            Conference Center, Chantilly, VA

4.2 Test Results Format
------------------------

The steps and format for submitting results will be the same as last year.
The submission process consists of three steps: 1) directory structure
creation, 2) system documentation and inclusion of hypothesis recognition
output, and 3) transmission of the results to NIST.  An example system
description template is given in Section 3.0, and an example of the steps
taken to create the submission directory structure is shown below, following
Step 2.

Step 1: Directory Structure Creation

Create a directory identifying your site ('SITE') from the following list,
which will serve as the root directory for all your submissions:

   att  bbn  bu  cmu  cu-con  cu-htk  dra  dragon  ibm  limsi  lucent  nyu
   ogi  philips  rutgers  sri

You should place all of your recognition test results in this directory.
When scored results are sent back to you and subsequently published, this
directory name will be used to identify your organization.

For each test system, create a sub-directory under your 'SITE' directory
identifying the system's name or key attribute.  The sub-directory name is to
consist of a free-form system identification string 'SYSID' chosen by you.
Place all files pertaining to hub/spoke tests run using a particular system
in the same SYSID directory.

Step 2: System Description and Recognition Hypothesis Output

For each hub or spoke test you run, you will need to create a system
description file, as outlined in Section 3.0, and several system output
files.  The output derived from each primary or contrastive experiment must
be placed in a file by itself.

The following file must be generated for each system used in the tests.  Only
one copy of the file need be generated if the system is used for multiple
tests/conditions:

   sys-desc.txt   (system description file)

Place your system description in the file, 'sys-desc.txt'.

The following file must be generated for each Hub 4 test condition:

   <TEST_SET>.ctm   (system output hypothesis time-marked words)

Create a system output file, '<TEST_SET>.ctm', for each primary or
contrastive test, where <TEST_SET> corresponds to the root portion of the
index file name.  The list of names is included below in Section 4.3.
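As a hedged illustration of Steps 1 and 2, the commands below create a
submission directory for a hypothetical SYSID 'sys1' under the example site
code 'sri', assuming the PEM test set (so that <TEST_SET>.ctm becomes
'et96h4.pem.ctm'):

   # Step 1: create the SITE/SYSID directory structure.
   mkdir -p sri/sys1

   # Step 2: place the system description and hypothesis file in it.
   cp sys-desc.txt    sri/sys1/
   cp et96h4.pem.ctm  sri/sys1/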
Step 3: Test Results Submission Protocol

Once you have structured all of your recognition results according to the
above format, you can then submit them to NIST.  Due to international e-mail
file size restrictions, test sites are permitted to submit results to NIST
using either e-mail or anonymous ftp.  Continental US sites may use either
method, but international sites must use the 'ftp' method.  The following
instructions assume that you are using the UNIX operating system.  If you do
not have access to UNIX utilities or ftp, please contact NIST to make
alternate arrangements.

E-mail method:

First change directory to the directory immediately above the <SITE>
directory.  Next, type the following:

   tar -cvf - ./<SITE> | compress | uuencode <SITE>-<SUB#>.tar.Z | \
   mail -s "Nov96 CSR H4 test results - <SITE>-<SUB#>" \
   jon@jaguar.ncsl.nist.gov

where <SITE> is the name of the directory created in Step 1 to identify your
site and <SUB#> is the submission number (e.g., your first submission would
be numbered '1', your second, '2', etc.).

Ftp method:

First change directory to the directory immediately above the <SITE>
directory.  Next, type the following command:

   tar -cvf - ./<SITE> | compress > <SITE>-<SUB#>.tar.Z

where <SITE> is the name of the directory created in Step 1 to identify your
site and <SUB#> is the submission number (e.g., your first submission would
be numbered '1', your second, '2', etc.).  This command creates a single file
containing all of your results.

Next, ftp to jaguar.ncsl.nist.gov giving the username 'anonymous' and your
e-mail address as the password.  After you are logged in, issue the following
set of commands (the prompt will be 'ftp>'):

   ftp> cd /pub/benchmark/nov96_csr
   ftp> binary
   ftp> put <SITE>-<SUB#>.tar.Z
   ftp> quit

You have now submitted your recognition results to NIST.  The last thing you
need to do is send an e-mail message to Jon Fiscus at
'jon@jaguar.ncsl.nist.gov' notifying NIST of your submission.  Please include
the name of your submission file in the message.

Note: If you choose to submit your results in multiple shipments, please
submit ONLY one set of results for a given test system/condition unless you
have made other arrangements with NIST.  Otherwise, NIST will programmatically
ignore duplicate files.

4.3 File and Directory Formats
-------------------------------

The following is the BNF directory structure format for CSR hypothesis
recognition results:

   <SITE>/<SYSID>/<FILES>

   where,

      SITE  ::= att | bbn | bu | cmu | ...   (use above site codes)

      SYSID ::= (short system description ID, preferably <= 8 characters)

      FILES ::= sys-desc.txt     (system description, including a reference
                                  to a paper if applicable)
              | <TEST_SET>.ctm   (file containing hypothesized words with
                                  time marks for the H4 tests)

      where,

         TEST_SET ::= et96h4.pem | et96h4.uem

The time-marked hypothesis words for the H4 tests will be placed in a single
file, called "<TEST_SET>.ctm".  The CTM file format is a concatenation of
time marks for each word in each broadcast.  Each word token must have a
broadcast id, channel identifier (1 in the case of Hub-4), start time,
duration, and case-insensitive word text.  Optionally, a confidence score can
be appended for each word.  The start time must be in seconds and relative to
the beginning of the waveform file.  The broadcast id's for the Hub-4 corpus
will be the basename of the waveform file.

The file must be sorted by the first three columns: the first and the second
in ASCII order, and the third in numeric order.  The UNIX sort command:
"sort +0 -1 +1 -2 +2nb -3" will sort the words into the appropriate order.
Lines beginning with ';;' are considered comments and are ignored.  Blank
lines are also ignored.  Included below is an example:

   ;;
   ;;  Comments follow ';;'
   ;;
   ;;  Blank lines are ignored
   ;;

   940401 1 11.34 0.2  YES -6.763
   940401 1 12.00 0.34 YOU -12.384530
   940401 1 13.30 0.5  CAN 2.806418
   940401 1 17.50 0.2  AS  0.537922

================================ END OF FILE =================================