        -----------------------------------------------------------
        Description of the CallHome telephone speech and transcript
                          corpus for Japanese
        -----------------------------------------------------------


CONTENTS

        1. Summary abstract
        2. Data acquisition
        3. Data verification
        4. Speaker demographics
        5. Word segmentation
        6. Data transcription - General
        6.A. Data transcription - Japanese-specific
        6.B. Japanese transcription symbol table

-----------------------------------------------------------------------
1.  Summary abstract

The CallHome Japanese corpus of telephone speech was collected and
transcribed by the Linguistic Data Consortium primarily in support of
the project on Large Vocabulary Conversational Speech Recognition
(LVCSR), sponsored by the U.S. Department of Defense.

This release of the CallHome Japanese corpus consists of 120
unscripted telephone conversations between native speakers of
Japanese.  The transcripts cover a contiguous 5 or 10 minute segment
(see section 2 below) taken from a recorded conversation lasting up to
30 minutes.  All speakers were aware that they were being recorded.
They were given no guidelines concerning what they should talk about.
Once a caller was recruited to participate, he/she was given a free
choice of whom to call.  Most participants called family members or
close friends overseas.  All calls originated in North America.  The
distribution of call destinations can be found in the file
"spkrinfo.tbl".

The transcripts are timestamped by speaker turn for alignment with the
speech signal, and are provided in standard orthography.

-----------------------------------------------------------------------
2.  Data acquisition

Speakers were solicited by the LDC to participate in this telephone
speech collection effort through personal contacts and the internet.
A total of 200 call originators were found, each of whom placed a
telephone call via a toll-free robot operator maintained originally by
Rutgers University, and later by the LDC.  Access to the robot
operator was possible via a unique Personal Identification Number
(PIN) issued by the recruiting staff at Rutgers or the LDC when the
caller enrolled in the project.  The participants were made aware that
their telephone call would be recorded, as were the call recipients.
The call was allowed only if both parties agreed to being recorded.
Each caller was allowed to talk up to 30 minutes.  Each caller was
allowed to place only one telephone call.

In all, 200 calls were transcribed.  Of these, 80 have been designated
as training calls, 20 as development test calls, and 100 as evaluation
test calls.  For each of the training and development test calls, a
contiguous 10-minute region was selected for transcription; for the
evaluation test calls, a 5-minute region was transcribed.  For the
present publication, only 20 of the evaluation test calls are being
released; the remaining 80 test calls are being held in reserve for
future LVCSR benchmark tests.

-----------------------------------------------------------------------
3.  Data verification

After a successful call was completed, a human audit of each telephone
call was conducted to verify that the proper language was spoken, to
check the quality of the recording, and to select and describe the
region to be transcribed.  The description of the transcribed region
provides information about channel quality, number of speakers, their
gender, and other attributes.  The information from this audit may be
found in the file "callinfo.tbl", and its contents are described in
greater detail in "callinfo.doc".

-----------------------------------------------------------------------
4.  Speaker demographics

Information on speaker demographics can be found in the file
"spkrinfo.tbl", whose contents are described in the file "spkrinfo.doc".

-----------------------------------------------------------------------
5.  Word segmentation

Segmentation of the Japanese transcripts was performed by hand at the
LDC by Megumi Kobayashi and Masayo Kaneko.  Word segmentation
principles for Japanese were formulated in collaboration with LVCSR
CallHome contractors, especially Yoshiko Ito and Paul Bamberg at
Dragon Systems, and are as follows (note, however, that words tagged
as dialect-specific, "dia", may be exceptions to the principles
below):

1.  Compounds

A compound is treated as a unitary word (i.e., not segmented).

2.  Conventionalized expressions

Common expressions in conversational Japanese are treated as
unitary.

3.  noun+'suru'

'suru' is separated from the preceding noun except in cases
where its phonological form changes in combination with the
noun, i.e. +[suru] -> +[zuru].  Cases of noun+'zuru' are
generated by the LDC transducer.

4.  noun+'na'

'na' is separated from the preceding noun; nouns and 'na' are listed
separately in the lexicon.  However, all nouns that can take +na to
become adjectives are tagged to indicate so.

5.  Auxiliaries

The following verb/adjective auxiliaries are treated as separate
words.  All seven verb stems that can combine with them are listed
separately in the lexicon as well.

        deshita         mashita         masu            nakereba
        taku            takatta         nakatta         nai             tai

6.  Particles

Particles are treated as separate words, except that multi-
particle combinations are not segmented (e.g., 'wa', 'dewa').

7.  Honorifics

Honorifics are treated as separate from the preceding word.

8.  Rendaku

Words undergoing rendaku, or sequential voicing, are treated as
unsegmented compounds.

9.  Count forms

Irregular count forms, in which there is phonological interaction
between the counter and what follows, are treated as unitary words
and not segmented.

10.  Contracted forms

Contractions are not transcribed as unitary words.  Instead, their
components are (1) listed as separate words in the lexicon (e.g.,
"itte" and "oku" separately for "ittoku"), and (2) segmented as two
words in the CallHome transcripts, followed by the contracted form and
a tag in double brackets (see 6.A.2. below).  The exception to this
principle is dialect words (tagged as "dia"), which never occur in
uncontracted form; they are transcribed in the way that best captures
their productivity.

-----------------------------------------------------------------------

6.  Data transcription - General

The initial transcription was carried out by Texas Instruments;
hand-segmentation, word standardization, kanji maximization, and other
improvements were performed at the LDC.  Below are the general
transcription instructions given to transcribers by TI:


     CALLHOME TRANSCRIPTION CONVENTIONS - General (TI)


1.  Transcribe "verbatim", without correcting grammatical errors.

2.  Do not try to imitate pronunciation details, including accents and
    mispronunciations.  Write the words that you believe the speaker
    intended, using standard orthography.

3.  Speaker identification:

    Label each speaker with A: or B: at the beginning of the line.

    Use A: for the lower speaker and B: for the upper speaker in the waveform.
    (A will be the person calling from the U.S., and B the person overseas.)

    If there is more than one speaker at one end of the conversation (e.g.
    the telephone is passed around, or multiple extensions in use), add
    numbers for each new speaker:

        B:  (the first speaker on side B)
        B1: (a different speaker)
        B2: (yet another speaker)

    Try to label the speakers consistently.  For example, if the first
    speaker returns, use "B:" again.

4.  Speaker turns:

    Begin each speaker turn on a new line.  Do not put carriage returns
    within a speaker line.  (Don't worry if the screen shows a break in the
    middle of a word.)

    Each speaker turn begins and ends with a pause.  That is, each continuous
    stretch of speech is transcribed as one turn.  Any simultaneous speech
    on the other channel is transcribed separately, after the current turn
    is completed.

    Example: (x indicates speech, - indicates silence)

    channel B:     xxxxxxxxxxxxxxxxxxxxxxxxx---------xxxxxxxxx--
    channel A:     -------xxx-----xxx-----xxxxxxxxxxxxxxx--------
    time           0      1       2       3      4      5

    sequence of turns in the transcription (times are not exact):
    0.1 3.1 B:
    1.0 1.3 A:
    2.0 2.3 A:
    3.0 5.0 A:
    4.6 5.9 B:

    A "turn" consisting entirely of noise is transcribed only if it is
    a vocal tract noise from the talker (laugh, cough, etc.) - see 7 below.
    Channel noise is NOT transcribed.
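
    For users who process the transcripts by program, the turn format
    shown above (start time, end time, speaker label, text) can be read
    with a short routine.  The following is a minimal sketch in Python,
    not an LDC tool.  The names Turn, parse_turn and channel are
    hypothetical, and the sketch assumes that every turn line carries
    both time marks and that the times are in seconds, which this
    document does not state explicitly.

```python
import re
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    start: float   # turn start time (assumed to be seconds)
    end: float     # turn end time (assumed to be seconds)
    speaker: str   # "A", "B", "B1", "B2", ...
    text: str      # transcribed content of the turn (may be empty)


# start time, end time, speaker label (A or B plus optional digits), colon, text
TURN_RE = re.compile(r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+([AB]\d*):\s*(.*)$")


def parse_turn(line: str) -> Optional[Turn]:
    """Parse one transcript line; return None if it is not a turn line."""
    match = TURN_RE.match(line.strip())
    if match is None:
        return None
    start, end, speaker, text = match.groups()
    return Turn(float(start), float(end), speaker, text)


def channel(turn: Turn) -> str:
    """Channel letter of a turn: "B1" and "B2" are both on channel "B"."""
    return turn.speaker[0]


# The turn sequence from the example above (the turn text is omitted there).
example = ["0.1 3.1 B:", "1.0 1.3 A:", "2.0 2.3 A:", "3.0 5.0 A:", "4.6 5.9 B:"]
turns = [parse_turn(line) for line in example]
print([(t.start, t.end, channel(t)) for t in turns])
```

    Keeping the full speaker label and deriving the channel from its
    first letter preserves the distinction between B, B1 and B2 while
    still allowing turns to be grouped by channel.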


5.  Simultaneous speech on the same channel:

    If two people are speaking on the same channel (an extension phone or
    a speaker phone), and if they speak simultaneously, put pound signs #
    around the words spoken simultaneously.

    Example:

        B:  #Oh, how interesting.#
        B1: #That's good news.#

    If only part of the utterance is simultaneous, mark only the part that
    is simultaneous, but transcribe the entire utterance as one turn.
    Put the other speaker's utterance on the next line, with its times.

    Example:

    10.5 12.5 B:  Well, I agree with you.  #I think# you're right.
    11.5 12.0 B1: #Oh yes, yes.#

    Note that # is used only for simultaneous speech on the same channel.
    Simultaneous speech on different channels is identifiable as such by
    reference to the time marks.
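
    The two kinds of simultaneous speech therefore have to be recovered
    in different ways: same-channel overlap from the # marks, and
    cross-channel overlap from the time marks.  A minimal sketch in
    Python, with hypothetical function names, might look like this:

```python
import re
from typing import List, Tuple

# Text enclosed in a pair of pound signs within one turn.
HASH_RE = re.compile(r"#([^#]*)#")


def same_channel_overlap(text: str) -> List[str]:
    """Return the stretches of a turn that are marked with # ... #."""
    return [span.strip() for span in HASH_RE.findall(text)]


def times_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two (start, end) intervals share any stretch of time."""
    return a[0] < b[1] and b[0] < a[1]


# Same channel: the marked part of speaker B's turn from the example above.
print(same_channel_overlap("Well, I agree with you.  #I think# you're right."))

# Different channels: compare the time marks of an A turn and a B turn.
print(times_overlap((3.0, 5.0), (4.6, 5.9)))
```

    The interval test treats turns that merely touch at an endpoint as
    non-overlapping; whether that is the right choice depends on the
    application.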

6.  Partial words:

    If a speaker does not finish a word, write as much as you heard
    and end it with a hyphen.  Put a space after the hyphen, but no space
    before it.

7.  Non-speech sounds:

    a)  Sounds made by the talker:

    When the participants in the conversation make sounds that are not
    speech, indicate them using a label between braces, for example:

        {cough}
        {laugh}

    Example:

        A:  Oh, that's funny. {laugh} {cough} Excuse me, I have a cold.

    If the talker makes one of these sounds as an entire turn, transcribe
    it and show the times, for example:

        340.0 342.0 A: {laugh}

    b)  Other sounds:

    Mark other sounds using brackets [ ].  This includes background
    noises, background speech, and noises on the line.  Mark these sounds
    only when they are clearly audible and about as loud as the speech.
    If they are hard to hear, or quieter than the speech, then ignore them.

    Also, do not transcribe noises that occur when no one on that channel
    is speaking, even if the noises are loud and clear.  For example, if
    B is speaking and there is a loud noise on channel A (which is not made
    by speaker A), do not transcribe it.

    Examples:

    A clearly audible noise occurs during speech:

        A:  Yes [noise].

    If the event being described lasts longer than a few words, then
    indicate the beginning in brackets [ ], and the end in brackets with
    a "/", [/ ].  For intermittent sounds, mark the beginning and end of the
    intermittent occurrence of the sound - not the beginning and end of
    each individual occurrence.

    Example:

        A:  Well, it all depends, uh, on, you know, [baby crying] how the
        family reacts.  I mean, it can be a positive or a negative thing,
        you know?
        B:  Yes, you're right.
        A:  So it's difficult to say what's best sometimes. [/baby crying]

    Note: Be sure to mark the end on the channel where it occurred (A, in
    the example above).  If the noise ends while the other speaker is
    talking, mark it at the end of the turn of the speaker on the same
    channel.  For example, if the baby stops crying while B is talking:

        A:  Well, it all depends, uh, on, you know, [baby crying] how the
        family reacts.  I mean, it can be a positive or a negative thing,
        you know? [/baby crying]
        B:  Yes, you're right.
        A:  So it's difficult to say what's best sometimes.
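
    When the transcripts are processed by program, talker sounds in
    braces and other sounds in single brackets can be picked out
    separately.  The sketch below is in Python and uses hypothetical
    function names.  It deliberately skips double-bracket comments, and
    it does not try to decide whether a [text] mark is a momentary sound
    or the start of a longer one; that can only be known by finding a
    later [/text] on the same channel.

```python
import re
from typing import List, Tuple

# {laugh}, {cough}: sounds made by the talker
TALKER_RE = re.compile(r"\{([^{}]+)\}")

# [noise] or [/noise], but not the double-bracket comment form [[...]]
BACKGROUND_RE = re.compile(r"(?<!\[)\[(/?)([^\[\]]+)\](?!\])")


def talker_sounds(text: str) -> List[str]:
    """Labels of the brace-marked sounds in a turn, in order."""
    return TALKER_RE.findall(text)


def background_events(text: str) -> List[Tuple[str, bool]]:
    """(label, is_end) for each single-bracket mark in a turn, in order."""
    return [(label.strip(), slash == "/")
            for slash, label in BACKGROUND_RE.findall(text)]


print(talker_sounds("Oh, that's funny. {laugh} {cough} Excuse me, I have a cold."))
print(background_events("on, you know, [baby crying] how the family reacts."))
print(background_events("what's best sometimes. [/baby crying]"))
```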

8.  Speech to someone in the background:

    If the speaker talks to someone in the background, put the speech between
    double slash marks.

    Examples:

        A: Just a minute.  // Mary, please bring me a pencil. //

        A: Sí //una llamada de// ¿quieres hablar un poquito con tu papá?

9.  When a word or phrase is not clear, type double parentheses ((  ))
    around what you think you hear.  If there is no way to tell what the
    speaker said, leave one blank space between the double parentheses,
    indicating speech has been left out because it was unintelligible.

    Examples:

        A:  So when I finally did ((take up)) the violin, I
        progressed pretty quickly in the beginning.

        B:  Of course, that was in college which was a long time
        ago, so (( )) I remember.
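
    A program reading the transcripts can separate best-guess
    stretches from fully unintelligible ones by checking whether
    anything stands between the double parentheses.  A minimal sketch
    in Python (the function name is hypothetical):

```python
import re
from typing import List

# ((best guess)) or (( )); the captured group is empty in the second case.
UNCLEAR_RE = re.compile(r"\(\(\s*(.*?)\s*\)\)")


def unclear_spans(text: str) -> List[str]:
    """Best-guess strings in a turn; "" means no guess was possible."""
    return UNCLEAR_RE.findall(text)


print(unclear_spans("So when I finally did ((take up)) the violin"))
print(unclear_spans("ago, so (( )) I remember."))
```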

10.  Comments

    To put a comment in the transcription, use double square brackets:
    [[comment]]

    Comments should be used very sparingly - only when there is no other
    way to indicate some unusual event.  Notations describing noises should
    use single brackets, not double brackets (see #7).

    Examples of comments:

    [[speaker is singing]]
    [[speaker imitates a little child]]
    [[previous word is exceptionally prolonged]]

    Comments may be used to indicate the reason for unintelligible speech.
    Example:

        (( )) [[distortion]]

    However, use such comments sparingly.  If there is consistent distortion,
    note it on the conversation summary sheet and do NOT put it in the
    transcription every time.  The same is true for mumbling, rapid speech,
    etc.  In other words, use comments only for unusual cases.

-----------------------------------------------------------------------

6.A.  Data transcription - Japanese-specific

1.  Orthography

i.  Kanji.  

Kanji representations have been maximized in the transcripts.
However, when a kanji representation appeared out of date, the word
was rendered in hiragana.

Japanese speech and punctuation should be transcribed in EUC; all
other information is in ASCII.
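
Users reading the transcripts by program need to decode them before
applying any of the conventions in this document.  The sketch below is
in Python and assumes that "EUC" here means EUC-JP (Python codec name
"euc_jp"); this document does not name the variant, so treat that as
an assumption.  The function name and the sample line are hypothetical.

```python
def decode_transcript(raw: bytes) -> str:
    """Decode transcript bytes, assuming EUC-JP (ASCII is a subset of it)."""
    return raw.decode("euc_jp")


# Hypothetical line: ASCII time marks and speaker label, then Japanese text.
# "\u306f\u3044" is the hiragana word "hai", written with escapes so that
# this example itself stays in ASCII.
sample = "1.0 1.3 A: \u306f\u3044".encode("euc_jp")
line = decode_transcript(sample)
print(len(sample), len(line))
```

Because the ASCII fields and the Japanese text share one byte stream,
decoding the whole line at once is simpler than splitting the bytes
first.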

ii.  Proper names.

Well-known names (e.g. authors, celebrities) are rendered in the kanji
by which they are known.  For less well-known names, the first name is
given in hiragana and the last name in the most commonly used kanji.

iii. Auxiliary verbs.

Auxiliary verbs are represented in hiragana in principle (e.g.,
existential iru/aru, morau, nai, kuru/iku, etc.); however, when they
are acting as main verbs, they may be represented in kanji (e.g.,
kuru/iku), unless the kanji is archaic.

iv. Numbers.

Numbers are represented in kanji.

v. English.

English alphabet symbols are retained (e.g., U C L A).


2.  Spelling

In principle, spelling variants are regularized so that there is only
one representation: either kanji (katakana) or hiragana.  Homonyms,
however, inevitably remain.  Non-contrastive variable length has been
removed as far as possible in order to reduce the number of spelling
variants.

Colloquial, contracted, and dialect-specific words and expressions
have been rendered in standardized, uncontracted form; however,
whenever the transcribers at TI had originally provided a more
phonetic-based representation, that version was retained and placed in
double brackets with a tag, and the standard or uncontracted form was
marked with the symbol "@", e.g.:

	@standard[[phonetic, dia]]
	@uncontracted words[[contracted, con]]

This is so that the phonetic information, which may be useful for
modelling pronunciation, is retained.  The tags used are:

	dia	dialect word
	col	colloquial word
	con	contracted form
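
Since each annotation pairs a standardized form with the form as
originally transcribed, the pairs can be extracted for pronunciation
modelling, and either version of the text can be rebuilt.  The
following Python sketch is an illustration only: the function names
are hypothetical, and the romanized example (taken from the "ittoku"
case in section 5) stands in for the kanji/kana text found in the
actual transcripts.

```python
import re
from typing import List, Tuple

# "@" + standardized form, then [[original form, tag]] with tag dia, col or con
ANNOT_RE = re.compile(
    r"@([^@\[\]]+?)\s*\[\[\s*([^\[\],]+?)\s*,\s*(dia|col|con)\s*\]\]"
)


def pronunciation_pairs(text: str) -> List[Tuple[str, str, str]]:
    """(standardized form, original form, tag) for each annotation."""
    return ANNOT_RE.findall(text)


def standardized(text: str) -> str:
    """Keep the standardized form and drop the "@" and the bracketed part."""
    return ANNOT_RE.sub(lambda m: m.group(1), text)


def as_spoken(text: str) -> str:
    """Substitute the originally transcribed form for the standardized one."""
    return ANNOT_RE.sub(lambda m: m.group(2), text)


turn = "@itte oku[[ittoku, con]] ne"
print(pronunciation_pairs(turn))
print(standardized(turn))
print(as_spoken(turn))
```

The pattern lets the standardized form span more than one word, since
an uncontracted form is segmented as two words in the transcripts.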

-----------------------------------------------------------------------

6.B.  Japanese transcription symbol table


    {text}		sound made by the talker

			{laugh} {cough} {sneeze} {breath}

    [text]		sound not made by the talker (background or channel)
			
			[distortion]    [background noise]      [buzz]

    [/text]		end of continuous or intermittent sound not made by
			the talker (beginning marked with previous [text])

    [[text]]		comment on preceding or following text

    @text [[text, tag]]	

			the word following "@" is the standardized
			headword; the word in double brackets [[ ]] is the 
			pronunciation-based colloquial or dialect word, or
			contractions as originally transcribed by TI.  The
			words following "@" are almost all in the lexicon 
			(the ones that aren't are excluded because no 
			standardized representation exists); the words 
			in double brackets are mostly excluded from the 
			lexicon.  The tags include:

			dia		dialect word
			con		contraction
			col		colloquial expression/spelling

    ((text))            unintelligible; text is best guess at transcription

                        ((coffee klatch))

    (( ))               unintelligible; can't even guess text

                        (( ))


    <language text>     speech in another language

                        <English California>

    <? (( ))>           ? indicates unrecognized language; (( )) indicates
                        untranscribable speech

                        <? ayo canoli>  <? (( ))>

    text=		partial word
                        
			absolu=

    #text#              simultaneous speech on the same channel
                        (simultaneous speech on different channels is not
                        explicitly marked, but is identifiable as such by
                        reference to time marks)

    //text//            aside (talker addressing someone in background)

                        //quit it, I'm talking to your sister!//

    +text+              "Japanized" foreign word or phrase, i.e., foreign
                        word or phrase assimilated to Japanese phonology.
                        Used for idiosyncratic cases; loan words in customary
                        usage are not marked.

    **text**            idiosyncratic word, not in common use, or a
                        mispronunciation; not included in lexicon.

                        **poodle-ish**

    %text               This symbol flags non-lexemes, which are
                        general hesitation sounds.  

                        %mm %uh
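
For applications that need only the words (for example, building a
word list), the symbols in this table can be stripped in a fixed
order.  The Python sketch below is one possible approach and not an
LDC tool.  The function name and the choices it makes are
assumptions: it keeps best-guess text, foreign-language text, asides
and partial words, and it drops sounds, comments, original
(bracketed) forms and hesitation sounds.  It also removes every "#",
"@" and "+" character, which would be wrong for text that uses those
characters literally.

```python
import re

# (pattern, replacement) pairs, applied in this order.
RULES = [
    (re.compile(r"\[\[.*?\]\]"), " "),            # [[comment]] or [[form, tag]]
    (re.compile(r"\{[^{}]*\}"), " "),             # {laugh}
    (re.compile(r"\[/?[^\[\]]*\]"), " "),         # [noise] and [/noise]
    (re.compile(r"<\S+\s+(.*?)>"), r"\1"),        # <English California> -> California
    (re.compile(r"\(\(\s*(.*?)\s*\)\)"), r"\1"),  # ((guess)) -> guess, (( )) -> nothing
    (re.compile(r"(?<!\S)%\S+"), " "),            # %uh, %mm hesitation sounds
    (re.compile(r"\*\*|//|[#@+]"), ""),           # paired delimiters and "@"
    (re.compile(r"(?<=\S)=(?!\S)"), ""),          # absolu= -> absolu
]


def plain_text(turn_text: str) -> str:
    """Strip the transcription symbols from one turn, leaving words only."""
    for pattern, replacement in RULES:
        turn_text = pattern.sub(replacement, turn_text)
    return " ".join(turn_text.split())  # collapse the leftover whitespace


print(plain_text("%uh ((coffee klatch)) {laugh} absolu= <English California>"))
```

Double brackets are removed before single brackets so that a comment
is never mistaken for a background sound.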

-----------------------------------------------------------------------