	-----------------------------------------------------------
	Description of the Callfriend telephone speech collection procedure for Korean.
	Description of the transcription procedure.
	-----------------------------------------------------------


CONTENTS

	1. Summary abstract
	2. Data acquisition
	3. Data verification
	4. Speaker demographics
	5. Word segmentation
	6. Data transcription - General
	   6.A. Data transcription - Korean-specific
	   6.B. Korean transcription symbol table


-----------------------------------------------------------------------
1.  Summary abstract

	The Callfriend Korean telephone speech was collected
by the Linguistic Data Consortium primarily in support of the project on 
Language Identification (LID), sponsored by the U.S. Department of Defense.
The calls were later transcribed for use in other projects.

     The recorded conversations last up to 30 minutes. All speakers were
aware that they were being recorded. They were given no guidelines
concerning what they should talk about.  Once a caller was recruited
to participate, he/she was given a free choice of whom to call.  Most
participants called family members or close friends.  All
calls originated in either the United States or Canada.


-----------------------------------------------------------------------
2.  Data acquisition

	The LDC recruited speakers for this telephone speech collection
effort via the internet, publications (advertisements), and personal
contacts.  Slightly more than 100 call originators were recruited, each
of whom placed a telephone call via a toll-free robot operator
maintained by the LDC.  Access to the robot
operator was possible via a unique Personal Identification Number
(PIN) issued by the recruiting staff at the LDC when the caller
enrolled in the project.  The participants were made aware that their
telephone call would be recorded, as were the call recipients.  The
call was allowed only if both parties agreed to being recorded.  Each
caller was allowed to talk for up to 30 minutes.  Upon successful
completion of the call, the caller was paid $20 (in addition to making
a free long-distance telephone call).  Each caller was allowed to
place only one telephone call.


-----------------------------------------------------------------------
3.  Data verification

	After a successful call was completed, a human audit of each
telephone call was conducted to verify that the proper language was
spoken, and to check the quality of the recording.  The information from
this audit may be found in the file "callinfo.tbl", and its contents
are described in greater detail in "callinfo.txt".

-----------------------------------------------------------------------
4.  Speaker demographics

        Information on speaker demographics can be found in the file
"spkrinfo.tbl", whose contents are described in the file "spkrinfo.txt".

-----------------------------------------------------------------------
5.  Word segmentation

Segmentation of the Korean transcripts was performed by hand at the LDC
by Eon-Suk Ko, Jim Oh, Tae-Seung Yoo, Jacqueline Suyeun Pyun, and
Grace Jung. Word segmentation principles for Korean were formulated by
Eon-Suk Ko in consultation with Stephanie Strassel, Nii Martey,
Chris Cieri, and David Graff, and with reference to existing Korean
dictionaries. The principles are as follows:

5.1. Compounds

If the meaning of a compound is predictable from its components, the
components are treated as separate words. Otherwise, the compound is
treated as a single word.

5.2. Frozen Expressions

Common expressions in conversational Korean are treated as units.

5.3. Light verb construction (noun+"ha")

The light verb 'ha' is separated from the preceding noun. 

5.4. Auxiliaries

Verbal and adjectival auxiliaries are treated as separate words from the
main predicate. 

5.5. Dependent nouns

Dependent nouns are treated as separate words. However, dependent nouns
that are used with numbers or month names are attached to the
preceding word.

5.6. Demonstratives

Demonstratives are normally treated as separate words from the
following nouns. However, when a demonstrative combines with a dependent
noun to form a high-frequency vocabulary item, such as 'kes' in 'i.kes'
or 'tte' in 'ku.tte', it is kept together with the following noun as a
single word.

5.7. Contracted forms

Contracted forms of suffixes such as 'nun' and 'lul' are preserved, as
are contracted nouns whose contracted form is considered grammatical.
However, contractions of other suffixes or nouns that result in
ungrammatical forms are spelled out.

(example)

nan (< na + nun) : acceptable contraction
       I    Topic
ku.chi (<ku.leh.ci) : unacceptable contraction
         so     Q
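
The contraction policy above can be pictured as a simple lookup. The
following Python fragment is only an illustrative sketch: segmentation
was performed by hand, and the table name SPELLED_OUT and the function
normalize_contraction are hypothetical names introduced here. The only
entries are the two examples given above.

```python
# Hypothetical table: unacceptable contraction -> spelled-out form.
# The single entry is the example from this section; a real table
# would have to be compiled by the transcribers.
SPELLED_OUT = {"ku.chi": "ku.leh.ci"}


def normalize_contraction(token):
    """Spell out an unacceptable contraction; keep everything else as-is.

    Acceptable contractions such as 'nan' (< na + nun) have no entry in
    SPELLED_OUT and are therefore preserved unchanged.
    """
    return SPELLED_OUT.get(token, token)
```

Under these assumptions, 'nan' is returned unchanged while 'ku.chi' is
returned as 'ku.leh.ci'.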

-----------------------------------------------------------------------
6.A. Data transcription - Korean-specific

6.A.1. Orthography

Where the orthographic form of a spoken word and its actual
pronunciation differ, the transcription follows the orthographic form.
When the mismatch between the written form and the actual pronunciation
is beyond what can be predicted by the pronunciation dictionary, the
word is marked with a '+' symbol.

6.A.2. Foreign words (mostly English)

Common English words such as 'okay' or 'New York' are treated as
loanwords and transcribed in Korean. If, however, the word is uncommon or
is pronounced with a native English accent, it is tagged with a foreign
word mark.

6.A.3. Dialect-specific words

Vocabulary items that have variable forms in specific dialects are 
regularized to a standard form and tagged with the '&' symbol.

-----------------------------------------------------------------------
6.B. Korean transcription symbol table

{text}              sound made by the speaker

                        {laugh} {cough} {sneeze} {breath} {lipsmack}

[text]              sound not made by the speaker (background or
                    channel). When the noise is continuous, its beginning
                    and end are marked with [text/] and [/text],
                    respectively.

                        [distortion] [background] [static]

((text))            unintelligible; text is best guess at transcription

			((closed))
  
(( ))               unintelligible; can't even guess text

			(( ))

<as> text </as>     talker addressing someone in the background. 


<ov> text </ov>     overlapping speech in the same channel. 

[[skip]]            a substantial portion of speech that has proven to be
		    too difficult to transcribe.

<language text>     speech in another language

		       <English sue>

text-               partial word

                        conveni-

*text               idiosyncratic word, not in common use, or a
                    mispronunciation; not included in lexicon.

                        *poodle-ish

%text               non-lexemes, which are hesitation sounds or responses
 		    during conversation that do not use clear word forms.
                        
                        %mm %uh

&text               dialect-specific pronunciation of lexical items.


@text		    acronyms that are pronounced as a single word.

		        @NASA

=text               suffixes that are normally attached to roots but
                    are separated due to context

			<English top> =이야?

~text               acronyms that are pronounced as a sequence of
                    individual letters.

			~FBI

^text               proper names and place names. Within a proper name
                    phrase, only the proper names themselves are tagged
                    with '^'.

		        ^New ^York,  ^Barnes and ^Nobles
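
For readers who process the transcripts programmatically, the symbol
table above can be approximated with a small tokenizer. The Python
sketch below is not part of the LDC distribution. The names TOKEN_RE,
PREFIX_LABELS, tokenize, and classify, and the label strings, are
assumptions made for illustration. It also assumes that annotations are
separated from ordinary words by whitespace, and it does not handle the
'+' mark described in 6.A.1, whose placement is not specified here.

```python
import re

# Prefix symbols from the table above, mapped to illustrative labels.
PREFIX_LABELS = {
    "*": "idiosyncratic",
    "%": "non_lexeme",
    "&": "dialect",
    "@": "acronym_word",
    "=": "separated_suffix",
    "~": "acronym_letters",
    "^": "proper_name",
}

# Alternatives are tried in order, so the more specific bracketed
# forms come before the generic whitespace-delimited fallback.
TOKEN_RE = re.compile(
    r"\[\[skip\]\]"            # [[skip]]: portion too hard to transcribe
    r"|\(\(.*?\)\)"            # ((text)) or (( )): unintelligible
    r"|\{[^}]*\}"              # {text}: sound made by the speaker
    r"|\[[^\]]*\]"             # [text], [text/], [/text]: other sound
    r"|</?(?:as|ov)>"          # <as>, </as>, <ov>, </ov>
    r"|<[^<>/\s]+\s[^<>]*>"    # <language text>
    r"|\S+"                    # anything else delimited by whitespace
)


def tokenize(line):
    """Split one transcript line into annotation and word tokens."""
    return TOKEN_RE.findall(line)


def classify(token):
    """Return an illustrative label for a single token."""
    if token == "[[skip]]":
        return "skipped"
    if token.startswith("((") and token.endswith("))"):
        return "unintelligible"
    if token.startswith("{"):
        return "speaker_noise"
    if token.startswith("["):
        return "background_noise"
    if token.startswith("<"):
        if re.fullmatch(r"</?(?:as|ov)>", token):
            return "span_tag"
        return "other_language"
    if len(token) > 1 and token[0] in PREFIX_LABELS:
        return PREFIX_LABELS[token[0]]
    if len(token) > 1 and token.endswith("-"):
        return "partial_word"
    return "word"


if __name__ == "__main__":
    sample = "{laugh} %uh ^New ^York ((closed)) <English sue> conveni-"
    for tok in tokenize(sample):
        print(tok, classify(tok))
```

Prefix symbols are checked before the trailing hyphen, so a token such
as *poodle-ish is labeled by its '*' prefix rather than being mistaken
for a partial word.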
-----------------------------------------------------------------------