-----------------------------------------------------------
Description of the Callfriend telephone speech collection procedure
for Korean.
Description of the transcription procedure.
-----------------------------------------------------------

CONTENTS

1. Summary abstract
2. Data acquisition
3. Data verification
4. Speaker demographics
5. Word segmentation
6. Data transcription - General
6.A. Data transcription - Korean-specific
6.B. Korean transcription symbol table

-----------------------------------------------------------------------

1. Summary abstract

The Callfriend Korean telephone speech was collected by the Linguistic
Data Consortium primarily in support of the project on Language
Identification (LID), sponsored by the U.S. Department of Defense. The
calls were later transcribed for use in other projects.

The recorded conversations last up to 30 minutes. All speakers were
aware that they were being recorded. They were given no guidelines
concerning what they should talk about. Once a caller was recruited to
participate, he/she was given a free choice of whom to call. Most
participants called family members or close friends. All calls
originated in either the United States or Canada.

-----------------------------------------------------------------------

2. Data acquisition

Speakers were solicited by the LDC to participate in this telephone
speech collection effort via the internet, publications
(advertisements), and personal contacts. A total of slightly over 100
call originators were found, each of whom placed a telephone call via a
toll-free robot operator maintained by the LDC. Access to the robot
operator was possible via a unique Personal Identification Number (PIN)
issued by the recruiting staff at the LDC when the caller enrolled in
the project. The participants were made aware that their telephone call
would be recorded, as were the call recipients. The call was allowed
only if both parties agreed to being recorded.
Each caller was allowed to talk for up to 30 minutes. Upon successful
completion of the call, the caller was paid $20 (in addition to making
a free long-distance telephone call). Each caller was allowed to place
only one telephone call.

-----------------------------------------------------------------------

3. Data verification

After a successful call was completed, a human audit of each telephone
call was conducted to verify that the proper language was spoken, and
to check the quality of the recording. The information from this audit
may be found in the file "callinfo.tbl", and its contents are described
in greater detail in "callinfo.txt".

-----------------------------------------------------------------------

4. Speaker demographics

Information on speaker demographics can be found in the file
"spkrinfo.tbl", whose contents are described in the file
"spkrinfo.txt".

-----------------------------------------------------------------------

5. Word Segmentation

Segmentation of the Korean transcripts was performed by hand at the LDC
by Eon-Suk Ko, Jim Oh, Tae-Seung Yoo, Jacqueline Suyeun Pyun, and Grace
Jung. Word segmentation principles for Korean were formulated by
Eon-Suk Ko in consultation with Stephanie Strassel, Nii Martey, Chris
Cieri, and David Graff, and with reference to existing Korean
dictionaries. They are as follows:

5.1. Compounds
    If the meaning of the compound is predictable from the components,
    they are treated as separate words. Otherwise, they are treated as
    single words.

5.2. Frozen Expressions
    Common expressions in conversational Korean are treated as units.

5.3. Light verb construction (noun+"ha")
    The light verb 'ha' is separated from the preceding noun.

5.4. Auxiliaries
    Verbal and adjectival auxiliaries are treated as separate words
    from the main predicate.

5.5. Dependent nouns
    Dependent nouns are considered separate words. However, dependent
    nouns that are used with numbers or month names are incorporated
    into the preceding words.

5.6. Demonstratives
    Demonstratives are normally considered separate words from the
    following nouns. However, when they combine with certain dependent
    nouns to form high-frequency vocabulary items, such as 'kes' in
    'i.kes' or 'tte' in 'ku.tte', they are segmented together with the
    following noun.

5.7. Contracted forms
    Contracted forms of suffixes such as 'nun' and 'lul' are preserved,
    as are nouns whose contracted form is considered grammatical.
    However, contractions of other suffixes or nouns that result in
    ungrammaticality are spelled out.

    (example)
    nan (< na + nun) : acceptable contraction
           I    Topic
    ku.chi (

text        talker addressing someone in the background.

text        overlapping speech in the same channel.

[[skip]]    a substantial portion of speech that has proven to be too
            difficult to transcribe.

            speech in another language

text-       partial word
            conveni-

*text       idiosyncratic word, not in common use, or a
            mispronunciation; not included in lexicon.
            **poodle-ish**

%text       non-lexemes, which are hesitation sounds or responses
            during conversation that do not use clear word forms.
            %mm %uh

&text       dialect-specific pronunciation of lexical items.

@text       acronyms that are pronounced as a single word.
            @NASA

=text       suffixes that are normally attached to roots but are
            separated due to context
            =이야?

~text       acronyms that are pronounced as a sequence of individual
            letters.
            ~FBI

^text       proper names and place names. Only proper names are tagged
            with '^' in a proper name phrase.
            ^New ^York, ^Barnes and ^Nobles

-----------------------------------------------------------------------
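
The prefix and suffix markers in the symbol table above lend themselves
to simple token-level processing. The following is a minimal sketch, not
an LDC tool. It assumes that transcript tokens are separated by
whitespace and that each marker applies to a single token. The function
names and category labels are hypothetical and are not part of the
corpus documentation. Markers whose symbols do not appear in the table
as given here (background speech, overlapping speech, speech in another
language) are not handled.

```python
# Prefix markers taken from the symbol table above. The category names
# are illustrative labels chosen for this sketch (an assumption).
PREFIX_MARKERS = {
    "%": "non_lexeme",
    "&": "dialect_pronunciation",
    "@": "acronym_as_word",
    "=": "separated_suffix",
    "~": "acronym_spelled_out",
    "^": "proper_name",
    "*": "idiosyncratic",
}


def classify_token(token):
    """Return (category, bare_text) for one whitespace-delimited token."""
    if token == "[[skip]]":
        return "skipped_speech", ""
    if token and token[0] in PREFIX_MARKERS:
        category = PREFIX_MARKERS[token[0]]
        # Idiosyncratic words may carry asterisks on both sides,
        # as in the **poodle-ish** example, so strip them all.
        bare = token.strip("*") if token[0] == "*" else token[1:]
        return category, bare
    if len(token) > 1 and token.endswith("-"):
        # A trailing hyphen marks a partial word, e.g. conveni-
        return "partial_word", token[:-1]
    return "plain", token


def classify_line(line):
    """Classify every token of one transcript line (assumed layout)."""
    return [classify_token(tok) for tok in line.split()]
```

For instance, classify_line("^Barnes and ^Nobles") tags the first and
last tokens as proper names and leaves "and" as a plain token. Attached
punctuation (as in "^York,") stays on the token in this sketch.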