Kobayashi, Megumi, S. Crist, M. Kaneko, and C. McLemore. 1997. LDC
Japanese Lexicon. Philadelphia: Linguistic Data Consortium, University
of Pennsylvania.

-----------------------------------------------------------
         Description of the LDC Japanese lexicon
-----------------------------------------------------------

CONTENTS

   1. Summary abstract
   2. Word segmentation
   3. Lexicon information fields
   4. Pronunciation
   5. Morphological tags
   6. Frequency information
   7. Glosses

-----------------------------------------------------------------------
1. Summary abstract

The LDC Japanese lexicon was compiled primarily for support of the
project on Large Vocabulary Conversational Speech Recognition (LVCSR),
sponsored by the U.S. Department of Defense.

The LDC Japanese lexicon consists of 80,688 words, and contains
separate information fields with morphological, phonological, and
stress information for each word.

-----------------------------------------------------------------------
2. Word segmentation

Word segmentation principles for Japanese were formulated in
collaboration with LVCSR Callhome contractors, especially Yoshiko Ito
and Paul Bamberg at Dragon Systems. Certain dialect words, tagged with
"dia" in the morphology field, are treated differently from the
principles described below (for example, dialect-specific contractions
never occur in uncontracted form, and are therefore listed in the
lexicon in the way that most captures their productivity). In general,

 1. Compounds
    A compound is treated as a unitary word (i.e., not segmented).

 2. Conventionalized expressions
    Common expressions in conversational Japanese are treated as
    unitary.

 3. noun+'suru'
    'suru' is separated from the preceding noun except in cases where
    its phonological form changes in combination with the noun, i.e.
    +[suru] -> +[zuru]. Cases of noun+'zuru' are generated by the LDC
    transducer.

 4. noun+'na'
    'na' is separated from the preceding noun; nouns and 'na' are
    listed separately in the lexicon.
    However, all nouns that can take +na to become adjectives are
    tagged in the morphological field to indicate so.

 5. Auxiliaries
    The following verb/adjective auxiliaries are treated as separate
    words. All seven verb stems that can combine with them are listed
    separately in the lexicon as well.

       deshita    mashita    masu
       nakereba   taku       takatta
       nakatta    nai        tai

 6. Particles
    Particles are treated as separate words, except that multi-particle
    combinations are not segmented (e.g., 'wa', 'dewa').

 7. Honorifics
    Honorifics are treated as separate from the preceding word.

 8. Rendaku
    Words undergoing rendaku, or sequential voicing, are treated as
    unsegmented compounds.

 9. Count forms
    Irregular count forms, in which there is phonological interaction
    between the counter and what follows, are treated as unitary words
    and not segmented.

10. Contracted forms
    Contractions are not transcribed as unitary words, but rather
    (1) listed as separate words in the lexicon (e.g., "itte" and
    "oku" separately for "ittoku"), and (2) segmented as two words in
    the Callhome transcripts, followed by a comment such as
    [[contraction]]. Note that dialect-specific forms are exceptions
    to this principle.

-----------------------------------------------------------------------
3. Lexicon information fields

The LDC Japanese lexicon contains seven tab-separated information
fields:

   Field 1: orthographic form in kanji or katakana, or hiragana if the
            word is only written in hiragana (headword)
   Field 2: orthographic form in hiragana
   Field 3: orthographic form in romaji
   Field 4: pronunciation of the headword
   Field 5: morphological analysis of the headword
   Field 6: frequency of the headword in the 80 training transcripts
            from the LDC Japanese Callhome telephone speech corpus
   Field 7: glosses of the headword

Fields 1-3 containing orthography should be self-explanatory.
Information on the other four fields is given in the sections below.

-----------------------------------------------------------------------
4. Pronunciation

The fourth field of the lexicon contains representations of
pronunciations, according to the following allophone table:

   Allophone   Roomaji     IPA or description
               (Hepburn)
   a           a           IPA a
   i           i           IPA i
   u           u           unrounded high back vowel
   e           e           unrounded front mid lax vowel
   o           o           IPA o
   k           k           IPA k
   K           ky          IPA kj
   g           g           IPA g
   G           gy          IPA gj
   s           s           IPA s
   S           sh          voiceless alveopalatal fricative
   z           z           IPA z
   J           j           voiced alveopalatal affricate
   t           t           IPA t
   C           ch          voiceless alveopalatal affricate
   c           ts          IPA ts
   d           d           IPA d
   D           j           voiced alveopalatal affricate
   n           n           IPA n
   Y           ny          IPA nj
   h           h           IPA h
   H           hy          IPA hj
   f           f           voiceless bilabial fricative
   b           b           IPA b
   B           by          IPA bj
   p           p           IPA p
   P           py          IPA pj
   m           m           IPA m
   M           my          IPA mj
   y           y           IPA j
   r           r           coronal flap
   R           ry          palatalized coronal flap
   w           w           IPA w
   N           N           placeless nasal coda
   ?           t           glottal coda

The following romaji representations were automatically converted, in
the pronunciation field, to the corresponding pronunciation
representation shown below:

   romaji rep.   pronunciation
   ei            ee              cf. comments below
   ou            oo              (should not precede oxu-u)
                                 cf. comments below
   di            ji
   du            zu
   uxe           ue
   uxi           ui
   uxo           uo
   uxa           ua
   ixe           ie
   oxu           u               (should not precede du-zu)
   exi           i               (should not precede di-ji)
   exyu          yu
   v             b               cf. comments below

Comments on the pronunciation field:

(1) The ei-ee, ou-oo conversions listed above were hand-checked,
    because there are exceptions. The ei-ee and ou-oo conversions do
    not apply when e/i and o/u are separated by a morphological
    boundary, nor do they generally apply to foreign words (e.g.,
    supein 'Spain'). When both forms (ei and ee, or ou and oo)
    appeared acceptable, both pronunciations were given in the
    pronunciation field, separated by /.

(2) Some orthographic representations in Japanese no longer match the
    current pronunciation; in particular, particles have often
    undergone some reduction (e.g., ha-wa, wo-o). Pronunciations for
    these cases have been added by hand.

(3) The foreign sound v(a) is sometimes written in a non-standard
    katakana in an effort to preserve the original sound, but in fact,
    it's generally pronounced as /b(a)/ (e.g. adviser -> adobaizaa),
    and has been represented as such in the pronunciation field.

(4) Variant pronunciations of individual words have not been added,
    but some systematic variants have been given in the pronunciation
    field:

    1. The i-y alternation and compensatory lengthening before /u/:
          simpojiumu/simpojyuumu 'symposium'
    2. Certain cases of onset insertion:
          baai/bawai 'case'
    3. Variation in the treatment of 'ia' in foreign words:
          roshia/roshiya 'Russia'
          girishia/girisha 'Greece'

Further comments on pronunciation:

The following phonological phenomena were not incorporated into the
pronunciation field, but are listed here in case they may be of use.

(1) Moraic N usually assimilates to the following consonant:

       N-bilabial -> m-lab.     toNbo    tombo    /dragonfly/
       N-coronal  -> n-cor.     kaNtaN   kantaN   /easy/
       N-dorsal   -> *ng-dor.   keNka    kengka   /fight/
                                                  *ng: velar nasal
       N          -> N          kaNtaN   kantaN   /easy/

    Also, the velar nasal 'ng' and velar 'g' alternate in low level
    phonetic realization.

(2) i,u devoicing: in general, these vowels are devoiced when they
    occur after /k,s,t,h(f),p,sh/ and before /k,s,t,h(f),p/; for
    example, /i/ in kiku, /u/ in gakusha. However, this process has
    exceptions too numerous to list exhaustively. The most notable
    exceptions are:

    - when the vowel is in an H accent mora followed by an L accent
      mora:
         /i/ in shison is not devoiced
    - when there are two adjacent moras with the same condition, only
      one of them is devoiced:
         first /i/ devoiced, second /i/ intact in kikikata
         first /i/ intact, shi devoiced in rekishiteki
    - word final k,s,t,p,sh with L accent:
         ki in aki, su in arimasu not devoiced

-----------------------------------------------------------------------
5. Morphological tags

The fifth field of the Japanese lexicon contains morphological
information about the headword.
The abbreviations used are explained below. Each word has only one
entry, with different morphological tags listed sequentially,
separated by commas, on the same line.

The inventory of morphological tags is:

   1         An ichidan (one-step) verb
   a-stem    One of the forms of five-step verbs
             e.g. yomu --> yoma (nai)
   adj       The residue of adjectives not included in 'na' and
             'i-adj'
   adv       Adverb
   alt       Used for -tari suffix, which means to alternately do one
             thing and then another
   aux       Auxiliary verb
   b5        A five-step (godan) verb in which the final stem
             consonant is b (e.g. tob-u)
   caus      Causative
   cond1     First conditional type e.g. ike-ba
   cond2     Second conditional type e.g. itta-ra
   conj      Conjunction
   contr     Contraction
   cop       Copula
   dem       Demonstrative
   desire    Used for the suffix -tai, which means 'want to'
   dia       Dialect form of a standard word (followed by dialect
             name if known)
   e-stem    One of the forms of five-step verbs
             e.g. yomu --> yome
   g5        A five-step (godan) verb in which the final stem
             consonant is g (e.g. osog-u)
   go        Takes the prefix go- as an honorific
   honor     Honorific form
   i-adj     A regular i-adjective
   i-stem    One of the forms of five-step verbs
             e.g. yomu --> yomi (masu)
   imp       Imperative
   infer     For example, 'deshou' ('(I) suppose/deduce/infer')
   interj    Interjection
   interrog  Interrogative (WH) word
   k5        A five-step (godan) verb in which the final stem
             consonant is k (e.g. ok-u)
   kansai    Kansai dialect
   kyuushuu  Kyuushuu dialect
   lets      First person plural imperative,
             e.g. yomou, yomimashou 'let's read'
   m5        A five-step (godan) verb in which the final stem
             consonant is m (e.g. yom-u)
   n5        A five-step (godan) verb in which the final stem
             consonant is n (e.g. shin-u)
   na        A na-adjective
   neg       Negative
   noun      Noun
   num       Number or counter
   o-stem    One of the forms of five-step verbs
             e.g. yomu --> yomou
   o         Takes o- prefix to form honorific form
   onom      Onomatopoeia
   part      Particle
   pass      Passive
   past      Past tense
   phrase    A fossilized phrase which we treat as one word
   pol       Polite form
   pot       Potential form
   prefix    Prefix
   pres      Present tense
   pro       Pronoun
   prop      Proper name
   r5        A five-step (godan) verb in which the final stem
             consonant is r (e.g. tor-u)
   s5        A five-step (godan) verb in which the final stem
             consonant is s (e.g. os-u)
   si        A noun which can be followed by suru where the two are
             not separate (wo cannot intervene)
   suffix    Suffix
   t5        A five-step (godan) verb in which the final stem
             consonant is t (e.g. mats-u, mat-a-nai)
   ta-stem   One of the forms of five-step verbs
             e.g. yomu --> yonda
             (This is not traditionally considered one of the five
             steps, but we generate it anyway)
   te-stem   One of the forms of five-step verbs
             e.g. yomu --> yonde
             (This is not traditionally considered one of the five
             steps, but we generate it anyway)
   u-stem    One of the forms of five-step verbs; this is the
             citation form
   vs        A noun which can be followed by suru where the two are
             separate (wo can intervene)
   w5        A five-step (godan) verb in which the final stem
             consonant is w (e.g. omo-u, omow-a-nai)
   zi        Like si (a noun which can be followed by suru where the
             two are not separate), except that suru undergoes
             phonological change --> zuru throughout conjugation.

-----------------------------------------------------------------------
6. Frequency

The frequency given in field six is with respect to the 80 training
transcripts in the LDC Japanese Callhome corpus. Frequency is computed
according to the occurrence of the kanji, katakana, or hiragana
sequence that comprises the word according to the first field in the
lexicon. Word frequency counts in other corpora were not available
because of the necessity of regularizing word segmentation (by hand,
at this point).

-----------------------------------------------------------------------
7. Glosses

Alternate glosses given in field seven of the lexicon are separated by
a single slash; glosses corresponding to different parts-of-speech, as
listed in the tag column, are separated by double slashes.

-----------------------------------------------------------------------
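
The field layout and separators described in sections 3, 4, 5, and 7
can be illustrated with a minimal Python sketch. It is not part of the
LDC distribution. The names LexiconEntry and parse_entry are
hypothetical, the sample line is invented for illustration and is not
taken from the lexicon, and the sketch assumes that the frequency
field holds a plain integer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LexiconEntry:
    headword: str              # field 1: kanji, katakana, or hiragana
    hiragana: str              # field 2
    romaji: str                # field 3
    pronunciations: List[str]  # field 4: alternates separated by "/"
    tags: List[str]            # field 5: comma-separated tags
    frequency: int             # field 6 (assumed to be an integer)
    glosses: List[List[str]]   # field 7: "//" between groups, "/" within


def parse_entry(line: str) -> LexiconEntry:
    """Split one tab-separated lexicon line into its seven fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}")
    headword, hiragana, romaji, pron, morph, freq, gloss = fields
    return LexiconEntry(
        headword=headword,
        hiragana=hiragana,
        romaji=romaji,
        pronunciations=[p.strip() for p in pron.split("/") if p.strip()],
        tags=[t.strip() for t in morph.split(",") if t.strip()],
        frequency=int(freq),
        # Split on the double slash first, then on the single slash.
        glosses=[
            [g.strip() for g in group.split("/") if g.strip()]
            for group in gloss.split("//")
        ],
    )


# Invented sample line, for illustration only.
sample = "勉強\tべんきょう\tbenkyou\tbeNKoo\tnoun,vs\t12\tstudy/learning//to study"
entry = parse_entry(sample)
print(entry.tags)
print(entry.glosses)
```

Splitting the gloss field on the double slash before the single slash
keeps the alternates for each part of speech together, so each gloss
group can be lined up with the tags in field five.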