CALLHOME Japanese Lexicon Second Edition
April 23, 2025
Linguistic Data Consortium


1. Overview
===========

This is an updated release of the CALLHOME Japanese Lexicon (LDC96L17).
The original CALLHOME lexicon was compiled by the Linguistic Data
Consortium in support of the project on Large Vocabulary Conversational
Speech Recognition (LVCSR), sponsored by the U.S. Department of Defense.

This re-release retains the same 80,689 words and associated information
(morphological, phonological, stress, lexical frequencies, etc.) from the
original release. However, the directory structure, file formats, and
documentation have been updated to modern standards.


2. Directory structure
======================

- data/lexicon.tsv -- lexicon in TSV format
- data/lexicon.dict -- a pronunciation dictionary derived from the
  lexicon; in CMUdict format
- docs/file.tbl -- listing of md5 checksums, sizes, dates, and file names
- docs/README.txt -- this file; top-level documentation for the release
- docs/lex_segm.txt -- documents the word segmentation principles used
  during transcription and preparation of the lexicon
- docs/pron.txt -- documents the conventions used by the pronunciation
  field
- docs/morph_tags.txt -- a table of morphological tags used in the
  morphological analyses
- tools/g2p -- grapheme-to-phoneme tools from the original release, used
  to automatically generate pronunciations in the lexicon


3. Word segmentation
====================

Word segmentation principles for Japanese were formulated in
collaboration with LVCSR CALLHOME contractors, especially Yoshiko Ito and
Paul Bamberg at Dragon Systems. These principles are described in
"docs/lex_segm.txt".

Certain dialect words, tagged with "dia" in the morphology field, are
exceptions to these principles; dialect-specific contractions never occur
in uncontracted form, and are therefore listed in the lexicon in the way
that best captures their productivity.


4. lexicon.tsv
==============

This is a UTF-8 encoded TSV version of the lexicon originally distributed
with LDC96L17. It contains one entry per line, each consisting of seven
tab-delimited fields:

- headword -- orthographic form in kanji or katakana (e.g., 民国); it may
  be hiragana in cases where the word is only ever written in hiragana
- hiragana -- orthographic form in hiragana (e.g., みんこく)
- romanization -- orthographic form in romaji (e.g., miNkoku)
- pron -- pronunciation of the headword (e.g., miNkoku)
- morph -- morphological analysis of the headword (e.g., noun)
- train_freq -- frequency of the headword in the 80 training transcripts
  from the CALLHOME Japanese Omnibus corpus
- gloss -- glosses of the headword (e.g., China)

The first three fields, which concern orthography, are self-explanatory.
Information on the other four fields is given in the sections below.


4.1 Field 4: pron
-----------------

This field contains the word's pronunciation and follows the principles
set forth in "docs/pron.txt". If a headword has multiple pronunciations,
they are listed sequentially on the same line, separated by "||".


4.2 Field 5: morph
------------------

This field contains the morphological analysis of the headword. If
multiple morphological analyses exist, they are listed sequentially,
separated by "||". For the full list of abbreviations used in the
morphological tags, please consult "docs/morph_tags.txt".


4.3 Field 6: train_freq
-----------------------

This field provides the frequency of the headword in the 80 training
transcripts from the CALLHOME Japanese corpus. Frequency is computed by
counting occurrences of the kanji, katakana, or hiragana sequence given
as the headword in the first field of the lexicon.


4.4 Field 7: gloss
------------------

Gloss of word. Alternative glosses in this column are separated by "/".
In cases where multiple morphological analyses exist (see 4.2), glosses
are listed separately for each morphological analysis, separated by "||".
E.g., 櫻 has two morphological analyses, the second of which has multiple
glosses:

    Headword: 櫻
    Romaji: sakura
    Morph analyses: prop || noun
    Glosses: Sakura || cherry blossom/cherry tree

**NOTE** that the original lexicon contains errors in the glosses of some
entries; no attempt has been made to detect or clean these up for this
release.


5. lexicon.dict
===============

This is a CMUdict format version of "lexicon.tsv". It consists of one
pronunciation per line, each line having the form:

    WORD\tPRON

where:

- WORD -- orthographic representation of word
- PRON -- a single pronunciation of the word, expressed as a sequence of
  phone symbols


6. Contacts
===========

If you have questions about this data release, please contact the
following LDC personnel:

    Neville Ryant
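

As an illustration of the "lexicon.tsv" layout described in section 4,
the following Python sketch splits one entry into its seven fields and
expands the "||" and "/" separators. It is not part of the release. The
names Entry, split_alternatives, and parse_line are hypothetical, and the
sketch assumes that the file has no header row, that train_freq is an
integer count, and that whitespace around "||" is not significant. In the
sample line, the hiragana, pron, and train_freq values are invented for
demonstration; the remaining values come from the 櫻 example in
section 4.4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """One lexicon.tsv entry (hypothetical container, not part of the release)."""
    headword: str
    hiragana: str
    romanization: str
    prons: List[str]           # field 4, alternatives split on "||"
    morphs: List[str]          # field 5, alternatives split on "||"
    train_freq: int            # field 6, assumed to be an integer count
    glosses: List[List[str]]   # field 7, one list of glosses per morph analysis


def split_alternatives(field: str, sep: str = "||") -> List[str]:
    """Split a field on a separator and trim surrounding whitespace."""
    return [part.strip() for part in field.split(sep)]


def parse_line(line: str) -> Entry:
    """Parse one tab-delimited line into an Entry."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 7:
        raise ValueError(f"expected 7 tab-delimited fields, got {len(cols)}")
    headword, hiragana, romanization, pron, morph, freq, gloss = cols
    return Entry(
        headword=headword,
        hiragana=hiragana,
        romanization=romanization,
        prons=split_alternatives(pron),
        morphs=split_alternatives(morph),
        train_freq=int(freq),
        # "||" separates glosses per morphological analysis;
        # "/" separates alternative glosses within one analysis.
        glosses=[split_alternatives(g, "/") for g in split_alternatives(gloss)],
    )


# Sample line: hiragana, pron, and frequency values are made up.
sample = "櫻\tさくら\tsakura\tsakura\tprop || noun\t3\tSakura || cherry blossom/cherry tree\n"
entry = parse_line(sample)
print(entry.morphs)
print(entry.glosses)
```

To read the whole file under the same assumptions, open
"data/lexicon.tsv" with encoding="utf-8" and call parse_line on each
line. Keeping glosses as a list of lists preserves the pairing between
each morphological analysis and its glosses.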