CALLHOME German Lexicon Second Edition
April 23, 2025
Linguistic Data Consortium


1. Overview
===========
This is an updated release of the CALLHOME German Lexicon (LDC97L18). The
original CALLHOME lexicon was compiled by the Linguistic Data Consortium in
support of the project on Large Vocabulary Conversational Speech Recognition
(LVCSR), sponsored by the U.S. Department of Defense.

This re-release retains the same 318,809 words and associated information
(morphological, phonological, stress, lexical frequencies, etc.) from the
original release. However, the directory structure, file formats, and
documentation have been updated to modern standards.


2. Directory structure
======================
- data/lexicon.tsv     -- lexicon in TSV format
- data/lexicon.dict    -- a pronunciation dictionary derived from the lexicon,
                          in CMUdict format
- docs/file.tbl        -- listing of md5 checksums, sizes, dates, and file names
- docs/README.txt      -- this file; top-level documentation of the release
- docs/pron.txt        -- documents the conventions used by the pronunciation
                          field
- docs/morph_tags.txt  -- documents the morphological tags used in the
                          morphological analyses


3. Construction
===============
Of the 318,809 words in the lexicon, 315,501 are adapted from the CELEX German
lexicon produced by The Centre for Lexical Information, Max Planck Institute
for Psycholinguistics in Nijmegen, and 3,308 additional words come from the 80
training and 20 development test (devtest) transcripts (10 minutes each) from
the LDC German CallHome telephone speech corpus.


4. lexicon.tsv
==============
This is a UTF-8 encoded TSV version of the lexicon originally distributed with
LDC97L18.
It contains one entry per line, each consisting of seven tab-delimited fields:

- headword    -- orthographic form (e.g., Gute)
- morph       -- morphological analysis of the headword
                 (e.g., Gut+Noun+Neut+Dat+Sg)
- pron        -- pronunciation of the headword (e.g., gUt&)
- stress      -- primary stress information of the word (e.g., 10)
- celex       -- whether the headword appears in the CELEX German lexicon
- train_freq  -- frequency of the headword in the 80 training transcripts of
                 the CALLHOME German corpus
- dev_freq    -- frequency of the headword in the 20 development transcripts
                 of the CALLHOME German corpus

Each of these fields is described in more detail in the sections below.


4.1 Field 1: headword
---------------------

4.1.1 Orthographic convention
-----------------------------
The general orthographic convention followed in this lexicon is that of
standard German as shown in the Duden edition of the ``Deutsches Universal
Wörterbuch`` (1989). An exception to this is the marking of compound words.
German orthography writes these as either a single word without spaces, or two
(or more) words separated by a hyphen.

If a compound word is written without a hyphen in German orthography, it is
written with an underscore in this lexicon. For example:

    Standard German       LDC lexicon
    Abenddämmerungen      Abend_dämmerungen
    Abendprogramm         Abend_programm
    Abendschule           Abend_schule

If a compound word is written with a hyphen in German orthography, the word is
written with a hyphen followed by an underscore in this lexicon. For example:

    Standard German       LDC lexicon
    A-negativ             A-_negativ
    E-Mail                E-_Mail
    Web-Seite             Web-_Seite

4.1.2 On German compounds
-------------------------
German has a propensity to create compound words in nouns, verbs, and
adjectives.
German employs three separate strategies for orthographically representing
compounds:

    Orthographic practice     Example
    - wordword                freinahmen, Taxifahrer
    - word-word               Prostata-Infektion
    - word word               Goethe Haus

The most common orthographic convention is to run the two elements of a
compound word together without any spaces. In this lexicon, the parts of
compound words are indicated with an underscore ``_`` between the compound
elements if the compound is written without a dash or space between the
elements. If the common orthographic representation has a dash between the
compound elements, this lexicon indicates this with a dash-underscore ``-_``
between the compound elements. If common German orthographic practice is to
include a space between compound elements, then these elements appear as
separate entries in the lexicon.

The underscore between compound elements is included so that, if desired, the
elements of a compound can be separated out into individual words.


4.2 Field 2: morph
------------------
This field contains the morphological analysis of the headword. Each
morphological analysis consists of a sequence of tags separated by "+"; e.g.:

    zivilisieren+Verb+3P+Sg+Ind+Pres

If multiple morphological analyses exist, they are listed sequentially,
separated by "//"; e.g.:

    zivilisieren+Verb+3P+Sg+Ind+Pres//zivilisieren+Verb+2P+Pl+Ind+Pres

In cases where multiple entries with the same headword have been collapsed,
there will be one set of analyses per entry, separated by "||". E.g.,
"zivilisiert" has two entries, the first of which contains multiple
morphological analyses:

    Headword: zivilisiert
    Morph analyses: zivilisieren+Verb+3P+Sg+Ind+Pres//zivilisieren+Verb+2P+Pl+Ind+Pres//zivilisieren+Verb+Imp+Pl//zivilisieren+Verb+Part+Past || zivilisiert+Adj

For a full listing of the tagset used, consult "docs/morph_tags.txt".
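
The headword and morph conventions in sections 4.1 and 4.2 can be unpacked
with a few string operations. The following Python sketch is illustrative only
and is not part of the release; the function names are hypothetical, and it
assumes that "_" appears in headwords only as a compound marker and that "+",
"//", and "||" appear in the morph field only as separators.

```python
def split_compound(headword):
    """Split a headword into compound elements at the "_" markers.

    A hyphenated compound is marked "-_", so the hyphen stays attached
    to the element that precedes it.
    """
    return headword.split("_")


def standard_orthography(headword):
    """Recover the standard German spelling by dropping the "_" markers."""
    return headword.replace("_", "")


def parse_morph(field):
    """Parse a morph field into a nested list: entries -> analyses -> tags.

    "||" separates collapsed entries, "//" separates the analyses of one
    entry, and "+" separates the tags of one analysis.
    """
    entries = []
    for entry in field.split("||"):
        analyses = [a.strip().split("+") for a in entry.split("//") if a.strip()]
        entries.append(analyses)
    return entries


if __name__ == "__main__":
    print(split_compound("Abend_programm"))
    print(standard_orthography("Web-_Seite"))
    print(parse_morph("zivilisieren+Verb+Part+Past || zivilisiert+Adj"))
```

In each analysis returned by parse_morph, the first element is the lemma and
the remaining elements are tags from "docs/morph_tags.txt".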


4.3 Field 3: pron
-----------------
This field contains the word's pronunciation and follows the principles set
forth in "docs/pron.txt". If a headword has multiple pronunciations, they are
listed sequentially on the same line, separated by "||". When multiple
pronunciations exist, each will have a distinct set of morphological analyses
(see 4.2) whose order mirrors the order presented in this field.


4.4 Field 4: stress
-------------------
This field contains information about primary word stress. Each syllable of
the word is indicated by a number, with unstressed syllables indicated by "0"
and the stressed syllable indicated by "1". Some words that come from the
CELEX German lexicon have two primary stresses indicated (following CELEX
practice). Words from the CALLHOME German Corpus have only one stress per word
indicated. Alternate pronunciations separated by "||" also have corresponding
alternate stress patterns separated by "||".


4.5 Field 5: celex
------------------
This field indicates whether the headword appears in the CELEX German lexicon.
A "1" indicates that it is found in CELEX, and a "0" indicates that it is not.


4.6 Field 6: train_freq
-----------------------
This field provides the frequency of the headword in the 80 training
transcripts from the CALLHOME German Corpus.

Words that have both an uppercase and a lowercase form (such as an adjective
and its derived noun) present a problem for frequency counts, since German
uses capitalization both to mark the beginning of a sentence and to mark all
nouns. This lexicon uses the count for the same-case form if it is non-zero,
but falls back to reporting the capitalized count for a lowercase entry if the
lowercase form is not present in the transcripts. This means that the
frequency counts may be inaccurate for adjective-noun or verb-noun pairs that
differ only in initial capitalization.
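
The fallback rule just described can be sketched in Python as follows. This is
only an illustration of the rule as stated in this section, not the procedure
actually used to build the lexicon; the function name and the toy token counts
are made up for the example.

```python
from collections import Counter


def lexicon_count(headword, token_counts):
    """Return the count the fallback rule would report for a headword.

    token_counts maps transcript tokens (case-sensitive) to raw counts.
    """
    same_case = token_counts.get(headword, 0)
    if same_case:
        # The same-case form occurs in the transcripts: use its count.
        return same_case
    if headword[:1].islower():
        # Lowercase entry with no lowercase tokens: report the count of
        # the capitalized form instead.
        capitalized = headword[:1].upper() + headword[1:]
        return token_counts.get(capitalized, 0)
    return 0


if __name__ == "__main__":
    # Hypothetical toy transcript in which only the capitalized form occurs.
    counts = Counter(["Essen", "Essen", "gut"])
    print(lexicon_count("Essen", counts))
    print(lexicon_count("essen", counts))
```

With these toy counts, the lowercase entry inherits the count of the
capitalized tokens even though the lowercase form itself never occurs.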

For example, suppose the lexicon contains two entries, a noun ``Foo`` and a
verb ``foo``, and consider a few cases:

  -- both occur, and the verb never starts a sentence
     => frequency counts are correct
  -- only the noun occurs
     => the noun Foo has correct frequency, but the verb foo (incorrectly)
        gets the same frequency
  -- only the verb occurs, and never at the beginning of a sentence
     => frequency counts are correct
  -- only the verb occurs, and always at the beginning of a sentence
     => the verb foo has correct frequency, but the noun Foo (incorrectly)
        gets the same frequency
  -- sometimes the verb starts a sentence, sometimes not
     => only the lowercase occurrences are counted for the verb entry, while
        capitalized occurrences of the verb end up getting counted as
        occurrences of the noun

In other words, frequency counts for words that have both lowercase and
capitalized entries in the lexicon cannot be expected to be reliable. However,
if a word does occur in the transcripts, its frequency count is guaranteed to
be non-zero. (The converse is not true: a word may have a non-zero count when
only the opposite-case word occurred in the transcripts.)

This problem could be solved by:

- eliminating frequency information from the lexicon,
- assigning part of speech to each ambiguous word in the transcripts,
- using non-standard capitalization in the transcripts, or
- collapsing the capitalized and lowercase lexicon entries,

but these solutions are either impractical or inconsistent with the general
CALLHOME practice of using standard orthography.


4.7 Field 7: dev_freq
---------------------
This field contains the frequency of the headword in the 20 development
transcripts from the CALLHOME German Corpus.


5. lexicon.dict
===============
This is a CMUdict format version of "lexicon.tsv".
It consists of one pronunciation per line, each line having the form:

    WORD PRON

where:

- WORD  -- orthographic representation of the word
- PRON  -- a single pronunciation of the word, expressed as a space-delimited
           sequence of phone symbols


6. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant