Karins, K., R. MacIntyre, M. Brandmair, S. Lauscher, and C. McLemore.
LDC German Lexicon. Philadelphia: Linguistic Data Consortium,
University of Pennsylvania.

CONTENTS

1. Summary abstract
2. Lexicon information fields
3. Orthographic convention
4. Morphological tags
5. Phonology table
6. Stress information
7. Word source and frequency
8. On German compounds

-----------------------------------------------------------------------

1. Summary abstract

The LDC German lexicon was compiled primarily to support the project on
Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by
the U.S. Department of Defense.

This lexicon consists of 318,807 words. Of these, 315,503 words are
adapted from the CELEX German lexicon produced by The Centre for Lexical
Information, Max Planck Institute for Psycholinguistics in Nijmegen, and
3,304 additional words come from the 80 training and 20 development test
(devtest) transcripts (10 minutes each) from the LDC German CallHome
telephone speech corpus.

The LDC German lexicon contains tab-separated information fields with
orthographic, morphological, phonological, stress, source, and frequency
information for each word.

-----------------------------------------------------------------------

2. Lexicon information fields

The LDC German lexicon contains seven tab-separated information fields:

Field 1: orthographic form (headword)
Field 2: morphological analysis of the headword
Field 3: pronunciation of the headword
Field 4: primary stress information of the headword
Field 5: presence of the headword in the CELEX German lexicon
         (1 or 0; see Section 7)
Field 6: word frequency in the 80 LVCSR training transcripts
Field 7: word frequency in the 20 LVCSR devtest transcripts

In the field containing morphological information, alternate
morphological analyses are separated by two slashes "//". This is common
for nominal, adjectival and verbal forms, which can often represent
several different morphological inflections.

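As a minimal sketch of reading this layout, the following Python splits
one lexicon line into the seven fields and separates the "//" analyses.
The names LexiconEntry and parse_line, and the sample line (whose
analysis, pronunciation and stress values are placeholders), are
assumptions made for illustration and are not taken from the lexicon or
from any LDC tool. Treating fields 6 and 7 as integers is also an
assumption.

```python
from typing import List, NamedTuple


class LexiconEntry(NamedTuple):
    headword: str          # field 1
    analyses: List[str]    # field 2, split on "//"
    pronunciation: str     # field 3
    stress: str            # field 4
    celex: str             # field 5, kept as a raw string
    training_count: int    # field 6 (assumed to be an integer)
    devtest_count: int     # field 7 (assumed to be an integer)


def parse_line(line: str) -> LexiconEntry:
    """Split one tab-separated lexicon line into its seven fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    headword, morph, pron, stress, celex, train, devtest = fields
    return LexiconEntry(
        headword=headword,
        analyses=morph.split("//"),
        pronunciation=pron,
        stress=stress,
        celex=celex,
        training_count=int(train),
        devtest_count=int(devtest),
    )


# Hypothetical line; the field values are placeholders, not lexicon data.
SAMPLE = "Abend_schule\tANALYSIS_1//ANALYSIS_2\tPRON\tSTRESS\t1\t0\t0"
entry = parse_line(SAMPLE)
print(entry.headword)
print(entry.analyses)
```

Entries that were collapsed with "||" are not handled by this sketch;
they would need to be split on "||" before the "//" split.
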
In many instances, the CELEX German lexicon provides multiple entries
for the same orthographic representation. In all such instances, we
collapse the multiple entries into a single entry and indicate alternate
morphological interpretations with "||" instead of "//". Where the CELEX
German lexicon provides different pronunciations in multiple entries of
the same orthographic form, we preserve this information in our
collapsed entries by separating the alternate pronunciations with a
"||". In all instances, the "||"-separated pronunciations mirror the
"||"-separated morphological information for a given orthographic word
entry. This applies to the stress information field as well.

-----------------------------------------------------------------------

3. Orthographic convention

The general orthographic convention followed in this lexicon is that of
standard German as shown in the Duden edition of the "Deutsches
Universal Wörterbuch" (1989).

An exception to this is the marking of compound words. German
orthography writes these as either a single word without spaces, or two
(or more) words separated by a hyphen. If a compound word is written
without a hyphen in German orthography, it is written with an underscore
in this lexicon. For example:

    Standard German         LDC lexicon

    Abenddämmerungen        Abend_dämmerungen
    Abendprogramm           Abend_programm
    Abendschule             Abend_schule

If a compound word is written with a hyphen in German orthography, the
word is written with a hyphen followed by an underscore in this lexicon.
For example:

    Standard German         LDC lexicon

    A-negativ               A-_negativ
    E-Mail                  E-_Mail
    Web-Seite               Web-_Seite

-----------------------------------------------------------------------

4. Morphological tags

The second information field of the German lexicon contains
morphological information about the headword. The abbreviations used are
explained below. Different bits of information in this field are
separated by a slash "/".
If there is more than one possible morphological parse for a given word,
the different parses are separated by two slashes "//". See Section 2
above for information on our use of "||" in this field.

The first entry for any morphological tag is the base form of the
headword. Following this is a "+" between all bits of morphological
information.

    +1P        first person
    +2P        second person
    +3P        third person
    +Acc       accusative
    +Adj       adjective
    +Adv       adverb
    +Art       article (determiner)
    +Cmpnd     compound
    +Comp      comparative form
    +Conj      conjunction
    +Dat       dative
    +Fem       feminine
    +Gen       genitive
    +Imp       imperative
    +Ind       indicative
    +Inf       infinitive
    +Interj    interjection
    +Masc      masculine
    +Neut      neuter
    +Nom       nominative
    +Noun      noun
    +Num       number or quantifier
    +Part      participle
    +Past      past
    +Pl        plural
    +Place     place name
    +Prep      preposition
    +Pres      present
    +Pron      pronoun
    +Prop      proper noun
    +Sg        singular
    +Subj      subjunctive
    +Suff_e    adjective, numeral, or pronoun with -e suffix
    +Suff_em   adjective, numeral, or pronoun with -em suffix
    +Suff_en   adjective, numeral, or pronoun with -en suffix
    +Suff_er   adjective, numeral, or pronoun with -er suffix
    +Suff_es   adjective, numeral, or pronoun with -es suffix
    +Suff_s    adjective, numeral, or pronoun with -s suffix
               (keins, deins, etc.)
    +Sup       superlative
    +Verb      verb
    +ZuInf     infinitive with "zu"

-----------------------------------------------------------------------

5. Phonology table

The third field in the lexicon contains the pronunciation of each
headword. The phonetic symbols used are adapted from the International
Phonetic Alphabet (IPA). The symbol used, its phonetic description, and
an example word from German are provided in the table below. This
lexicon does not contain alternate/dialectal pronunciations of words.

Symbol  Description                                 Example
_________________________________________________________________
a       IPA script a                                hat
e       mid front lax vowel (IPA epsilon)           Bett
i       high front lax vowel (IPA I)                Mitte
o       IPA open o                                  Glocke
u       high back lax vowel (like English 'shoot')  Pult
A       IPA script a:                               Klar
@       IPA e: (long mid front tense vowel)         Mehl
E       long mid front lax vowel (long IPA epsilon) Käse
I       IPA i:                                      Lied
O       IPA o:                                      Boot
U       IPA u:                                      Hut
W       IPA small null-set sign (long)              Möbel
        (rounded tense mid front vowel)
w       IPA "oe" pushed together as one glyph       Götter
        (rounded lax mid front vowel)
Y       IPA y (rounded tense high front vowel)      für
y       IPA small capital Y                         Pfütze
        (rounded lax high front vowel)
&       schwa (unstressed central mid lax           Beginn (1)
        unrounded vowel)
ai      IPA ai or aj                                weit
au      IPA au or aw                                Haut
oi      IPA oi or oj                                freut

p       IPA p                                       Pakt
b       IPA b                                       Bad
t       IPA t                                       Tag
d       IPA d                                       dann
k       IPA k                                       kalt
g       IPA g                                       Gast
G       velar nasal                                 Klang
m       IPA m                                       Maß
n       IPA n                                       Nacht
l       IPA l                                       Last
r       IPA r                                       Ratte
f       IPA f                                       falsch
v       IPA v                                       Welt
s       IPA s                                       Gas
z       IPA z                                       Suppe
S       voiceless alveopalatal fricative            Schiff
Z       voiced alveopalatal fricative               Genie
j       IPA j (palatal glide)                       Jacke
x       IPA x or c-cedilla                          Bach, ich
        (voiceless velar or palatal fricative)
h       IPA h                                       Hand
pf      IPA pf                                      Pferd
ts      IPA ts                                      Zahl
tS      voiceless alveopalatal affricate            Matsch
dZ      voiced alveopalatal affricate               Gin

Consonant found only in loan words:

V       IPA w (voiced labio-velar glide)            Waterproof

Vowels found only in loan words:

$       nonnasalized low front unrounded vowel      Ragtime
w~      nasalized rounded lax mid front vowel       Parfum
a~      nasalized unrounded low back vowel          De'tente
$~      nasalized low front unrounded vowel         impromptu
$~      nasalized low front unrounded vowel         Bassin
O~      nasalized low rounded back vowel            Bouillon

-----------------------------------------------------------------------

6. Stress information

The fourth information field in the lexicon contains information about
the primary word stress of each headword.
Each syllable of the word is indicated by a number, with unstressed
syllables indicated by "0" and the stressed syllable indicated by "1".
Some words that come from the CELEX German lexicon have two "primary"
stresses indicated (following CELEX practice). Words which were added to
the lexicon from the 80 training and 20 devtest transcripts have only
one stress per word indicated.

-----------------------------------------------------------------------

7. Word source and frequency

Information on word source and frequency is indicated in fields 5-7 of
the lexicon.

NB: Words that have both an uppercase and a lowercase form (such as an
adjective and its derived noun) present a problem for frequency counts,
since German uses capitalization both to indicate the beginning of a
sentence and to mark all nouns. This lexicon uses the count for the
same-case form if non-zero, but resorts to reporting the capitalized
count for a lowercase entry if the lowercase form is not present in the
transcripts. This means that the frequency counts may be inaccurate for
adjective-noun or verb-noun pairs that differ only in initial
capitalization.

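The counting rule just described can be sketched in Python as follows.
This is only an illustration of the rule as stated here, not the code
used to build the lexicon; the function name reported_count and the idea
of a Counter holding surface-form counts from the transcripts are
assumptions.

```python
from collections import Counter


def reported_count(entry: str, surface_counts: Counter) -> int:
    """Return the frequency the lexicon would report for one entry.

    surface_counts maps each word form exactly as written in the
    transcripts (including sentence-initial capitalization) to its count.
    """
    same_case = surface_counts[entry]
    if same_case > 0:
        # The same-case form occurs: report its count directly.
        return same_case
    if entry[:1].islower():
        # Lowercase entry absent from the transcripts: fall back to
        # the count of the capitalized form.
        return surface_counts[entry[:1].upper() + entry[1:]]
    # Capitalized entry that never occurs: no fallback is described.
    return 0


# Hypothetical transcript counts for a verb-noun pair.
only_capitalized = Counter({"Essen": 3})
both_forms = Counter({"Essen": 1, "essen": 4})

print(reported_count("essen", only_capitalized))
print(reported_count("Essen", both_forms))
```

With only the capitalized form in the transcripts, both "Essen" and
"essen" receive the same count, which is the source of the inaccuracy
noted above.
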
For example, suppose the lexicon contains 2 entries, a noun "Foo" and a
verb "foo", and consider a few cases:

-- both occur, and the verb never starts a sentence
   => frequency counts are correct
-- only the noun occurs
   => the noun Foo has correct frequency, but the verb foo
      (incorrectly) gets the same frequency
-- only the verb occurs, and never at the beginning of a sentence
   => frequency counts are correct
-- only the verb occurs, and always at the beginning of a sentence
   => the verb foo has correct frequency, but the noun Foo
      (incorrectly) gets the same frequency
-- sometimes the verb starts a sentence, sometimes not
   => only the lowercase occurrences are counted for the verb entry,
      while capitalized occurrences of the verb end up getting counted
      as occurrences of the noun

In other words, frequency counts for words that have both lowercase and
capitalized entries in the lexicon cannot be expected to be reliable.
However, if a word does occur in the transcripts, its frequency count is
guaranteed to be non-zero. (The converse is not true: a word may have a
non-zero count when only the opposite-case word occurred in the
transcripts.)

This problem could be solved by:

- eliminating frequency information from the lexicon,
- assigning part of speech to each ambiguous word in the transcripts,
- using non-standard capitalization in the transcripts, or
- collapsing the capitalized and lowercase lexicon entries,

but these solutions are either impractical or inconsistent with the
general CallHome practice of using standard orthography.

CELEX: The fifth tab-separated field in the lexicon contains information
about whether the headword appears in the CELEX German lexicon. A "1"
indicates that it is found in the lexicon, and a "0" indicates that the
word is not found.

TRAINING: The sixth tab-separated field in the lexicon contains the
number of occurrences of the headword in the 80 LVCSR training
transcripts.

DEVTEST: The seventh tab-separated field in the lexicon contains the
number of occurrences of the headword in the 20 LVCSR development test
(devtest) transcripts.

-----------------------------------------------------------------------

8. On German compounds

German readily forms compound nouns, verbs, and adjectives. German
employs three separate strategies for orthographically representing
compounds:

    Orthographic practice   Example

    - wordword              freinahmen, Taxifahrer
    - word-word             Prostata-Infektion
    - word word             Goethe Haus

The most common orthographic convention is to run the two elements of a
compound word together without any spaces. In this lexicon, we indicate
the parts of compound words with an underscore "_" between the compound
elements if the compound is written without a dash or space between the
elements. If the common orthographic representation has a dash between
the compound elements, this lexicon indicates this with a
dash-underscore "-_" between the compound elements. If common German
orthographic practice is to include a space between compound elements,
then these elements appear as separate entries in the lexicon.

The underscore is included between compound elements so that, if
desired, compounds can be separated into their individual words. This
may be useful for the scoring of such words in the LVCSR project.
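
The separation described above might be sketched in Python as follows.
The function names are hypothetical, and the sketch assumes that an
underscore occurs in a headword only as the compound marker.

```python
from typing import List


def compound_elements(headword: str) -> List[str]:
    """Split a lexicon headword into its compound elements.

    A trailing hyphen left over from the "-_" marker is dropped, so
    "Web-_Seite" yields "Web" and "Seite".
    """
    return [part.rstrip("-") for part in headword.split("_")]


def standard_orthography(headword: str) -> str:
    """Recover the standard German spelling by removing the markers."""
    return headword.replace("_", "")


print(compound_elements("Abend_schule"))
print(compound_elements("Web-_Seite"))
print(standard_orthography("Abend_dämmerungen"))
print(standard_orthography("Web-_Seite"))
```

Removing only the underscore keeps the hyphen of "-_" compounds, so the
standard spelling is recovered in both cases.
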