CALLHOME German Lexicon Second Edition

                              April 23, 2025

                          Linguistic Data Consortium


1. Overview
===========
This is an updated release of the CALLHOME German Lexicon (LDC97L18). The
original CALLHOME lexicon was compiled by the Linguistic Data Consortium in
support of the project on Large Vocabulary Conversational Speech Recognition
(LVCSR), sponsored by the U.S. Department of Defense.

This re-release retains the same 318,809 words and associated information
(morphological, phonological, stress, lexical frequencies, etc.) from the
original release. However, the directory structure, file formats, and
documentation have been updated to modern standards.


2. Directory structure
======================
- data/lexicon.tsv  --  lexicon in TSV format
- data/lexicon.dict  --  a pronunciation dictionary derived from lexicon;
  in CMUdict format
- docs/file.tbl  --  listing of md5 checksums, sizes, dates, and file names
- docs/README.txt  --  this file; top-level documentation of the release
- docs/pron.txt  --  documents the conventions used by the pronunciation field
- docs/morph_tags.txt  --  documents the morphological tags used in the
  morphological analyses


3. Construction
===============
Of the 318,809 words in the lexicon, 315,501 are adapted from the CELEX German
lexicon produced by The Centre for Lexical Information, Max Planck Institute
for Psycholinguistics in Nijmegen, and 3,308 additional words come from the 80
training and 20 development test (devtest) transcripts (10 minutes each) from
the LDC CALLHOME German telephone speech corpus.


4. lexicon.tsv
==============
This is a UTF-8 encoded TSV version of the lexicon originally distributed with
LDC97L18. It contains one entry per line, each consisting of seven tab-delimited
fields:

- headword  --  orthographic form (e.g., Gute)
- morph  --  morphological analysis of the headword (e.g.,
  Gut+Noun+Neut+Dat+Sg)
- pron  --  pronunciation of the headword (e.g., gUt&)
- stress  --  primary stress information of the word (e.g., 10)
- celex  --  whether the headword appears in the CELEX German lexicon
- train_freq  --  frequency of the headword in the 80 training transcripts of
  the CALLHOME German corpus
- dev_freq  --  frequency of the headword in the 20 development transcripts of
  the CALLHOME German corpus

Each of these fields is described in more detail in the sections below.
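
As a rough illustration of the layout above, the following Python sketch reads
the seven tab-delimited fields into named tuples. The names ``LexiconEntry``
and ``read_lexicon`` are invented for this example and are not part of the
release. The sketch assumes that every data line has exactly seven fields and
keeps all values as strings; if the file turns out to begin with a header
line, that line would be returned as an ordinary entry.

```python
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    """One line of lexicon.tsv (hypothetical helper type, not part of the release)."""
    headword: str
    morph: str
    pron: str
    stress: str
    celex: str
    train_freq: str
    dev_freq: str


def read_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield one LexiconEntry per line that has exactly seven tab-delimited fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # skip blank or malformed lines (assumption: none are expected)
        yield LexiconEntry(*fields)
```

A typical call would open the file as UTF-8, for example
``open("data/lexicon.tsv", encoding="utf-8")``, and pass the file object to
``read_lexicon``.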


4.1 Field 1: headword
---------------------
4.1.1 Orthographic convention
-----------------------------
The general orthographic convention followed in this lexicon is that of
standard German as shown in the Duden edition of the ``Deutsches Universal
Wörterbuch`` (1989).

An exception to this is the marking of compound words. German orthography
writes these as either a single word without spaces, or two (or more) words
separated by a hyphen. If a compound word is written without a hyphen in
German orthography, it is written with an underscore in this lexicon. For
example:

        Standard German                 LDC lexicon

        Abenddämmerungen                Abend_dämmerungen
        Abendprogramm                   Abend_programm
        Abendschule                     Abend_schule

If a compound word is written with a hyphen in German orthography, the word is
written with a hyphen followed by an underscore in this lexicon. For example:

        Standard German                 LDC lexicon

        A-negativ                       A-_negativ
        E-Mail                          E-_Mail
        Web-Seite                       Web-_Seite


4.1.2 On German compounds
-------------------------
German has a propensity to create compound words in nouns, verbs, and
adjectives. German employs three separate strategies for orthographically
representing compounds:

     Orthographic practice      Example

        - wordword              freinahmen, Taxifahrer
        - word-word             Prostata-Infektion
        - word word             Goethe Haus

The most common orthographic convention is to run the elements of a compound
together without any spaces. In this lexicon, the parts of compound words
are indicated with an underscore ``_`` between the compound elements if the
compound is written without a hyphen or space between the elements. If the
common orthographic representation has a hyphen between the compound elements,
this lexicon indicates this with a hyphen-underscore ``-_`` between the
compound elements. If common German orthographic practice is to include a
space between compound elements, then these elements appear as separate
entries in the lexicon.

The underscore is included between the elements of a compound so that, if
desired, the compound can be split into its individual words.
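
The following Python sketch shows one way the underscore convention might be
used, both to split a headword into its elements and to recover the standard
German spelling. The function names are invented for this example. The sketch
assumes that ``_`` occurs in headwords only as a compound boundary marker.

```python
from typing import List


def split_compound(headword: str) -> List[str]:
    """Return the compound elements of a headword.

    Both "_" (written solid) and "-_" (written with a hyphen) mark a
    boundary, so the hyphen-underscore is normalized to "_" before splitting.
    """
    return headword.replace("-_", "_").split("_")


def standard_orthography(headword: str) -> str:
    """Remove boundary underscores, leaving any hyphen in place."""
    return headword.replace("_", "")
```

For example, ``split_compound("Abend_schule")`` gives ``["Abend", "schule"]``
and ``standard_orthography("Web-_Seite")`` gives ``"Web-Seite"``.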


4.2 Field 2: morph
------------------
This field contains the morphological analysis of the headword. Each
morphological analysis consists of a sequence of tags separated by "+"; e.g.:

    zivilisieren+Verb+3P+Sg+Ind+Pres

If multiple morphological analyses exist, they are listed sequentially,
separated by "//"; e.g.:

    zivilisieren+Verb+3P+Sg+Ind+Pres//zivilisieren+Verb+2P+Pl+Ind+Pres

In cases where multiple entries with the same headword have been collapsed,
there will be one set of analyses per entry, separated by "||". E.g.,
"zivilisiert" has two entries, the first of which contains multiple
morphological analyses:

    Headword:       zivilisiert
    Morph analyses: zivilisieren+Verb+3P+Sg+Ind+Pres//zivilisieren+Verb+2P+Pl+Ind+Pres//zivilisieren+Verb+Imp+Pl//zivilisieren+Verb+Part+Past || zivilisiert+Adj

For a full listing of the tagset used, consult "docs/morph_tags.txt".
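
The nesting of the three separators can be unpacked as in the Python sketch
below. The function name ``parse_morph`` is invented for this example. The
sketch trims whitespace around separators, since the "zivilisiert" example
above shows spaces around "||"; it makes no attempt to interpret the tags
themselves.

```python
from typing import List


def parse_morph(field: str) -> List[List[List[str]]]:
    """Split a morph field into collapsed entries, analyses, and tags.

    The outer list has one item per "||"-separated entry, each entry has one
    item per "//"-separated analysis, and each analysis is its list of
    "+"-separated tags.
    """
    return [
        [analysis.strip().split("+") for analysis in group.split("//")]
        for group in field.split("||")
    ]
```

Applied to the "zivilisiert" example, the result has two entries: the first
holds four analyses and the second holds the single analysis
``["zivilisiert", "Adj"]``.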


4.3 Field 3: pron
-----------------
This field contains the word's pronunciation and follows the principles set
forth in "docs/pron.txt". If a headword has multiple pronunciations, they are
listed sequentially on the same line, separated by "||". When multiple
pronunciations exist, each will have a distinct set of morphological analyses
(see 4.2) whose order mirrors the order presented in this field.


4.4 Field 4: stress
-------------------
This field contains information about primary word stress. Each syllable of
the word is represented by one digit: "0" for an unstressed syllable and "1"
for the stressed syllable. Some words taken from the CELEX German lexicon have
two primary stresses marked, following CELEX practice. Words from the CALLHOME
German corpus have only one stress marked per word.

When a headword has alternate pronunciations separated by "||", this field
lists the corresponding stress patterns, also separated by "||".
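
The correspondence between the pron, stress, and morph fields can be sketched
in Python as follows. All function names are invented for this example. The
sketch assumes that the three fields of an entry carry the same number of
"||"-separated variants; this README states that only for entries with
multiple pronunciations, so the function raises an error on a mismatch rather
than guessing at an alignment.

```python
from typing import List, Tuple


def split_variants(field: str) -> List[str]:
    """Split a field on "||" and trim surrounding whitespace."""
    return [part.strip() for part in field.split("||")]


def align_variants(morph: str, pron: str, stress: str) -> List[Tuple[str, str, str]]:
    """Pair each pronunciation with its stress pattern and morph analyses."""
    prons = split_variants(pron)
    stresses = split_variants(stress)
    morphs = split_variants(morph)
    if not (len(prons) == len(stresses) == len(morphs)):
        # Assumption: mismatched counts are reported, not silently aligned.
        raise ValueError("fields have different numbers of '||' variants")
    return list(zip(prons, stresses, morphs))


def stressed_syllables(stress: str) -> List[int]:
    """Return the 1-based positions of syllables marked "1" in one pattern."""
    return [i for i, digit in enumerate(stress, start=1) if digit == "1"]
```

For the stress value "10" given for "Gute" in section 4,
``stressed_syllables("10")`` returns ``[1]``, meaning the first of two
syllables is stressed.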


4.5 Field 5: celex
------------------
This field contains information about whether the headword appears in the
CELEX German lexicon. A "1" indicates that it is found in the lexicon, and a
"0" indicates that the word is not found.


4.6 Field 6: train_freq
-----------------------
This field provides the frequency of the headword in the 80 training transcripts
from the CALLHOME German corpus.

Words that have both an uppercase and a lowercase form (such as an adjective
and its derived noun) present a problem for frequency counts, since German
uses capitalization both to mark the beginning of a sentence and to mark all
nouns. This lexicon uses the count for the same-case form if it is non-zero,
but reports the capitalized count for a lowercase entry if the lowercase form
is not present in the transcripts.

This means that the frequency counts may be inaccurate for adjective-noun or
verb-noun pairs that differ only in initial capitalization. For example,
suppose the lexicon contains 2 entries, a noun ``Foo`` and a verb ``foo``, and
consider a few cases:

  -- both occur, and the verb never starts a sentence
     => frequency counts are correct
  -- only the noun occurs
     => the noun Foo has correct frequency, but the verb foo (incorrectly)
        gets the same frequency
  -- only the verb occurs, and never at the beginning of a sentence
     => frequency counts are correct
  -- only the verb occurs, and always at the beginning of a sentence
     => the verb foo has correct frequency, but the noun Foo (incorrectly)
        gets the same frequency
  -- sometimes the verb starts a sentence, sometimes not
     => only the lowercase occurrences are counted for the verb entry, while
        capitalized occurrences of the verb end up getting counted as
        occurrences of the noun

In other words, frequency counts for words that have both lowercase and
capitalized entries in the lexicon cannot be expected to be reliable. However,
if a word does occur in the transcripts, its frequency count is guaranteed to
be non-zero. (The converse is not true: a word may have non-zero count when
only the opposite-case word occurred in the transcripts.)
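
The counting rule described in this section can be restated as the Python
sketch below. This is a reconstruction from the prose above, not the script
used to build the lexicon. The function name is invented for this example, and
``token_counts`` is assumed to map each transcript token, exactly as spelled,
to its number of occurrences.

```python
from typing import Mapping


def entry_frequency(headword: str, token_counts: Mapping[str, int]) -> int:
    """Frequency reported for a lexicon entry under the rule described above."""
    own = token_counts.get(headword, 0)
    if own > 0:
        return own  # the same-case form occurs, so its count is used
    if headword[:1].islower():
        # Lowercase entry that never occurs as such: fall back to the
        # count of the capitalized form.
        capitalized = headword[:1].upper() + headword[1:]
        return token_counts.get(capitalized, 0)
    return 0
```

With ``token_counts = {"Foo": 3}``, both the entry ``Foo`` and the entry
``foo`` receive 3, which corresponds to the cases above where only the
capitalized form occurs in the transcripts.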

This problem could be solved by:
  - eliminating frequency information from the lexicon,
  - assigning part of speech to each ambiguous word in the transcripts,
  - using non-standard capitalization in the transcripts, or
  - collapsing the capitalized and lowercase lexicon entries,

but these solutions are either impractical or inconsistent with general
CALLHOME practice of using standard orthography.


4.7 Field 7: dev_freq
---------------------
This field contains the frequency of the headword in the 20 development
transcripts from the CALLHOME German corpus.


5. lexicon.dict
===============
This is a CMUdict format version of "lexicon.tsv". It consists of one
pronunciation per line, each line having the form:

    <WORD> <PRON>

where:

- WORD  --  orthographic representation of word
- PRON  --  a single pronunciation of the word, expressed as a space-delimited
  sequence of phone symbols
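
A file in this form can be loaded with a few lines of Python, as in the sketch
below. The function name ``read_dict`` is invented for this example. The
sketch assumes that a word with several pronunciations simply appears on
several lines under the same WORD. It does not handle comment lines or
numbered variant markers such as ``WORD(2)``, which some CMUdict-style files
use; whether this file uses either is not stated here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_dict(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each word to the list of its pronunciations (lists of phone symbols)."""
    prons: Dict[str, List[List[str]]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and lines without a pronunciation
        word, phones = parts[0], parts[1:]
        prons[word].append(phones)
    return dict(prons)
```

A typical call would pass an open file object, for example
``open("data/lexicon.dict", encoding="utf-8")``; the UTF-8 encoding is an
assumption carried over from lexicon.tsv.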


6. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant
    <nryant@ldc.upenn.edu>