CALLHOME Spanish Lexicon Second Edition

                               June 24, 2025

                          Linguistic Data Consortium


1. Overview
===========
This is an updated release of the CALLHOME Spanish Lexicon (LDC96L16). The
original CALLHOME lexicon was compiled by the Linguistic Data Consortium in
support of the project on Large Vocabulary Conversational Speech Recognition
(LVCSR), sponsored by the U.S. Department of Defense.

This re-release retains the same 45,547 words and associated information
(morphological, phonological, stress, lexical frequencies, etc.) from the
original release. However, the directory structure, file formats, and
documentation have been updated to modern standards.


2. Directory structure
======================
- data/lexicon.tsv  --  lexicon in TSV format
- data/lexicon.dict  --  a pronunciation dictionary derived from lexicon.tsv;
  in CMUdict format
- docs/file.tbl  --  listing of md5 checksums, sizes, dates, and file names
- docs/README.txt  --  this file; top-level documentation for the release
- docs/pron.txt  --  documents the conventions used by the pronunciation field
- docs/morph_tags.txt  --  documents the morphological tags used in the
  morphological analyses
- docs/preferences.txt  --  a list of hand-corrected pronunciations
- tools/g2p  --  grapheme to phoneme tools used to automatically generate
  pronunciations in the lexicon from the original release


3. lexicon.tsv
==============
This is a UTF-8 encoded TSV version of the lexicon originally distributed with
LDC96L16. It contains one entry per line, each consisting of nine tab-delimited
fields:

- headword  --  orthographic form (e.g., niño)
- morph  --  morphological analysis of the headword (e.g., niño+Noun+Masc+Sg)
- pron  --  pronunciation of the headword (e.g., niNo)
- stress  --  primary stress information of the word (e.g., 10)
- callh_freq  --  frequency of the headword in the 80 training transcripts of
  the CALLHOME Spanish corpus
- madrid_freq  --  frequency of the headword in Madrid Radio transcripts
- ap_freq  --  frequency of the headword in AP newswire
- reut_freq  --  frequency of the headword in Reuters newswire
- norte_freq  --  frequency of the headword in El Norte newswire

Each of these fields is described in more detail in the sections below.
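
As a rough illustration of the layout above, the following Python sketch splits
one line of "lexicon.tsv" into its nine named fields. The names FIELD_NAMES and
parse_entry are hypothetical helpers, not part of this release. The sketch
assumes the file has no header row, and the frequency values in the sample line
are invented placeholders.

```python
# Field names in the order listed above.
FIELD_NAMES = (
    "headword", "morph", "pron", "stress", "callh_freq",
    "madrid_freq", "ap_freq", "reut_freq", "norte_freq",
)


def parse_entry(line):
    """Split one tab-delimited lexicon line into a dict of field strings."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(fields)}"
        )
    # Values are kept as strings; converting them is left to the caller.
    return dict(zip(FIELD_NAMES, fields))


# Sample line: the first four fields come from the examples above, the
# five frequency values are made up for illustration.
sample = "niño\tniño+Noun+Masc+Sg\tniNo\t10\t1\t2\t3\t4\t5\n"
entry = parse_entry(sample)
print(entry["headword"], entry["pron"], entry["stress"])
```

When reading the real file, open it with encoding="utf-8" so that accented
headwords are decoded correctly.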


3.1 Field 1: headword
---------------------
The orthographic representation of a headword, using canonical capitalization,
i.e. proper names and such are capitalized.


3.2 Field 2: morph
------------------
This field contains the morphological analysis of the headword. Each
morphological analysis consists of a sequence of tags separated by "+"; e.g.:

    árabe+Adj+MF+Sg

If multiple morphological analyses exist, they are listed sequentially,
separated by "||"; e.g.:

    árabe+Adj+MF+Sg || árabe+Noun+MF+Sg

Note that 313 words are tagged as foreign (+For). These tags were automatically
generated by the pronunciation software and do not exclusively mark foreign
words. All such words have hand-corrected pronunciations in addition to the
automatically generated ones (see 3.3).

For a full listing of the tagset used, consult "docs/morph_tags.txt".
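
The structure of this field can be unpacked with a short Python sketch such as
the one below. The function names are hypothetical. The sketch assumes that the
first "+"-separated element of each analysis is the base form, as in the
examples above, and that "+" and "||" never occur inside a base form or tag.

```python
def parse_morph(field):
    """Return a list of (base_form, tags) pairs, one per analysis."""
    analyses = []
    for analysis in field.split("||"):
        # Whitespace around "||" is stripped before splitting on "+".
        base, *tags = analysis.strip().split("+")
        analyses.append((base, tags))
    return analyses


def has_foreign_tag(field):
    """True if any analysis in the field carries the For tag."""
    return any("For" in tags for _, tags in parse_morph(field))


print(parse_morph("árabe+Adj+MF+Sg || árabe+Noun+MF+Sg"))
```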


3.3 Field 3: pron
-----------------
This field contains the word's pronunciation and follows the principles set
forth in "docs/pron.txt". If a headword has multiple pronunciations, they are
listed sequentially on the same line, separated by "||".

Pronunciations were generated automatically, though 313 words in the lexicon
have been hand-corrected for pronunciation; these words are tagged as "+For" in
the morph field. This class includes not only foreign words, but also single
orthographic characters with no stressable vowel, some interjections,
hesitation sounds, and acronyms. All of these hand-corrected pronunciations are
listed in "docs/preferences.txt".


3.4 Field 4: stress
-------------------
This field contains information about primary word stress. Each syllable of
the word is indicated by a number, with unstressed syllables indicated by
"0" and the stressed syllable indicated by "1".  Only one stressed syllable
per word is indicated.

Alternate pronunciations separated by "||" also have corresponding alternate
stress separated by "||".
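
Because the pron and stress fields use the same "||" convention, alternates can
be paired by position. The Python sketch below shows one way to do this. The
function names are hypothetical, and the sketch assumes that the two fields
always list their alternates in the same order.

```python
def pair_pron_stress(pron_field, stress_field):
    """Pair each alternate pronunciation with its stress pattern."""
    prons = [p.strip() for p in pron_field.split("||")]
    stresses = [s.strip() for s in stress_field.split("||")]
    if len(prons) != len(stresses):
        raise ValueError("pron and stress have different numbers of alternates")
    return list(zip(prons, stresses))


def stressed_syllable(stress):
    """Return the 0-based index of the stressed syllable, or None if absent."""
    index = stress.find("1")
    return index if index >= 0 else None


# Values taken from the niño example in section 3.
for pron, stress in pair_pron_stress("niNo", "10"):
    print(pron, stress, stressed_syllable(stress))
```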


3.5 Field 5: callh_freq
-----------------------
This field provides the frequency of the headword in the 80 training transcripts
of the CALLHOME Spanish corpus. These are raw frequencies; i.e., not normalized
in any way. Frequencies are provided for alphabetic words only (i.e., excluding
punctuation and such) and ignore case.


3.6 Field 6: madrid_freq
------------------------
This field provides the frequency of the headword in Madrid Radio transcripts.


3.7 Field 7: ap_freq
--------------------
This field provides the frequency of the headword in Associated Press newswire
text, assembled from their Spanish language services in Argentina, Brazil,
Venezuela, and Puerto Rico.


3.8 Field 8: reut_freq
----------------------
This field provides the frequency of the headword in Reuters newswire text, from
Reuters Latin American Business Report, created in Brazil; and Reuters Spanish
Language Business Report, created in Argentina.


3.9 Field 9: norte_freq
-----------------------
This field provides the frequency of the headword in El Norte newswire from
Mexico.


4. lexicon.dict
===============
This is a CMUdict format version of "lexicon.tsv". It consists of one
pronunciation per line, each line having the form:

    <WORD>\t<PRON>

where:

- WORD  --  orthographic representation of word
- PRON  --  a single pronunciation of the word, expressed as a space-delimited
  sequence of phone symbols
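
A file in this form can be read into a word-to-pronunciations mapping with a few
lines of Python, as sketched below. The function name load_dict is hypothetical
and the phone symbols in the sample lines are invented for illustration. The
sketch assumes that a word with several pronunciations simply appears on
several lines, with no variant markers attached to WORD.

```python
from collections import defaultdict


def load_dict(lines):
    """Map each word to a list of pronunciations (lists of phone symbols)."""
    prons = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        word, pron = line.split("\t", 1)
        prons[word].append(pron.split())
    return dict(prons)


# Invented sample lines in <WORD>\t<PRON> form.
sample = ["niño\tn i N o\n", "casa\tk a s a\n"]
print(load_dict(sample))
```

To load the released file, pass an open file object instead of the sample list,
for example open("data/lexicon.dict", encoding="utf-8").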


5. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant
    <nryant@ldc.upenn.edu>