CALLHOME Spanish Lexicon
Second Edition
June 24, 2025
Linguistic Data Consortium


1. Overview
===========
This is an updated release of the CALLHOME Spanish Lexicon (LDC96L16). The
original CALLHOME lexicon was compiled by the Linguistic Data Consortium in
support of the project on Large Vocabulary Conversational Speech Recognition
(LVCSR), sponsored by the U.S. Department of Defense.

This re-release retains the same 45,547 words and associated information
(morphological, phonological, stress, lexical frequencies, etc.) from the
original release. However, the directory structure, file formats, and
documentation have been updated to modern standards.


2. Directory structure
======================
- data/lexicon.tsv -- lexicon in TSV format
- data/lexicon.dict -- a pronunciation dictionary derived from the lexicon;
  in CMUdict format
- docs/file.tbl -- listing of md5 checksums, sizes, dates, and file names
- docs/README.txt -- this file; top-level documentation of the release
- docs/pron.txt -- documents the conventions used by the pronunciation field
- docs/morph_tags.txt -- documents the morphological tags used in the
  morphological analyses
- docs/preferences.txt -- a list of hand-corrected pronunciations
- tools/g2p -- grapheme-to-phoneme tools used to automatically generate
  pronunciations in the lexicon from the original release


3. lexicon.tsv
==============
This is a UTF-8 encoded TSV version of the lexicon originally distributed
with LDC96L16.
It contains one entry per line, each consisting of nine tab-delimited fields:

- headword -- orthographic form (e.g., niño)
- morph -- morphological analysis of the headword (e.g., niño+Noun+Masc+Sg)
- pron -- pronunciation of the headword (e.g., niNo)
- stress -- primary stress information of the word (e.g., 10)
- callh_freq -- frequency of the headword in the 80 training transcripts of
  the CALLHOME Spanish corpus
- madrid_freq -- frequency of the headword in Madrid Radio transcripts
- ap_freq -- frequency of the headword in AP newswire
- reut_freq -- frequency of the headword in Reuters newswire
- norte_freq -- frequency of the headword in El Norte newswire

Each of these fields is described in more detail in the sections below.

3.1 Field 1: headword
---------------------
The orthographic representation of a headword, using canonical
capitalization; i.e., proper names and the like are capitalized.

3.2 Field 2: morph
------------------
This field contains the morphological analysis of the headword. Each
morphological analysis consists of a sequence of tags separated by "+"; e.g.:

    árabe+Adj+MF+Sg

If multiple morphological analyses exist, they are listed sequentially,
separated by "||"; e.g.:

    árabe+Adj+MF+Sg || árabe+Noun+MF+Sg

Note that 313 words are tagged as foreign (+For). These tags were generated
automatically by the pronunciation software, and the words carrying them are
not exclusively foreign words. All such words have hand-corrected
pronunciations in addition to the automatically generated ones (see 3.3).

For a full listing of the tagset used, consult "docs/morph_tags.txt".

3.3 Field 3: pron
-----------------
This field contains the word's pronunciation and follows the principles set
forth in "docs/pron.txt". If a headword has multiple pronunciations, they are
listed sequentially on the same line, separated by "||".
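As a rough illustration of the nine-field layout and the "||" convention
described above, the following Python sketch splits a single entry into named
fields. It is not part of this release: the function names (parse_entry,
split_alternatives) and the sample line are hypothetical, the frequency values
in the sample are placeholders rather than real counts, and the sketch assumes
it is given a data line (it does no header handling).

```python
# Field names as listed in section 3 of this README.
FIELDS = [
    "headword", "morph", "pron", "stress",
    "callh_freq", "madrid_freq", "ap_freq", "reut_freq", "norte_freq",
]


def split_alternatives(value):
    """Split a field on "||" and trim the whitespace around each alternative."""
    return [alt.strip() for alt in value.split("||")]


def parse_entry(line):
    """Parse one tab-delimited lexicon line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    entry = dict(zip(FIELDS, values))
    # Each morphological analysis becomes a list of "+"-separated parts.
    entry["morph"] = [alt.split("+") for alt in split_alternatives(entry["morph"])]
    # Each pronunciation alternative is kept as a string.
    entry["pron"] = split_alternatives(entry["pron"])
    return entry


# Hypothetical sample line built from the examples in section 3;
# the five frequency values are placeholders, not real counts.
sample = "\t".join(
    ["niño", "niño+Noun+Masc+Sg", "niNo", "10", "0", "0", "0", "0", "0"]
)

entry = parse_entry(sample)
print(entry["headword"], entry["morph"], entry["pron"])
print(split_alternatives("árabe+Adj+MF+Sg || árabe+Noun+MF+Sg"))
```

The stress and frequency fields are left as raw strings here so that the
sketch makes no assumptions about their contents beyond what is stated in
this document.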
Pronunciations were generated automatically, though 313 words in the lexicon
have been hand-corrected for pronunciation; these words are tagged as "+For"
in the morph field. This class includes not only foreign words, but also
single orthographic characters with no stressable vowel, some interjections,
hesitation sounds, and acronyms. All of these hand-corrected pronunciations
are listed in "docs/preferences.txt".

3.4 Field 4: stress
-------------------
This field contains information about primary word stress. Each syllable of
the word is represented by a digit, with unstressed syllables indicated by
"0" and the stressed syllable indicated by "1". Only one stressed syllable
per word is indicated. When the pron field lists alternate pronunciations
separated by "||", this field lists the corresponding stress patterns, also
separated by "||".

3.5 Field 5: callh_freq
-----------------------
This field provides the frequency of the headword in the 80 training
transcripts of the CALLHOME Spanish corpus. These are raw frequencies; i.e.,
not normalized in any way. Frequencies are provided for alphabetic words only
(i.e., excluding punctuation and the like) and ignore case.

3.6 Field 6: madrid_freq
------------------------
This field provides the frequency of the headword in Madrid Radio
transcripts.

3.7 Field 7: ap_freq
--------------------
This field provides the frequency of the headword in Associated Press
newswire text, assembled from their Spanish language services in Argentina,
Brazil, Venezuela, and Puerto Rico.

3.8 Field 8: reut_freq
----------------------
This field provides the frequency of the headword in Reuters newswire text,
from Reuters Latin American Business Report, created in Brazil; and Reuters
Spanish Language Business Report, created in Argentina.

3.9 Field 9: norte_freq
-----------------------
This field provides the frequency of the headword in El Norte newswire from
Mexico.


4. lexicon.dict
===============
This is a CMUdict format version of "lexicon.tsv".
It consists of one pronunciation per line, each line having the form:

    WORD\tPRON

where:

- WORD -- orthographic representation of the word
- PRON -- a single pronunciation of the word, expressed as a space-delimited
  sequence of phone symbols


5. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant