Garrett, Susan, T. Morton, and C. McLemore. 1997. LDC Spanish Lexicon.
Philadelphia: Linguistic Data Consortium, University of Pennsylvania.

-----------------------------------------------------------
Description of the LDC Spanish lexicon
-----------------------------------------------------------

CONTENTS

1. Summary abstract
2. Token coverage
3. Lexicon information fields
4. Orthographic conventions
5. Morphological tags
6. Pronunciation
7. Stress information
8. Frequency counts and corpora

-----------------------------------------------------------------------
1. Summary abstract

The LDC Spanish lexicon was compiled primarily to support the project on
Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by
the U.S. Department of Defense. The lexicon consists of 45,582 words and
contains separate information fields with morphological, phonological,
stress, and frequency information for each word.

-----------------------------------------------------------------------
2. Token coverage

The LDC Spanish lexicon covers 98.7% of the word tokens occurring in the
20 LDC Spanish Callhome devtest transcripts (10 minutes of conversation
each).

-----------------------------------------------------------------------
3. Lexicon information fields

The LDC Spanish lexicon contains nine tab-separated information fields:

    Field 1: orthographic form (headword)
    Field 2: morphological analysis of the headword
    Field 3: pronunciation of the headword
    Field 4: primary stress information of the word
    Field 5: frequency of the word in the 80 Callhome Spanish training
             transcripts
    Field 6: frequency of the word in Corpus Oral radio transcripts
    Field 7: frequency of the word in AP newswire
    Field 8: frequency of the word in Reuters newswire
    Field 9: frequency of the word in El Norte newswire

Each of these fields is described in more detail in the sections below.

-----------------------------------------------------------------------
4. Orthographic conventions

The orthographic representation has canonical capitalization, i.e.
proper names and such are capitalized.

-----------------------------------------------------------------------
5. Morphological tags

Following is a complete list of abbreviations used for morphological
information.

    +1P         1st person (verbs and pronouns)
    +2P         2nd person
    +3P         3rd person
    +Abrev      abbreviations
    +Acc        accusative (pronoun)
    +Acro       acronym
    +Adj        adjective
    +Adv        adverb
    +Affix      prefix or suffix transcribed with surrounding white space
    +Apoc       apocope, truncated forms
    +Art        article
    +Aum        augmentative
    +Card       marks cardinal numbers
    +Com        complement of preposition (pronoun)
    +Cond       conditional
    +Conj       conjunctions
    +Continent  continent names
    +Cort       courtesy phrase
    +Dat        dative
    +Def        definite (article)
    +Dem        demonstrative (pronouns)
    +Det        determiner
    +Dim        diminutive (nouns and adjectives)
    +FSubj      future subjunctive
    +Fem        feminine gender
    +For        foreign word (see below)
    +Fut        future
    +Gen        genitive (pronoun)
    +IInd       imperfect indicative
    +ISubj      imperfect subjunctive
    +Imp        imperative
    +Indef      indefinite (article)
    +Inf        infinitive
    +Interj     interjections
    +Interrog   interrogative, on some pronouns
    +Let        letters (letter names and literal letters)
    +Lit        literal letters
    +Loc        location
    +Lugar      place
    +MF         masculine/feminine, invariant for gender
    +Masc       masculine gender, on nouns, adjectives, pronouns
    +NP         acts as full noun phrase
    +Neut       neutral, for some pronouns
    +Nom        nominative (pronoun)
    +Noun       nouns
    +Num        number
    +Obl        oblique (pronoun)
    +Onom       onomatopoeia
    +Org        organization
    +PInd       present indicative, for verbs
    +PP         acts as full prepositional phrase
    +PSubj      present subjunctive
    +PastPart   past participle
    +Pay        country
    +Perf       perfect (indicative)
    +Pl         plural number
    +Pluperf    pluperfect
    +Poss       possessive (pronoun)
    +PostDet    postdeterminer
    +PreDet     predeterminer
    +Prep       prepositions
    +PresPart   present participle
    +Pron       pronoun
    +Prop       proper nouns
    +Quant      quantifier
    +Reas       reason (interrogatives)
    +Ref        reflexive (pronoun)
    +Rel        relative (pronoun)
    +SP         singular/plural, invariant for number
    +Sg         singular number, on nouns, adjectives, pronouns
    +Soc        company name
    +Sup        superlative (adjective)
    +Temp       temporal (interrogatives)
    +Titl       title
    +Usastate   state in USA
    +Var        dialectal or stylistic variant
    +Verb       verb
    +Zodiac     zodiac names

    |           indicates the boundary between POS and a clitic. Also
                used for adverbs formed from adjectives via addition of
                a suffix (usually "-mente", tagged as +Adj|Adv)
    //          used to separate alternate analyses
    _           used in four cases to separate constituent words of a
                headword given in contracted form (mija, mijita,
                mijito, mijo).

The 313 words tagged as +For in the morphological field are strictly
defined as those whose pronunciations are not automatically generated by
the pronunciation software described below. They are not exclusively
foreign words.

-----------------------------------------------------------------------
6. Pronunciation

Following is the allophone set for (the Mexican reference dialect of)
Spanish used in the pronunciation field of the lexicon:

    a    IPA script a
    i    IPA i
    e    IPA e
    o    IPA o
    u    IPA u
    h    IPA h
    p    IPA p
    b    IPA b
    B    IPA beta (voiced bilabial fricative)
    f    IPA f
    v    IPA v
    l    IPA l
    m    IPA m
    w    IPA w
    t    IPA t
    d    IPA d
    D    IPA eth (voiced dental fricative) - or approximant rather than
         fricative
    s    IPA s
    S    IPA voiceless postalveolar fricative
    C    IPA t esh (voiceless postalveolar affricate)
    J    IPA voiced postalveolar affricate
    n    IPA n
    y    IPA j (palatal glide)
    r    coronal tap
    R    coronal trill
    x    IPA c cedilla (laminopalatal) or x (velar) - fricative
    N    IPA left-hook-bottom n (palatal nasal)
    k    IPA k
    g    IPA g
    G    IPA gamma (velar fricative) - or approximant
    9    IPA right-hook-bottom n (velar nasal) (not yet used)
    z    IPA z

    //   used to separate alternate pronunciations

Except for the class of exceptions noted below, pronunciations given in
field three of the lexicon have been generated automatically from the
orthography using the following software and accompanying file, which
are available from the LDC:

    spron.pl
    basic_rules

The class of exceptions, 313 words in the lexicon, has been
hand-corrected for pronunciation and tagged as "For" in the
morphological field. This class includes not only foreign words, but
also single orthographic characters with no stressable vowel; some
interjections, hesitation sounds, and acronyms; and words that have
alternate pronunciations separated by //. All of these hand-corrected
pronunciations are listed in the accompanying file:

    preferences

-----------------------------------------------------------------------
7. Stress information

The fourth field in the lexicon contains information about primary word
stress. Each syllable of the word is indicated by a number, with
unstressed syllables indicated by "0" and the stressed syllable
indicated by "1". Only one stressed syllable per word is indicated.
Alternate pronunciations separated by // also have corresponding
alternate stress patterns separated by //.

-----------------------------------------------------------------------
8. Frequency counts and corpora

The frequency counts shown in fields 5-9 of the lexicon are raw, i.e.
not normalized in any way; they are counts of orthographic forms
regardless of capitalization. Only alphabetic words from the corpora
were counted, i.e. punctuation and similar material was excluded.

A brief description of the corpora from which frequency counts were
taken is as follows.

    Field 5: 80 transcripts: 143,394 words from the 80 LDC Spanish
             Callhome telephone speech corpus transcripts
    Field 6: Corpus Oral: 941,199 words of Madrid Radio transcripts
    Field 7: AP newswire: 8,429,549 words of Associated Press newswire
             text, assembled from their Spanish language services in
             Argentina, Brazil, Venezuela, and Puerto Rico
    Field 8: Reuters: 18,742,153 words of Reuters newswire text, from
             Reuters Latin American Business Report, created in Brazil,
             and Reuters Spanish Language Business Report, created in
             Argentina
    Field 9: El Norte: 29,745,911 words of El Norte newswire, from
             Mexico
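
-----------------------------------------------------------------------
Example: reading a lexicon entry (illustrative sketch)

The nine tab-separated fields described in section 3, the // convention
for alternates, and the 0/1 stress encoding of section 7 could be read
with a short script along the following lines. This is only a sketch and
is not part of the LDC distribution. The names LexiconEntry, parse_entry,
split_alternates, and stressed_syllable are hypothetical, as is the
sample line in the usage note. The sketch assumes that the stress field
is a run of digits with one digit per syllable (for example "10") and
that every frequency field holds an integer; neither detail is confirmed
by the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical labels for frequency fields 5-9, in file order.
FREQ_SOURCES = ("callhome", "corpus_oral", "ap", "reuters", "el_norte")


@dataclass
class LexiconEntry:
    headword: str                 # field 1
    analyses: List[str]           # field 2, split on //
    pronunciations: List[str]     # field 3, split on //
    stress_patterns: List[str]    # field 4, split on //
    frequencies: Dict[str, int]   # fields 5-9, raw counts


def split_alternates(field: str) -> List[str]:
    """Split a field on the // separator and trim surrounding spaces."""
    return [alt.strip() for alt in field.split("//")]


def parse_entry(line: str) -> LexiconEntry:
    """Parse one tab-separated lexicon line into a LexiconEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, found {len(fields)}")
    headword, morph, pron, stress = fields[:4]
    # Assumption: each frequency field is a plain integer.
    counts = [int(value) for value in fields[4:]]
    return LexiconEntry(
        headword=headword,
        analyses=split_alternates(morph),
        pronunciations=split_alternates(pron),
        stress_patterns=split_alternates(stress),
        frequencies=dict(zip(FREQ_SOURCES, counts)),
    )


def stressed_syllable(pattern: str) -> Optional[int]:
    """Return the 1-based position of the syllable marked "1".

    Assumption: one digit per syllable. Returns None when no syllable
    is marked as stressed.
    """
    index = pattern.find("1")
    return index + 1 if index >= 0 else None
```

A call such as parse_entry("casa\t+Noun+Fem+Sg\tk a s a\t10\t1\t2\t3\t4\t5")
uses an invented line (the tag string, phone spacing, and counts are
made up for illustration) and yields an entry whose single stress
pattern "10" marks the first syllable as stressed. Keeping alternates as
parallel lists lets the Nth pronunciation be paired with the Nth stress
pattern, as section 7 describes.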