Huang, Shudong, X. Bian, G. Wu, and C. McLemore. 1997. LDC Mandarin Lexicon.
Philadelphia: Linguistic Data Consortium, University of Pennsylvania.

-----------------------------------------------------------
Description of the LDC Mandarin lexicon
-----------------------------------------------------------

CONTENTS

1. Summary abstract
2. Token coverage
3. Lexicon information fields
4. Orthographic convention
5. Phonology table
6. Frequency counts and corpora
7. Part of Speech Tags
8. Word Segmentation

-----------------------------------------------------------------------

1. Summary abstract

The LDC Mandarin lexicon was compiled primarily to support the project on
Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by
the U.S. Department of Defense.

The LDC Mandarin lexicon consists of 44,405 words, and contains separate
information fields with phonological, morphological, and frequency
information for each word.

-----------------------------------------------------------------------

2. Token coverage

The LDC Mandarin lexicon covers 98% of the word tokens occurring in the
20 LDC Mandarin Callhome devtest transcripts (10 minutes of conversation
each).

-----------------------------------------------------------------------

3. Lexicon information fields

The LDC Mandarin lexicon contains nine tab-separated information fields:

Field 1: orthographic form in hanzi (headword)
Field 2: headword in tone pinyin
Field 3: tone sequence only of headword
Field 4: allophonic representation of pronunciation, without tone
Field 5: frequency of the word in 3,431,707 words of Xinhua newswire
Field 6: frequency of the word in the 155,276 words of the 80 Callhome
         training transcripts
Field 7: tag indicating part of speech

Each of these fields is described in the sections below.

-----------------------------------------------------------------------

4. Orthographic convention

Orthographic Chinese characters are GB-encoded, and are simplified in the
Mainland style.

Field 2 gives a representation of the headword in tone pinyin with
strictly lexical tone, i.e. not reflecting phonetic/phonological
processes.

Field 3 shows tones only. It differs from the representation in field 2
in that tone sandhi has been applied. In disyllabic words, where sandhi
is a regular process, it has been applied throughout: the sequence 3 3
becomes 2 3. For all longer words, the application of tone sandhi has
been determined by hand, judging from word structure. (For words that
have sequences of tone 3 and more than 2 syllables, all but the rightmost
tone may change to 2, e.g., 2 2 2 3, in fast speech; but in careful
speech, the application of tone sandhi depends on syntactic constituency.
The careful speech pronunciation is the one shown.)

-----------------------------------------------------------------------

5. Phonology table.

The representation of citation pronunciation given in field 4 follows
this allophonic table:

Allophone  Pinyin (standard, citational)           IPA or description

b          b                                       p
p          p                                       aspirated p
m          m                                       m
d          d                                       t
t          t                                       aspirated t
l          l                                       l
n          n                                       n
g          g                                       k
k          k                                       aspirated k
h          h                                       laryngeal or velar fricative
N          ng                                      velar nasal
z          z                                       dental affricate (ts)
c          c                                       aspirated dental affricate ts
s          s                                       s
j          j                                       palatal affricate (tS)
q          q                                       aspirated palatal affricate
x          x                                       voiceless palatal fricative
r          r                                       retroflex r
Z          zh                                      retroflex affricate
C          ch                                      aspirated retroflex affricate
S          sh                                      voiceless retroflex fricative
f          f                                       f
y          y, i                                    j
w          w, u                                    w
W          u                                       high front rounded glide
i          -i, -in, -ing                           i
I          ci, si, zi                              barred i
%          chi, shi, zhi, ri                       retroflex i
e          -ei, -ui                                e
E          -ian, -ie, -u:e, ye, yan                lower-mid front unrounded (IPA epsilon)
U          ju, lu:, nu:, qu, xu, yu                high front rounded (IPA y)
&          -en, -eng, -e, -un, -er                 schwa, mid central unrounded
a          -a, -ang, -ao, -iao, -iang, -ua, -uang  back allophone of /a/
@          -ai, -an, -uai, -uan                    front allophone of /a/
o          -o, -ou, -iong, yong, -iu, -ong         mid back round (IPA o)
>          -uo, wo                                 low back round between /o/ and /&/ (IPA open o)
u          -u                                      high back rounded (IPA u)
R          er                                      rhotic schwa

-----------------------------------------------------------------------

6. Frequency counts and corpora.

Field 5 of the lexicon shows the frequency of the word in 3,431,707 words
of Xinhua newswire. Note, however, that this newswire text was not
segmented with the automatic segmenter described in section 8 below; word
segmentation in the newswire text may differ rather dramatically from
word segmentation in the transcripts, affecting frequency counts.

Field 6 of the lexicon shows the frequency of the word in the 155,276
words of the 80 Callhome training transcripts.

The frequency counts are keyed to the representation of the word in
Chinese characters, not to its pronunciation. Where a single character,
or character sequence, has multiple pronunciations, the same frequency
count is given for every entry. For example, the same character
represents 5 distinct pronunciations of "a"; for each of those 5 separate
entries, the frequencies 31 and 3705 are given. In fact, those
occurrences are distributed over the 5 different pronunciations of the
character in some way that is not identified.

-----------------------------------------------------------------------

7. Part of speech tags.

Tagging conventions for indicating part of speech are given below.

/               following a tag by itself, indicates that there are
                additional unspecified categories in terms of which this
                word can be used; between tags, indicates alternative
                categories that depend on context of use.

                NOTE that the sequence noun/verb/noun occurs (though
                rarely); in this case we have allowed duplicate
                categories for words which function as two distinct
                nouns, one which is closely related to the verb, and one
                that is completely different.
acronym         acronyms
adj             adjectives
adj_r           reduplicated adjectives
adv             adverbs
adv_r           reduplicated adverbs
affix           affixes
class           classifiers
class_r         reduplicated classifiers
conj            conjunctions
for_name        all foreign ("Mandarinized") proper nouns, including
                place names and personal names
interj          interjections
name            proper nouns, including personal names and some
                unsegmented well-known full names (not surnames)
name_seg        unsegmented full names (personal and surname) (note that
                this category overlaps with the "name" category to some
                extent)
name_affix      personal name with an affix, e.g. prefix "a"
noun            nouns, and common, conventionalized noun phrases
number          numbers
number_class    classifiers attached to numbers
onom            onomatopoeic words
part            particles other than those below (part_*)
part_struc      structural particles
part_asp        aspect markers
part_final      sentence-final particles
phrase          idiomatic or conventionalized phrase, e.g., like English
                "that is"
prep            prepositions
pro             pronouns, demonstratives, "wh"-words
surname         surnames
surname_affix   surname with prefix or suffix, e.g. familiarity morpheme
verb            verbs
verb_r          reduplicated verbs

-----------------------------------------------------------------------

8. Word segmentation.

Word segmentation principles for Mandarin were formulated by Shudong
Huang at the Linguistic Data Consortium, with input from Xuejun Bian and
Cynthia McLemore, and in subsequent collaboration with LVCSR Callhome
contractors and other interested parties (especially Dragon, BBN, IBM,
TI, NIST, and Bell Labs). A primary source of information on Chinese
segmentation issues was the following:

"Contemporary Chinese Language Word Segmentation Specification for
Information Processing," published by the State Bureau of Technology
Supervision, Beijing China, October 14, 1992.

Principles guiding word segmentation can be found in the file
"segmentation.principles".

The Callhome Mandarin transcripts were automatically segmented with the
Dragon Mandarin segmenter, which uses the LDC principles as stated in the
"segmentation.principles" file. Further information on the Dragon
segmenter, provided by Dean Bandes at Dragon Systems Inc., follows:

The Dragon Mandarin Segmenter attempts to break a string of Chinese
characters (in the GB encoding) into the most likely sequence of words in
its lexicon and unknown words. Characters which do not fit into known
words are output as unknown single-character words. In general, longer
words are preferred, but not at the expense of introducing new unknown
single-character words.

An analogy is the case of a sign maker who has a stock of strings of
letters with a cost associated to each string, who wants to produce a
given sign for the minimum cost. The cost of each string is based on its
frequency (supply and demand!), and the cost of the entire sign is the
sum of the cost of the strings plus a relatively large cost per string
(labor to put them together). All letters are available individually, but
to save on labor cost the sign maker will choose not to use them if the
sign can be made of existing combinations of letters.

Simply proceeding from the beginning of the sign and choosing the longest
available string may not produce the least expensive sign, as one may
overshoot a preferable string; for instance, if the desired sign were
"No Parking" and the stock of strings were "No ", "Park", "Par", "king",
"i", "n", and "g", always choosing the longest would give
"No " + "Park" + "i" + "n" + "g", which would be more expensive than
"No " + "Par" + "king". Simply starting at the end and working backwards
always choosing the longest word won't work any better.
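
The sign-maker analogy above can be made concrete with a small sketch.
This is not the Dragon segmenter's code: the counts, the per-string
charge, the cost formula (negative log relative frequency plus a fixed
charge) and the function names are all assumptions chosen for
illustration. Only the stock of strings and the "No Parking" sign come
from the text. The sketch compares left-to-right longest match with a
minimum-cost search over string positions.

```python
import math

# Hypothetical frequency counts; only the strings themselves come from
# the "No Parking" example in the text.
STOCK = {"No ": 50, "Park": 40, "Par": 20, "king": 30, "i": 5, "n": 5, "g": 5}
PER_STRING_COST = 10.0  # assumed "labor" charge added for every string used


def string_cost(piece):
    """Assumed cost model: rarer strings cost more, plus a fixed charge."""
    total = sum(STOCK.values())
    return -math.log(STOCK[piece] / total) + PER_STRING_COST


def greedy_longest(text):
    """Work left to right, always taking the longest stock string that fits."""
    pieces, pos = [], 0
    while pos < len(text):
        fits = [s for s in STOCK if text.startswith(s, pos)]
        if not fits:
            raise ValueError("no stock string fits at position %d" % pos)
        piece = max(fits, key=len)
        pieces.append(piece)
        pos += len(piece)
    return pieces


def min_cost(text):
    """Cheapest way to build the text, by dynamic programming over positions."""
    # best[i] holds (cost, pieces) for the cheapest known way to build text[:i].
    best = [(0.0, [])] + [(math.inf, None)] * len(text)
    for start in range(len(text)):
        cost_so_far, pieces = best[start]
        if pieces is None:  # position not reachable with the stock
            continue
        for s in STOCK:
            if text.startswith(s, start):
                end = start + len(s)
                cost = cost_so_far + string_cost(s)
                if cost < best[end][0]:
                    best[end] = (cost, pieces + [s])
    return best[len(text)][1]  # None if the text cannot be built


if __name__ == "__main__":
    print(greedy_longest("No Parking"))
    print(min_cost("No Parking"))
```

Under these assumed costs, longest match uses five strings
("No ", "Park", "i", "n", "g") while the minimum-cost search uses three
("No ", "Par", "king"), because the per-string charge outweighs the
differences in frequency-based cost.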

The low-cost segmentation is either a single lexicon entry or the
combination of the lowest-cost segmentations of two substrings, and thus
could in principle be found recursively, trying segmentations of
substrings at each break point; but this would require a lot of
duplicated effort. The problem may be solved much more efficiently by
standard dynamic programming methods, and that's what the Dragon
Segmenter does. Basically, it remembers the lowest-cost way to segment
text up to each character in turn, looking ahead for matches and noting
the cost to the end of each matching string if it is a new low-cost way
to that character.

The input lexicon must have the format [] -- that is, the pinyin is
optional but the count (frequency) is required; and the word with the
largest count must be first.

Please contact Dean Bandes at Dragon Systems for further information or
to obtain a copy of the program.

Dragon Systems
320 Nevada Street
Newton MA 02160
Phone: (617) 965-5200 x221
Fax: (617) 244-3899
e-mail: deanb@dragonsys.com

-----------------------------------------------------------------------