March 1995 Draft

************************************************************
General Principles for Chinese Segmentation
************************************************************

The following principles were formulated by Shudong Huang at the
Linguistic Data Consortium, with input from Xuejun Bian and Cynthia
McLemore. A primary source of information on Chinese segmentation
issues was the following: "Contemporary Chinese Language Word
Segmentation Specification for Information Processing," published by
the State Bureau of Technology Supervision, Beijing, China, October 14,
1992.

A. NOUNS

A.1 Common Nouns

A.1.1 Non-monomorphemic NPs are segmented only if the segmentation
does not affect the meaning of the noun.

    Not segmented:
        huo3che1                train
        niu2rou4                beef
        bei4zi3zhi2wu4          angiosperm
    Segmented:
        lu:4 ye4                green leaf
        xiao3 chuang2           small bed

A.1.2 Prefix-N (or any other word class) and N (or any other word
class)-suffix are not segmented.

    lao3ying1               eagle
    chao1sheng1bo1          super-sonic
    yi3zi5                  chair
    ke1xue2jia1             scientist

    Examples of prefixes: a1, lao3, fei1...
    Examples of suffixes: jia1, shou3, hua4...

A.1.3 Basic scientific terms are not segmented.

    jia1su4du4              acceleration
    zhong1yang1chu3li3qi4   CPU

A.1.4 Directional words are separated from preceding nouns.

    zhuo1zi5 shang5         on the table
    chang2jiang1 yi3bei3    north of Yangtze

A.1.5 The plural morpheme 'men' is listed as a single unit except for
the word 'ren2men5' ('people') and some idiomatic expressions such as

    ger1men5                brotherhood
    ye2menr5                man

A.1.6 Nouns for months are not segmented into their component
morphemes (i.e., number+'moon').

A.1.7 Nouns for counting dates in a month are not segmented (i.e.,
number+'sun' or number+hao4).

A.1.8 Nouns for counting days in a week are not segmented.

A.1.9 Nouns for counting years are not segmented (i.e., number+year).

A.1.10 Nouns for counting hours, minutes, and seconds are segmented
(i.e., number + hour/minute/second).

A.1.11 Time prefixes are not separated from the nouns.
    qian2tian1              day before yesterday
    hou4nian2               year after next
    da4hou4tian1            two days after tomorrow

A.2 Proper Nouns

A.2.1 Family names are segmented from given names.

A.2.2 Foreign names are segmented according to the segmentation of the
original language.

    ka3er3 ma3ke4si1        Karl Marx

A.2.3 Titles are separated from names.

    zhang1 jiao4shou4       Professor Zhang

A.2.4 Familiarity morphemes and honorific morphemes are not separated
from the name to which they are attached.

    lao3-                   old
    xiao3-                  little
    -lao3                   old
    -zong3                  chief

A.2.5 Words used to signal relative sequence in kinship terms are
segmented.

    da4 ge1                 elder brother
    san1 shu1               third uncle

A.2.6 Place names, institution names, and nationality names are not
segmented, including words with certain suffixes as exemplified below.

    -sheng3                 province
    -xian4                  county
    -zu2                    nationality
    -shan1                  mountain

    Bei3jing1shi4           Beijing City
    huang2he2               Yellow River

A.2.7 The proper noun portion of a commercial product name is
separated from the common noun part(s).

    yong3jiu3 pai2 zi4xing2che1     Yongjiu brand bicycle

B. VERBS

B.1 Reduplicated mono-syllabic and multi-syllabic verbs are not
segmented.

    kan4kan4                have a look
    lai2lai2wang3wang3      come and go

B.2 Negation markers are separated from the verb, except for some
conventionalized cases where the negation marker combines with the
verb to form a meaning other than simple negation.

B.3 A-not-A is segmented unless the segmentation results in an
incomplete word structure.

    Segmented:
        xiang1xin4 bu4 xiang1xin4   believe or not
    Not segmented:
        xiang1bu4xiang1xin4         believe it or not

B.4 Bi-syllabic verb-object VPs are not segmented if they are
collocationally linked. If the relationship is rather loose, or if
there are many parallel VPs, or if they are separated by other
constituents, they are segmented.
    Not segmented:
        kai1hui4                have a meeting
        chi1fan4                have a meal
    Segmented:
        chi1 yu2                eat fish
        xie3 xin4               write a letter
        chi1 liang3 dun4 fan4   have two meals

B.5 Bi-syllabic verb-complement compounds are not segmented if they
are collocationally linked. If the VP has three or more syllables, the
verb is separated from the complement.

    Not segmented:
        da3dao3                 down with
        ti2gao1                 raise
    Segmented:
        shuo1 qing1chu3         say clearly

B.6 Compound verbs in the potential forms are segmented, except for
those that can only occur in the potential forms.

    Segmented:
        da3 de5 dao3            able to beat down
        ti2 bu4 gao1            fail to raise
    Not segmented:
        mai3de5qi3              afford to buy

B.7 Modifier-Head verbs are left unsegmented only if they are used
together frequently or have a specific meaning beyond the
combinatorial meaning.

    Not segmented:
        hu2nao4                 run wild
        si3ji4                  memorize mechanically
    Segmented:
        zao3 lai2               come early
        chong2 shuo1            say again

B.8 Compound directional verbs are not segmented unless they occur in
the potential forms.

    Not segmented:
        chu1qu5                 go out; out
        jin4lai5                come in; in
    Segmented:
        chu1 de5 qu4            able to go out
        chu1 bu2 qu4            cannot go out

B.9 Directional postverbs are separated from the main verb.

    ji4 lai2                post arrive (mail sent to speaker)
    pao3 chu1qu5            run out

B.10 Verbs in the serial construction without conjunctions are
segmented if the original meaning of each component is not changed.

C. ADJECTIVES

C.1 Reduplicated adjectives of the form AA, AABB, ABB, AAB, A-li3-AB
are not segmented.

    da4da4                  big
    gao1gao1xing4xing4      joyful

C.2 Complex adjectives of the form yi1-A-yi1-B, yi1-A-er4-B,
ban4-A-ban4-B, ban4-A-bu4-B, you3-A-you3-B are not segmented.

    yi1xin1yi1yi4           whole-hearted
    you3tiao2you3li3        neat

C.3 Compound adjectives are segmented unless the compound has acquired
a new meaning or changed word class.

    Segmented:
        da4 xiao3 chi3cun4      big and small sizes
    Not segmented:
        da4xiao3                size

C.4 Complex color adjectives are not segmented.
    qian3huang2             light-yellow
    gan3lan3lu:4            olive-green

C.5 Negation markers are separated from adjectives, except for
conventionalized cases in which the negation marker combines with the
adjective to form a meaning other than simple negation.

D. PRONOUNS

D.1 Plural pronouns (singular pronoun+men2) are not segmented.

D.2 Demonstratives, including wh-word-classifiers, are not segmented
if the classifier is from the following set:

    ge4  xie1  yang4  me5  li3  bian1

Otherwise they are segmented, including any intervening numbers.

    Segmented:
        zhe4 shi2 tian1         these ten days

D.3 Wh-pronouns are treated as single units.

D.4 The following quantifying and demonstrative words are separated
from the neighboring classifiers.

    ge4                     each, every
    mei3                    each, every
    mou3                    some
    ben3                    this
    gai1                    this, that
    quan2                   all

E. NUMBERS

E.1 Cardinal numbers 1-100 are all listed as unitary in the lexicon.

E.2 Cardinal numbers that are multiples of 100, 1000, 10000, and
100000000 are listed as unitary in the lexicon.

E.3 The number 1,000,000 is separated from the preceding number, and
its multiples are not listed in the lexicon.

    yi1 bai3wan4            one million
    san1 bai3wan4           three million

E.4 The word for expressing ordinality, 'di4', is separated from any
following number. (Conventionalized phrases like "First World War" are
exceptions to this principle, and are listed as unsegmented in the
lexicon.)

E.5 Numbers are separated from classifiers.

E.6 The following words denoting approximate numerical counts are not
segmented from the preceding numbers:

    -duo1  -lai2  -ji3

(Similar to English "thirty-something," "thirty-odd," etc.) However,
the following words are separated from the numbers they modify:

    jin4  yue1  shu4

(Similar to English "about thirty.")

E.7 The words 'cheng2' and 'shang4' are not separated from the number
following them.
E.8 Words denoting degree, following adjectives or verbs (like English
"a little"), are separated from the preceding word; these include,
e.g.,

    xie1  yi4xie1  dian3er2  yi1dian3er2

E.9 The fraction word 'fen1zhi1' is treated as a single unit. The
percentage fraction expression 'bai3fen1zhi1' is also treated as a
single unit.

F. CLASSIFIERS

F.1 Classifiers are usually separated from preceding numbers. However,
the number 1 is not segmented from the following classifier if the
expression does not mean counting, e.g.,

    mei3 yi1ge5             each one (anything)
    nei3 yi1ge5             that one (anything)

This means that the sequence 'yi1-classifier' may have two entries in
the lexicon.

F.2 Reduplicated classifiers are not segmented.

F.3 Compound and complex classifiers are not segmented.

F.4 The following compound words are not segmented:

    kuai4qian2  fen1zhong1  dian3zhong1  miao3zhong1

G. ADVERBS

G.1 Adverbs are separated from any word they modify.

G.2 The following adverbial phrases are treated as single units.

    yue4lai2yue4            more and more
    bu4de2bu4               have to
    bu4neng2bu4             cannot but

H. Function words including compound/complex forms of the following
classes are treated as single units and separated from any word with
which they occur:

H.1 Preposition/Postposition

H.2 Structure markers de5, de5, de5 and zhi1.

H.3 Aspect markers zhe5, le5 and guo5.

H.4 Pre-verbal structure marker suo3.

H.5 Interjections

H.6 Question/mode markers

H.7 Conjunctions

H.8 Onomatopoeic words

I. MISC PHRASAL PRINCIPLES

I.1 Four-character idioms/proverbs are not segmented.

I.2 Four-character phrases that are relatively stable are not
segmented.

I.3 Proverbs that have five or more characters are segmented according
to the principles above unless the segmentation would change the
meaning.

I.4 Acronyms are not segmented.

I.5 R-suffixed words are not segmented.

I.6 Words that are phonetically borrowed from foreign languages are
not segmented.

    ji2pu3                  jeep
    qiao3ke4li4             chocolate
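
Several principles above refer to units being "listed in the lexicon."
As a rough illustration only, the sketch below shows one way such a
lexicon could be applied to a sequence of pinyin syllables. The greedy
longest-match lookup, the name `segment`, and the toy `LEXICON` are all
assumptions made for this example; the principles themselves do not
prescribe any algorithm, and a real lexicon would be far larger.

```python
# Hypothetical toy lexicon: a few units that the principles above list
# as "not segmented", stored as tuples of pinyin syllables.
LEXICON = {
    ("huo3", "che1"),         # A.1.1  train
    ("kai1", "hui4"),         # B.4    have a meeting
    ("chi1", "fan4"),         # B.4    have a meal
    ("ji2", "pu3"),           # I.6    jeep
    ("qiao3", "ke4", "li4"),  # I.6    chocolate
}
MAX_LEN = max(len(entry) for entry in LEXICON)


def segment(syllables):
    """Greedy longest-match segmentation (an assumed strategy).

    At each position, try the longest candidate first; a single
    syllable is always accepted as a fallback word.
    """
    words = []
    i = 0
    while i < len(syllables):
        longest = min(MAX_LEN, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in LEXICON:
                words.append("".join(candidate))
                i += n
                break
    return words


if __name__ == "__main__":
    # B.4: collocationally linked verb-object stays together.
    print(segment(["chi1", "fan4"]))
    # B.4: verb and object separated by other constituents are split.
    print(segment(["chi1", "liang3", "dun4", "fan4"]))
    # I.6: a phonetic borrowing stays together after a verb.
    print(segment(["chi1", "qiao3", "ke4", "li4"]))
```

A lookup of this kind covers only the cases that can be decided by
listing. Principles that depend on meaning or context (for example A.1.1
or B.7) would still need human judgment or additional rules.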