Chinese <-> English Name Entity Lists (v1.0)

Shudong Huang
Linguistic Data Consortium
December, 2002

This is a release of the Chinese-English bi-directional name entity lists, compiled from Xinhua's database.

In this release, not every irregularity in the original source has been detected and normalized. Some Chinese characters are not encoded in the source, and brackets are used to describe their composition. Except for the person name lists, we kept most such cases untouched in the created lists. We made an effort to replace GB-encoded characters (such as Roman numerals) in the English translations with ASCII characters, but no effort has been made to do the opposite for Chinese names.

Another issue is the use of slashes as delimiters. Some names may have internal slashes. Initially, we used "" to enclose a name with an internal slash to avoid confusion, without realizing that there is just one " in ASCII (as opposed to a pair of enclosing " in GB); later we decided to use &slash; instead. We'll need to change some earlier lists for consistency.

Finally, most of the English names in the source are in lower case throughout. We made an effort to capitalize the initial letter (and possibly some middle ones) for person names, but not for any other kind of name, as most other names have multiple words, some of which may be articles and prepositions.

Note also that the word "English" is somewhat misleading here. Although most of the foreign words are English or can appear in English texts, there are also many non-English words written in the Roman alphabet, some of which have English equivalents while others do not. No effort has been made to eliminate those non-English names where English equivalents are available.

The entire set consists of 9 pairs of lists. The English->Chinese version of each pair was created by reversing the Chinese->English version; both are sorted by the Unix built-in sort function.
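
The slash conventions described above can be illustrated with a small parsing sketch. It is not part of the release and rests on several assumptions: that each line has the shape "name /translation/.../ optional-region" as in the examples in this file, that the first and last "/" on a line bound the translation field, and that &slash; may occur in any field. The names Entry and parse_entry are hypothetical, and earlier lists that still use "" around names with internal slashes are not handled.

```python
from typing import List, NamedTuple, Optional

# Escape that the lists use for a literal "/" inside a name.
SLASH_ESCAPE = "&slash;"


class Entry(NamedTuple):
    source: str               # first field: the name being translated
    translations: List[str]   # second field: one or more translations
    region: Optional[str]     # text after the closing slash, if any


def _unescape(text: str) -> str:
    """Turn the &slash; escape back into a literal slash."""
    return text.replace(SLASH_ESCAPE, "/")


def parse_entry(line: str) -> Entry:
    """Split one list line into its fields (assumed layout, see lead-in)."""
    first = line.find("/")
    last = line.rfind("/")
    if first == -1 or first == last:
        raise ValueError("expected a /.../ delimited translation field: %r" % line)

    source = line[:first].strip()
    middle = line[first + 1:last]
    rest = line[last + 1:].strip()

    # Several translations may share one field, e.g. /阿米尔/奥马尔/.
    translations = [_unescape(t.strip()) for t in middle.split("/") if t.strip()]
    return Entry(_unescape(source), translations, rest or None)


if __name__ == "__main__":
    print(parse_entry("A'Mer /阿米尔/奥马尔/"))
    print(parse_entry("aba as suud /艾巴苏欧德/ Saudi Arabia"))
```

Splitting on the first and last slash, rather than on whitespace, is a deliberate choice here: names and regions such as "aba as suud" and "Saudi Arabia" contain spaces, so the slashes are the only reliable field boundary. The trailing text is returned as a single string and is not split on "_" or "-".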
Person Names
============

ldc_propernames_people_ce_v1.beta.txt    total: 486,212
Examples:
    阿阿贝伊奥卢 /Agabeyoglu/
    阿阿卡伊 /Agakay/

ldc_propernames_people_ec_v1.beta.txt    total: 572,213
Examples:
    A'Mer /阿米尔/奥马尔/
    Aach /阿赫/

The person lists have just two fields on each line. The original data also has country/region information for most of the entries. Since most names with the same spelling are translated with the same sequence of Chinese characters, we think there is no need to include that information as an extra field.

Place names
===========

ldc_propernames_place_ce_v1.beta.txt    total: 276,382
Examples:
    阿巴登塔佩 /abaden tappeh/ Iran
    阿巴拉契亚大谷地 /great appalachian v./ USA_Canada

ldc_propernames_place_ec_v1.beta.txt    total: 298,993
Examples:
    aba bouyyi /阿巴布伊/ Ethiopia
    aba as suud /艾巴苏欧德/ Saudi Arabia

The place name lists have a third field for a "larger" geographical body. For example, the Mississippi is in the US, so US would be in the third field. Since some geographical bodies extend over several countries/regions, the third field may list more than one country/region. In such cases, an underscore "_" is used to link these countries/regions together. However, we might replace the underscore with a hyphen, as is used in the other lists, or with some other character, since a couple of country/region names contain a hyphen themselves.

Organization, industry, and press names
=======================================

ldc_propernames_org_ce_v1.beta.txt    total: 30,800
Examples:
    阿尔伯塔大学 /university of alberta/ Canada
    GT环球开发基金会 /gt global development fund/ USA

ldc_propernames_org_ec_v1.beta.txt    total: 37,145
Examples:
    Baroque Chamber Chorus (Beijing) /北京巴洛克室内合唱团/ China
    B1901 /B1901俱乐部足球队/ Denmark

ldc_propernames_industry_ce_v1.beta.txt    total: 54,747
Examples:
    EMC能源公司 /emc energies inc./ USA
    3COM公司 /3com corp./ USA

ldc_propernames_industry_ec_v1.beta.txt    total: 58,468
Examples:
    AT&T China Co., Ltd. /AT&T中国有限公司/ China-USA
    Asia Pacific Bank /亚太商业银行/ Taiwan

ldc_propernames_press_ce_v1.beta.txt    total: 29,757
Examples:
    阿比让广播电台 /radio abidjan/ Ivory Coast
    阿贝赛-安哥拉日报 /abc-diario de angola/ Angola

ldc_propernames_press_ec_v1.beta.txt    total: 32,922
Examples:
    aachener nachrichten /亚琛新闻/ Germany
    Beijing Review /北京周报/ China

Eventually, we might merge all of the above lists under the category of organization.

The Xinhua source also has a set of organization databases that are NOT under the proper name section, with detailed information about most organizations (such as the name of the chair, its function, etc.). The list of China's organizations does not have English translations, but the list of international organizations does. The latter was extracted, resulting in the following two lists:

ldc_orgs_intl_ce_v1.beta.txt    total: 7,040
ldc_orgs_intl_ec_v1.beta.txt    total: 7,040

As it turns out, there is some overlap between *orgs_intl* and *propernames_org*. So in the next official release, we may merge the *orgs_intl* lists into the org lists as well and eliminate duplicates.

"Other" names
=============

These two lists have proper names for various entities, such as weapons, civilian or military ships and aircraft, famous novels, awards, holidays, etc.

ldc_propernames_other_ce_v1.beta.txt    total: 13,007
Examples:
    “质子”号火箭 /proton booster rocket/ Russia
    《悲惨世界》 /les miserables/ France

ldc_propernames_other_ec_v1.beta.txt    total: 14,066
Examples:
    F-16 fighter /F-16战斗机/ UNSPECIFIED
    SAM-6 anti-aircraft missile /“萨姆-6”防空导弹/ Russia

Note that the third field may be "UNSPECIFIED".

Who-Is-Who Lists
================

ldc_whoswho_china_ce_v1.txt; ldc_whoswho_china_ec_v1.txt    total: 30,028 each
ldc_whoswho_international_ce_v1.txt; ldc_whoswho_international_ec_v1.txt    total: 36,881 each

The original who-is-who databases have detailed information about the individuals.
The texts are usually in Chinese (particularly for the Chinese who-is-who entries), but some international figures also have English texts (often not parallel to the Chinese). For this reason, we include the source databases (ldc_whoswho_china_source.txt and ldc_whoswho_international_source.txt). In the created lists, there is a fourth field containing the index by which one can look up the full content in the source.

If you have suggestions or comments, please send them to Shudong Huang at: shudong@ldc.upenn.edu

===============================================
Copyright Notice: Portions (c) 2001 Xinhua News Agency, (c) 2002 Trustees of the University of Pennsylvania