Chinese <-> English Name Entity Lists v 1.0

Item Name: Chinese <-> English Name Entity Lists v 1.0
Author(s): Shudong Huang
LDC Catalog No.: LDC2005T34
ISBN: 1-58563-368-2
ISLRN: 410-883-638-016-6
DOI: https://doi.org/10.35111/85gr-tb32
Release Date: November 29, 2005
Member Year(s): 2005
DCMI Type(s): Text
Data Source(s): newswire
Application(s): cross-lingual information retrieval, information detection, information extraction, information retrieval, topic detection and tracking
Language(s): Mandarin Chinese
Language ID(s): cmn
Online Documentation: LDC2005T34 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Huang, Shudong. Chinese <-> English Name Entity Lists v 1.0 LDC2005T34. Web Download. Philadelphia: Linguistic Data Consortium, 2005.

Introduction

Chinese <-> English Name Entity Lists v 1.0 was developed by the Linguistic Data Consortium (LDC) and contains nine pairs of Chinese-English bi-directional name entity lists compiled from Xinhua News Agency newswire texts. The Chinese to English lists contain approximately 400,000 entities, and the English to Chinese lists contain approximately 435,000 entities.

Data

Not every irregularity in the original source has been detected and normalized. Some Chinese characters are not encoded in the source, and brackets are used to describe their composition. Except in the person name lists, most such instances were left untouched in the created lists. An effort was made to replace GB-encoded characters (such as Roman numerals) in the English translations with ASCII characters. However, no attempt has been made to do the opposite for Chinese names.
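The replacement of GB-encoded characters with ASCII can be illustrated with a minimal sketch. It assumes the text has already been decoded to Unicode and uses Unicode NFKC normalization as a stand-in; this is not necessarily the method used to build the lists, and the function name to_ascii_forms is hypothetical.

```python
import unicodedata

def to_ascii_forms(english_name):
    """Fold full-width letters and Roman-numeral characters to ASCII.

    NFKC normalization is an assumed stand-in for the (undocumented)
    procedure used for the lists. It should be applied to the English
    side only, since it would also alter full-width punctuation in
    Chinese text.
    """
    return unicodedata.normalize("NFKC", english_name)

# Made-up example: full-width Latin letters plus the single-character
# Roman numeral U+2162.
print(to_ascii_forms("Ｐｈａｓｅ Ⅲ"))
```

Applying such a step to the English side only matches the statement above that no corresponding normalization was attempted for the Chinese names.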

The use of slashes as delimiters presents another problem, because some names contain internal slashes. Initially, double quotes ("") were used to enclose a name with an internal slash to avoid confusion, without realizing that in ASCII the opening and closing marks are the same single character (") rather than a distinct enclosing pair as in GB. It was later decided to use &slash; instead. In future releases, some lists will be changed for greater consistency. Finally, most of the English names in the source are lowercase throughout. An effort was made to capitalize the initial letter (and possibly some internal ones) for person names, but not for other kinds of names, since most of those consist of multiple words, some of which are articles and prepositions.
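As a rough illustration of the delimiter issue, the sketch below splits one entry on slashes and then restores &slash; to a literal slash. The line layout (a headword followed by slash-delimited translations) is an assumption made for the example, not a documented format, and split_entry is a hypothetical helper; the release documentation should be checked for the actual layout of each list.

```python
def split_entry(line):
    """Split an entry such as 'headword /t1/t2/' into (headword, [t1, t2]).

    Assumed layout: the headword comes first and every translation is
    delimited by '/'. The escape '&slash;' is decoded only after
    splitting, so an internal slash is never mistaken for a delimiter.
    """
    head, _, rest = line.strip().partition("/")
    translations = [t.replace("&slash;", "/") for t in rest.split("/") if t]
    return head.strip().replace("&slash;", "/"), translations

# Made-up entry used only to show the escape handling.
print(split_entry("甲乙 /a&slash;b corp/ab/"))
```

Lists that still use the earlier double-quote convention would need separate handling, which this sketch does not attempt.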

The word "English" is somewhat misleading here. Although most of the foreign words are English or can appear in English texts, there are also many non-English words written in the Roman alphabet, some of which have English equivalents while others do not. No effort has been made to eliminate those non-English names for which English equivalents are available.

The English to Chinese version of each pair was created by reversing the Chinese to English version. Both versions were sorted with the built-in Unix sort utility.
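The reversal can be sketched as follows, assuming each Chinese to English entry has been parsed into a Chinese name and a list of English names. The function reverse_entries and the sample data are hypothetical. Note that Python's sorted orders strings by code point, which may differ from the locale-dependent order produced by Unix sort.

```python
from collections import defaultdict

def reverse_entries(entries):
    """Turn (chinese, [english, ...]) pairs into sorted (english, [chinese, ...]) pairs."""
    reversed_map = defaultdict(list)
    for chinese, english_names in entries:
        for name in english_names:
            reversed_map[name].append(chinese)
    # Sort the English keys and, within each key, the Chinese names.
    return sorted((name, sorted(cn)) for name, cn in reversed_map.items())

# Made-up entries for illustration only.
sample = [("北京", ["Beijing", "Peking"]), ("京", ["Beijing"])]
for english, chinese_names in reverse_entries(sample):
    print(english, chinese_names)
```

Under this assumed layout, one Chinese entry with several translations produces several reversed entries, so the two directions need not contain the same number of entries.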

The contents are as follows:

Chinese to English

Place Names 276,382
Organization Names 30,800
Corporate Names 54,747
Press Organization Names 29,757
Intl. Organization Names 7,040
Total 398,726

English to Chinese

Place Names 298,993
Organization Names 37,145
Corporate Names 58,468
Press Organization Names 32,922
Intl. Organization Names 7,040
Total 434,568
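The totals above can be cross-checked against the per-list counts with a few lines of arithmetic. The dictionary names are illustrative; the figures are copied from the two tables.

```python
# Entry counts copied from the two tables above.
c2e = {
    "Place Names": 276382,
    "Organization Names": 30800,
    "Corporate Names": 54747,
    "Press Organization Names": 29757,
    "Intl. Organization Names": 7040,
}
e2c = {
    "Place Names": 298993,
    "Organization Names": 37145,
    "Corporate Names": 58468,
    "Press Organization Names": 32922,
    "Intl. Organization Names": 7040,
}

c2e_total = sum(c2e.values())
e2c_total = sum(e2c.values())
print(c2e_total, e2c_total)
```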

Samples

For an example of the data in this publication, please view this sample (TXT) from the corporate names list.

Updates

None at this time.

Additional Licensing Instructions

This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.
