Mandarin News Text Corpus
Availability: CD-ROM
Data type: Text
Text type: Journalistic
Domain(s): International News
Language: Mandarin Chinese
General Description:
The Mandarin News Text Corpus includes text from various journalistic sources:
- newspaper text from Renmin Ribao (People's Daily)
- radio scripts from China Radio International
- newswire text from Xinhua newswire service
The articles cover a variety of topics, including international and
domestic news, sports, and culture.
Publishers:
China Radio International, Beijing, People's Republic of China
Renmin Ribao (People's Daily), Beijing, People's Republic of China
Xinhua News Agency, Beijing, People's Republic of China
Collector of Data: Linguistic Data Consortium
Collection time span: 1991-1996
Description of file organization: various - see individual docfiles
Number of files: 420 data files
Total size:
570 megabytes;
about 250 million text characters (3% ASCII, 97% GB-encoded 16-bit)
Tagging description:
The format uses a labeled bracketing, expressed in the style of SGML
(Standard Generalized Markup Language). Each article (originally a
separate file) is enclosed in
or
. We have also retained header
fields provided by the sources, which provide information such as topic,
date, and article ID -- see individual docfiles for details.
Characters are encoded in the "GB" system used in the People's Republic of
China. To view files conveniently in MULE (Multi-lingual Emacs), you may
want to use a simple shell script like the one provided in the tools/
directory.
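
For readers who want to inspect the text without MULE, the following is a minimal Python sketch of decoding GB-encoded bytes and counting ASCII versus GB characters. It assumes the "GB" encoding here is GB2312 in its usual EUC-CN byte form, which is what Python's built-in "gb2312" codec handles; the helper names decode_gb and count_chars and the sample string are hypothetical and are not part of the corpus tools.

```python
def decode_gb(raw: bytes) -> str:
    """Decode GB-encoded bytes into a Python string.

    Assumption: the data is GB2312 (EUC-CN). Bytes that cannot be
    decoded are replaced with U+FFFD instead of raising an error.
    """
    return raw.decode("gb2312", errors="replace")


def count_chars(text: str) -> tuple[int, int]:
    """Return (ascii_count, gb_count) for already-decoded text.

    Every non-ASCII character is counted as a GB character, so any
    U+FFFD replacement characters are included in gb_count.
    """
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    return ascii_count, len(text) - ascii_count


# Hypothetical sample: three GB characters followed by ASCII text.
sample = "新华社 Beijing".encode("gb2312")
text = decode_gb(sample)
ascii_count, gb_count = count_chars(text)
```

Each GB character occupies two bytes and each ASCII character one, so the sample above is 14 bytes long. To apply the same functions to a corpus file, open it in binary mode and pass the bytes to decode_gb.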
Quality Control:
We made some effort to remove particularly noisy material:
- We eliminated articles that did not contain at least 10 GB characters in
the