Mandarin News Text Corpus

Availability: CD-ROM
Data type: Text
Text type: Journalistic
Domain(s): International News
Language: Mandarin Chinese

General Description:

The Mandarin News Text Corpus includes text from various journalistic sources:

- newspaper text from Renmin Ribao (People's Daily)
- radio scripts from China Radio International
- newswire text from Xinhua newswire service

The articles cover a variety of topics, including international and domestic news, sports, and culture.

Publishers:

- China Radio International, Beijing, People's Republic of China
- Renmin Ribao (People's Daily), Beijing, People's Republic of China
- Xinhua News Agency, Beijing, China

Collector of Data: Linguistic Data Consortium
Collection time span: 1991-1996
Description of file organization: various - see individual docfiles
Number of files: 420 data files
Total size: 570 megabytes; about 250 million text characters (3% ASCII, 97% GB-encoded 16-bit)

Tagging description:

The format uses a labeled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). Each article (originally a separate file) is enclosed in ... markers, and the main content of the article should be enclosed in ..., with apparent paragraph boundaries marked by one of two tags. We have also retained the header fields supplied by the sources, which give information such as topic, date, and article ID; see individual docfiles for details.

Characters are encoded in the "GB" system used in the People's Republic of China. To view files conveniently in MULE (Multi-lingual Emacs), you may want to use a simple shell script like the one provided in the tools/ directory.

Quality Control:

We made some effort to remove particularly noisy material:

- We eliminated articles that did not contain at least 10 GB characters in the main-content portion of the article.
- We removed articles that contained unpaired high-bit characters. (Since GB characters appear as pairs of high-bit bytes, an odd-length run of high-bit bytes indicates corrupt GB character encoding.)
- Some regular control character sequences that did not appear to indicate overall noisy content were removed, leaving the rest of the article intact.
- For the remaining control and null characters, we generally removed the entire affected article.

As a result of these efforts, there should be no null or control characters in the text files. However, some noisy material most likely remains, since not all errors conveniently flag themselves with illegal characters. In particular, we noticed a number of charts and other non-prose material in ch_radio that we did not attempt to remove, as they would have been difficult to identify consistently and someone may find them useful.

Credits:

Rebecca Finch made the arrangements necessary to collect the text, and David Graff supervised the data collection. Zhibao Wu wrote the original programs to convert the material into SGML. Robert MacIntyre handled most of the systems work necessary to assemble this CD-ROM release, including file management, quality control, and documentation. Yuan Shan Tung provided invaluable assistance with quality control.

Contact for questions or to report errors: ldc@ldc.upenn.edu
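
As an illustration of the checks described under Quality Control, the following is a minimal Python sketch. It is not the program actually used to build this release. All function names are hypothetical, and two details are assumptions rather than facts from this document: that tab, newline, and carriage return are not counted as control characters, and that the check is applied to the raw bytes of an article's main content.

```python
import re

# Runs of consecutive bytes with the high bit set (0x80-0xFF).
HIGH_BIT_RUN = re.compile(rb"[\x80-\xff]+")

# NUL and control bytes. Assumption: tab, LF, and CR are allowed in text.
CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def has_unpaired_high_bit(data: bytes) -> bool:
    """True if any run of high-bit bytes has odd length.

    GB characters are pairs of high-bit bytes, so an odd-length run
    cannot be a sequence of whole characters.
    """
    return any(len(m.group()) % 2 for m in HIGH_BIT_RUN.finditer(data))


def count_gb_chars(data: bytes) -> int:
    """Count two-byte characters, taking each high-bit pair as one character."""
    return sum(len(m.group()) // 2 for m in HIGH_BIT_RUN.finditer(data))


def has_control_chars(data: bytes) -> bool:
    """True if the data contains a NUL or other disallowed control byte."""
    return CONTROL_BYTES.search(data) is not None


def looks_clean(body: bytes, min_gb_chars: int = 10) -> bool:
    """Apply the three checks to the main content of one article."""
    return (
        not has_unpaired_high_bit(body)
        and not has_control_chars(body)
        and count_gb_chars(body) >= min_gb_chars
    )
```

A caller would read each article's main content as bytes and keep the article only if `looks_clean` returns true. An even-length run of high-bit bytes can still be corrupt, so a check of this kind only catches errors that show up as an odd byte count or an illegal character.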