People's Daily

Availability: CD-ROM
Data type: Text
Text type: Journalistic (newspaper)
Domain(s): National, International News
Language: Mandarin Chinese

General Description: The p_daily/ directory contains newspaper articles from the Beijing-based Renmin Ribao (People's Daily), the largest newspaper published by the government of the People's Republic of China. The agreement for research use of the text was reached with the Foreign Affairs Bureau of Renmin Ribao. The text archive was made available to the LDC in two phases: the first delivery, in 1994, came on more than 100 floppy disks; the second, in 1996, came on CD-ROM.

Publisher and place of publication: Renmin Ribao (People's Daily), Beijing, People's Republic of China
Collector of Data: Linguistic Data Consortium
Collection time span: 1991-1996
Description of file organization: one file per month.
Number of files: 72
Total size: 290 megabytes; about 125 million text characters (1% ASCII, 99% GB-encoded 16-bit)

Tagging description: The format uses a labeled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). Each article (originally a separate file) is enclosed in ... markers, and the main content of the article should be enclosed in ..., with apparent paragraph boundaries marked by

or

. Characters are encoded in the "GB" system used in the People's Republic of China. To view files conveniently in MULE (Multi-lingual Emacs), you may want to use a simple shell script like the one provided in the tools/ directory.

The header fields vary over the collection period. The first file (pd9101) has only fields to mark headlines. The rest of the 1991 material also includes date fields in English and fields for the author. The 1992 material has these same fields, but the dates are GB-encoded Arabic numerals with Chinese characters for year/month/day. From 1993 on, the dates appear in ASCII Arabic numerals, but the GB Chinese characters for year/month/day are still used after each number. (For example, January 12, 1994 appears as "1994年01月12日", where 年 is the Chinese character for "year", etc.) Finally, the 1994-1995 material uses longer tags of HEADLINE and AUTHOR, and adds tags of and .

Contact for questions or to report errors: ldc@ldc.upenn.edu
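The encoding and date conventions described above can be handled along the following lines. This is a minimal Python sketch, not a tool shipped with the corpus. It assumes that "GB" refers to GB 2312 (Python's standard "gb2312" codec), that the year/month/day characters are 年, 月 and 日, that the GB-encoded numerals in the 1992 material are full-width digits, and that the monthly files follow the pdYYMM pattern suggested by the single example pd9101. The function names are hypothetical.

```python
import re
import unicodedata
from datetime import date

# Assumption: "GB" means GB 2312, which Python exposes as the "gb2312" codec.
GB_CODEC = "gb2312"

# Digits followed by the characters for year, month and day.
DATE_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def decode_gb(raw: bytes) -> str:
    """Decode raw corpus bytes; undecodable bytes become U+FFFD."""
    return raw.decode(GB_CODEC, errors="replace")


def parse_pd_date(text: str):
    """Return the first year/month/day date found in text, or None.

    NFKC normalization folds full-width digits (assumed for the 1992
    material) to ASCII, so one pattern covers both date styles.
    """
    normalized = unicodedata.normalize("NFKC", text)
    match = DATE_RE.search(normalized)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


def monthly_file_names(first_year: int = 1991, last_year: int = 1996):
    """List one name per month, assuming a pdYYMM naming pattern."""
    return [
        f"pd{year % 100:02d}{month:02d}"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]
```

For instance, parse_pd_date("1994年01月12日") yields the date January 12, 1994, and monthly_file_names() produces 72 names beginning with pd9101, matching the file count given above. If some files contain characters outside GB 2312, a superset codec such as "gbk" may be worth trying instead.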