China Radio International

Availability: CD-ROM
Data type: Text
Text type: Journalistic - radio scripts
Domain(s): International News
Language: Mandarin Chinese

General Description:
The ch_radio/ directory contains China Radio International broadcast
scripts. The agreement to provide the text to the Linguistic Data
Consortium (LDC) for research purposes was negotiated through the
United Nations Bureau of China Radio International. The text is sent to
the LDC from Beijing on a quarterly basis on DOS floppy disks containing
many small files, which are processed into a single file containing all
of the Chinese data on the disk.

Publisher and place of publication: China Radio International, Beijing,
People's Republic of China
Collector of Data: Linguistic Data Consortium
Collection time span: 1994-1996

Description of file organization:
1 file per received floppy disk. The filenames have "dates"
corresponding to the label of the disk, which generally describes the
date of the actual material within a few days. Note that a single file
usually contains several days' worth of material, and a given day's
material may appear in more than one file. The "artfiltr" program is
provided in the bin/ directory for those who wish to sort the material
by date or by department. The file cr9410xx is something of an
exception, as it contains several disks' worth of material from October
and November 1994.

Number of files: 288
Total size: 225 megabytes; about 100 million text characters (5% ASCII,
95% GB-encoded 16-bit)

Tagging description:
The format uses a labeled bracketing, expressed in the style of SGML
(Standard Generalized Markup Language). Each article (originally a
separate file) is enclosed in ... markers, and the main content of the
article should be enclosed in ..., with apparent paragraph boundaries
marked by
<p>. There are also several header fields:

  ID      file/article ID string
  DATE    date of article in "YYYY MM DD" format
  HL      headline/title
  AU      originating department (news, sports, etc.)

See "Additional documentation" below for further details.

Characters are encoded in the "GB" system used in the People's Republic
of China. To view files conveniently in MULE (Multi-lingual Emacs), you
may want to use a simple shell script like the one provided in the
tools/ directory.

There is some potential for duplication between two files. The last
part of the ID (e.g. "gjxw0522.1" in "raw/950521/tg/gjxw0522.1") may be
a reliable indicator of article uniqueness -- i.e. articles with
identical ID-ends may have identical contents. There are probably
several hundred repeated articles, but this represents a fairly small
proportion of the data.

Contact for questions or to report errors: ldc@ldc.upenn.edu

Additional documentation:
The DATE and AU information in the SGML tags was extracted from the
document IDs (which are basically the DOS filenames). The coding system
was described in a letter from CRI's UN correspondent dated December 16,
1994 as follows:

"1. Under the codes of /TG, there are news and features in Chinese
language filed by International News Department, Domestic News Dept.,
Editor-in-chief Office, the Literature and Music Dept. and Chinese
Language Dept.

"The complete codes for these news and features are /TG/(code for
dept.)(code for news or features)(date)(serial number).

"Codes for depts are as follows:
"GJ-International News Dept.
"GN-Domestic News Dept.
"WY-Literature and Music Dept.
"TY-Sports Dept.
"ZB-Editor-in-chief's Office
"HQB-Chinese Language Dept.

"Codes for different types of stories:
"XW-news
"ZG-features
"TG-special features

"Codes for date: xx (month) xx (day)

"Codes for serial number: 1, 2, 3, 4, 5 ...

"For example: /TG/GJXW0912.9 means No. 9 news story filed by
International News Dept. on Sept. 12.

"2. Under the code of /ZB/XW.
there are news in Chinese filed by sections of various departments.

"Under the code of /ZB/ZG, there are features in Chinese filed by
sections of various departments."
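
The coding scheme quoted in the letter can be decoded mechanically. The
Python sketch below is illustrative only and is not part of the
distributed tools: the names id_end, parse_story_code and StoryCode are
hypothetical, and the pattern assumes that the last path component of an
ID follows the (dept.)(type)(month)(day).(serial) layout exactly. IDs in
the corpus may deviate from that layout, in which case the parser
returns None rather than guessing.

```python
import re
from typing import NamedTuple, Optional

# Code tables copied from the CRI letter quoted above.
DEPARTMENTS = {
    "GJ": "International News Dept.",
    "GN": "Domestic News Dept.",
    "WY": "Literature and Music Dept.",
    "TY": "Sports Dept.",
    "ZB": "Editor-in-chief's Office",
    "HQB": "Chinese Language Dept.",
}
STORY_TYPES = {"XW": "news", "ZG": "features", "TG": "special features"}


class StoryCode(NamedTuple):
    """Decoded fields of one story code (hypothetical helper type)."""
    department: str
    story_type: str
    month: int
    day: int
    serial: int


# Assumed layout: dept code + type code + MMDD + "." + serial number.
_PATTERN = re.compile(r"^(HQB|GJ|GN|WY|TY|ZB)(XW|ZG|TG)(\d{2})(\d{2})\.(\d+)$")


def id_end(article_id: str) -> str:
    """Return the last path component of an ID, upper-cased.

    This is the "ID-end" that may indicate duplicate articles.
    """
    return article_id.strip().rsplit("/", 1)[-1].upper()


def parse_story_code(article_id: str) -> Optional[StoryCode]:
    """Decode an ID such as "/TG/GJXW0912.9"; return None if it does not fit."""
    match = _PATTERN.match(id_end(article_id))
    if match is None:
        return None
    dept, kind, month, day, serial = match.groups()
    return StoryCode(DEPARTMENTS[dept], STORY_TYPES[kind],
                     int(month), int(day), int(serial))
```

For the example given in the letter, parse_story_code("/TG/GJXW0912.9")
yields the International News Dept., story type "news", month 9, day 12,
serial 9. Grouping articles by id_end() is one possible way to list
candidate duplicates; whether their contents are really identical still
has to be checked by comparing the text.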