China Radio International
Availability: CD-ROM
Data type: Text
Text type: Journalistic - radio scripts
Domain(s): International News
Language: Mandarin Chinese
General Description:
The ch_radio/ directory contains China Radio International broadcast
scripts. The agreement to provide the text to the Linguistic Data
Consortium (LDC) for research purposes was negotiated through the
United Nations Bureau of China Radio International. The text is sent
to the LDC from Beijing every quarter on DOS floppy disks. Each disk
contains many small files, which are processed into a single file
holding all of the Chinese data on that disk.
Publisher and place of publication: China Radio International,
Beijing, People's Republic of China
Collector of Data: Linguistic Data Consortium
Collection time span: 1994-1996
Description of file organization: 1 file per received floppy disk.
The filenames have "dates" corresponding to the label of the disk, which
generally matches the date of the actual material to within a few days.
Note that a single file usually contains several days' worth of material,
and a given day's material may appear in more than one file. The
"artfiltr" program is provided in the bin/ directory for those who wish to
sort the material by date or by department.
The file cr9410xx is something of an exception, as it contains several
disks' worth of material from October and November 1994.
Number of files: 288
Total size: 225 megabytes;
about 100 million text characters (5% ASCII, 95% GB-encoded 16-bit)
Tagging description:
The format uses a labeled bracketing, expressed in the style of SGML
(Standard Generalized Markup Language). Each article (originally a
separate file) is enclosed in
.
There are also several header fields:
  ID      file/article ID string
  DATE    date of article in "YYYY MM DD" format
  HL      headline/title
  AU      originating department (news, sports, etc.)
See "Additional documentation" below for further details.
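
As an illustration of the DATE field format, the following minimal Python
sketch converts a "YYYY MM DD" value into a date object. The function name
parse_article_date is hypothetical and is not part of the corpus tools; the
sketch assumes the field value has already been extracted from its markup.

```python
from datetime import date


def parse_article_date(field: str) -> date:
    """Parse a DATE header value written as "YYYY MM DD".

    Hypothetical helper for illustration only. It assumes three
    whitespace-separated integers and raises ValueError otherwise.
    """
    parts = field.split()
    if len(parts) != 3:
        raise ValueError(f"expected 'YYYY MM DD', got {field!r}")
    year, month, day = (int(part) for part in parts)
    return date(year, month, day)
```

Because the result is an ordinary date object, articles could then be
grouped or sorted by date with standard library tools.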
Characters are encoded in the "GB" system used in the People's Republic of
China. To view files conveniently in MULE (Multi-lingual Emacs), you may
want to use a simple shell script like the one provided in the tools/
directory.
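
For readers without MULE, the following Python sketch shows one possible
way to decode GB-encoded bytes into text. It assumes the data is compatible
with Python's "gb2312" codec, which is an assumption about the files rather
than something stated in this documentation; the function names are
hypothetical.

```python
def decode_gb(raw: bytes) -> str:
    """Decode GB-encoded bytes, assuming Python's "gb2312" codec applies.

    Undecodable byte sequences are replaced with U+FFFD instead of
    raising an error, so damaged data does not stop processing.
    """
    return raw.decode("gb2312", errors="replace")


def count_ascii_and_gb(text: str) -> tuple[int, int]:
    """Return (ASCII characters, non-ASCII characters) in decoded text."""
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    return ascii_count, len(text) - ascii_count


# Mixed sample: four ASCII characters followed by two GB characters.
sample = b"CRI \xd6\xd0\xb9\xfa"
decoded = decode_gb(sample)
```

A whole file could be handled the same way by reading it in binary mode and
passing its contents to decode_gb.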
There is some potential for duplication between two files. The last part
of the