Xinhua News Agency

Availability: CD-ROM
Data type: Text
Text type: Journalistic (newswire)
Domain(s): International News
Language: Mandarin Chinese

General Description: The xinhua/ directory contains newswire articles from the Xinhua News Agency, the official newswire service of the government of the People's Republic of China. The data is collected via telephone/modem at the Linguistic Data Consortium (LDC).

Publisher and place of publication: Xinhua News Agency, Beijing, China
Collector of Data: Linguistic Data Consortium
Collection time span: 1994-1996

Description of file organization: three files per month. For example, xh9602_2 contains Xinhua newswire service ("xh") articles from the second part of February 1996 (Feb. 11-19). Exception: xh96_567 contains all articles from 96/5/6 to 96/7/18. There may also be some mixing and overlap among the other files due to reception problems.

Number of files: 60
Total size: 62 megabytes; about 25 million text characters (2.5% ASCII, 97.5% GB-encoded 16-bit)

Tagging description: The format uses a labeled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). Each article is enclosed in ... markers, and the main content of the article should be enclosed in ..., with apparent paragraph and sentence boundaries marked by ... and ..., respectively. There are also several header fields:

    DOCNO     article ID string
    DATE      date and time of article
    HEADER    miscellaneous identifiers; appear in some articles after 95/10

As the beginning of the TEXT section, this should be a headline. However, this label occasionally occurs in the middle of the text, where it probably indicates a subheading or a list item.

Characters are encoded in the "GB" system used in the People's Republic of China. To view files conveniently in MULE (Multi-lingual Emacs), you may want to use a simple shell script like the one provided in the tools/ directory.

Contact for questions or to report errors: ldc@ldc.upenn.edu
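
The file naming scheme and character encoding described above can be illustrated with a short Python sketch. It is not part of the distributed tools/ directory, and it rests on several assumptions: that regular file names follow the pattern of the xh9602_2 example (prefix "xh", two-digit year, two-digit month, underscore, one-digit part number), that all two-digit years fall in the 1900s, and that the "GB" encoding can be read with Python's "gb2312" codec. The names parse_file_name, XinhuaFileName, and char_profile are hypothetical helpers introduced here for illustration only.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumed pattern for regular file names such as "xh9602_2":
# "xh" + two-digit year + two-digit month + "_" + one-digit part number.
_NAME_RE = re.compile(r"^xh(\d{2})(\d{2})_(\d)$")


class XinhuaFileName(NamedTuple):
    year: int   # four-digit year (assumes every two-digit year is 19xx)
    month: int  # 1-12
    part: int   # which part of the month the file covers


def parse_file_name(name: str) -> Optional[XinhuaFileName]:
    """Split a regular file name into year, month, and part.

    Returns None for names that do not fit the assumed pattern,
    such as the documented exception "xh96_567".
    """
    match = _NAME_RE.match(name)
    if match is None:
        return None
    yy, mm, part = (int(group) for group in match.groups())
    if not 1 <= mm <= 12:
        return None
    return XinhuaFileName(1900 + yy, mm, part)


def char_profile(raw: bytes) -> Tuple[int, int]:
    """Decode GB-encoded bytes and count (ASCII, non-ASCII) characters.

    Assumes the "GB" encoding is readable with Python's "gb2312" codec.
    Undecodable bytes (for example from reception problems) are replaced
    with U+FFFD and therefore counted as non-ASCII.
    """
    text = raw.decode("gb2312", errors="replace")
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    return ascii_count, len(text) - ascii_count
```

Calling parse_file_name("xh9602_2") yields year 1996, month 2, part 2, while the exceptional name xh96_567 is reported as None so that callers can handle it separately. Running char_profile over the raw bytes of a file gives a rough check against the ASCII/GB proportions listed under "Total size".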