The Linguistic Data Consortium (LDC) announces the availability of a Mandarin Chinese text corpus. This corpus includes about 250 million GB-encoded text characters.

The Mandarin News Corpus includes text from various journalistic sources:

  • newspaper text from Renmin Ribao (People's Daily)
  • radio scripts from China Radio International
  • newswire text from the Xinhua newswire service

The corpus is formatted with labeled bracketing in the style of SGML (Standard Generalized Markup Language). The header fields provided by the sources, which give information such as topic, date and article ID, have been retained. The articles cover a variety of topics, including international and domestic news, sports and culture.
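
As a rough illustration of how GB-encoded text with SGML-style bracketing might be read, the following Python sketch decodes a byte string and collects the bracketed fields of each document into a dictionary. The tag names (DOC, DOCID, DATE, TEXT), the sample content, and the function name parse_documents are assumptions made for this example; the actual tag set is defined by the corpus documentation, not by this sketch.

```python
import re

# Hypothetical sample record. The tag names and values are illustrative
# assumptions, not taken from the corpus itself.
SAMPLE = (
    "<DOC>\n"
    "<DOCID> example-001 </DOCID>\n"
    "<DATE> 1994-01-01 </DATE>\n"
    "<TEXT>\n"
    "新华社北京电\n"
    "</TEXT>\n"
    "</DOC>\n"
).encode("gb2312")

# One bracketed document, and one labeled field inside a document.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_documents(raw: bytes, encoding: str = "gb2312") -> list[dict[str, str]]:
    """Decode GB-encoded bytes and return one dict of fields per document."""
    # Undecodable bytes are replaced rather than raising an error.
    text = raw.decode(encoding, errors="replace")
    docs = []
    for match in DOC_RE.finditer(text):
        fields = {
            name.lower(): value.strip()
            for name, value in FIELD_RE.findall(match.group(1))
        }
        docs.append(fields)
    return docs


docs = parse_documents(SAMPLE)
```

Regular expressions are enough here only because the assumed records are flat and every tag is explicitly closed; markup that nests or omits end tags would call for a real SGML parser.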

