Chinese Web 5-gram Version 1

Item Name: Chinese Web 5-gram Version 1
Author(s): Fang Liu, Meng Yang, Dekang Lin
LDC Catalog No.: LDC2010T06
ISBN: 1-58563-539-1
ISLRN: 958-238-545-740-0
DOI: https://doi.org/10.35111/647p-yt29
Release Date: April 19, 2010
Member Year(s): 2010
DCMI Type(s): Text
Data Source(s): web collection
Application(s): language modeling
Language(s): Mandarin Chinese
Language ID(s): cmn
License(s): Chinese Web 5-gram Version 1 Agreement
Online Documentation: LDC2010T06 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Liu, Fang, Meng Yang, and Dekang Lin. Chinese Web 5-gram Version 1 LDC2010T06. Web Download. Philadelphia: Linguistic Data Consortium, 2010.

Introduction

Chinese Web 5-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2010T06 and ISBN 1-58563-539-1, was created by researchers at Google Inc. It consists of Chinese word n-grams and their observed frequency counts, generated from approximately 883 billion tokens of web text. The n-grams range in length from unigrams (single words) to 5-grams. This data should be useful for statistical language modeling (e.g., segmentation, machine translation) as well as for other purposes.

Included with this publication is a simple segmenter, written in Perl, that implements the same segmentation algorithm used to generate the data.
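As a rough illustration only, the following is a minimal Python sketch of dictionary-based greedy longest-match segmentation, a common baseline for Chinese text; the toy lexicon is hypothetical, and the bundled Perl segmenter implements the corpus's actual algorithm, which may differ.

    # Minimal sketch of greedy longest-match segmentation (a common baseline).
    # The bundled Perl segmenter implements the corpus's actual algorithm,
    # which is not guaranteed to match this sketch.
    def segment(text, lexicon, max_word_len=5):
        """Greedily take the longest lexicon match at each position."""
        words, i = [], 0
        while i < len(text):
            for j in range(min(len(text), i + max_word_len), i, -1):
                # Fall back to a single character when nothing matches.
                if text[i:j] in lexicon or j == i + 1:
                    words.append(text[i:j])
                    i = j
                    break
        return words

    lexicon = {"我们", "喜欢", "学习", "中文"}    # hypothetical toy lexicon
    print(segment("我们喜欢学习中文", lexicon))   # ['我们', '喜欢', '学习', '中文']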

Data Collection

N-gram counts were generated from approximately 883 billion word tokens of text from publicly accessible web pages. The data set includes only n-grams that appeared at least 40 times in the processed sentences; less frequent n-grams were discarded. Although the aim was to identify and collect only Chinese-language pages, some text from other languages is incidentally included in the final data.
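A minimal sketch of the counting-and-cutoff step follows, assuming one segmented sentence per input line; the production pipeline was run at web scale and is not described in this documentation.

    # Sketch of n-gram counting with the corpus's minimum-count cutoff of 40.
    # Taking an in-memory iterable of sentences is a simplifying assumption.
    from collections import Counter

    MIN_COUNT = 40  # n-grams seen fewer than 40 times were discarded

    def count_ngrams(sentences, n):
        counts = Counter()
        for sentence in sentences:
            tokens = sentence.split()
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        # Apply the frequency cutoff before emitting counts.
        return {g: c for g, c in counts.items() if c >= MIN_COUNT}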

Data collection took place in March 2008; no text that was created on or after April 1, 2008 was used to develop this corpus.

Preprocessing

The input character encoding of each document was detected automatically, and all text was converted to UTF-8. The text was then tokenized by an automatic tool, and every continuous sequence of Chinese characters was passed through the segmenter.
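The detection tool is not named in this documentation; a minimal sketch of the detect-then-convert step, using the third-party chardet package as a stand-in, might look like this:

    # Sketch of the detect-then-convert step. The chardet package is a
    # stand-in assumption; the detector actually used is not documented.
    import chardet

    def to_utf8(raw_bytes):
        guess = chardet.detect(raw_bytes)        # e.g. {'encoding': 'GB2312', ...}
        encoding = guess["encoding"] or "utf-8"  # fall back if detection fails
        return raw_bytes.decode(encoding, errors="replace").encode("utf-8")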

The following types of tokens are considered valid (a regex sketch follows the list):

  • A Chinese word containing only Chinese characters.
  • Numbers, e.g., 198, 2,200, 2.3, etc.
  • Single Latin tokens, such as Google, &ab, etc.
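
A sketch of these three token classes as regular expressions is given below; the exact character ranges used in production are not documented, so the CJK range here is an approximation.

    # Sketch of the three valid-token classes. The exact character ranges
    # used in production are undocumented; the CJK block is an approximation.
    import re

    VALID_TOKEN = re.compile(
        r"^(?:"
        r"[\u4e00-\u9fff]+"   # Chinese word: CJK Unified Ideographs only
        r"|\d[\d.,]*"         # number, e.g. 198, 2,200, 2.3
        r"|&?[A-Za-z]+"       # single Latin token, e.g. Google, &ab
        r")$"
    )

    for tok in ["中文", "2,200", "Google", "&ab", "abc中"]:
        print(tok, bool(VALID_TOKEN.match(tok)))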

Extent of Data

  • File sizes: approx. 30 GB compressed (gzip'ed) text files
  • Number of tokens: 882,996,532,572
  • Number of sentences: 102,048,435,515
  • Number of unigrams: 1,616,150
  • Number of bigrams: 281,107,315
  • Number of trigrams: 1,024,642,142
  • Number of fourgrams: 1,348,990,533
  • Number of fivegrams: 1,256,043,325
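
Google's web n-gram releases store one n-gram per line, followed by a tab and its count. Assuming this corpus follows the same layout, the compressed files can be streamed as below; the file name is hypothetical.

    # Stream a compressed n-gram file, assuming the Web 1T-style layout of
    # "w1 w2 ... wn<TAB>count" per line. The file name is hypothetical.
    import gzip

    def read_ngrams(path):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                ngram, count = line.rstrip("\n").split("\t")
                yield ngram.split(), int(count)

    for ngram, count in read_ngrams("5gm-0000.gz"):  # hypothetical name
        print(ngram, count)
        break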
