Chinese Web 5-gram Corpus, Version 1

Copyright 2008 Google Inc. All Rights Reserved

1. Introduction

This data set contains Chinese word n-grams and their observed frequency
counts. The length of the n-grams ranges from unigrams (single words) to
five-grams. We expect this data to be useful for statistical language
modeling, e.g., for segmentation and machine translation, as well as for
other uses. As a by-product, we also include a simple segmenter written in
Perl that uses the same algorithm we used to generate the data.

1.1 Source Data

The n-gram counts were generated from approximately 883 billion word tokens
of text from Web pages. We used only publicly accessible Web pages. We
attempted to use only Web pages with Chinese text, but some text from other
languages also found its way into the data.

1.2 Date of Data Collection

Data collection took place in March 2008. This means that no text created on
or after April 1, 2008 was used.

2. Data Preprocessing

2.1 Character Encoding

The input encoding of the documents was automatically detected, and all text
was converted to UTF-8.

2.2 Tokenization

The data were tokenized by an automatic tool, and all contiguous sequences of
Chinese characters were passed to the segmenter for word segmentation. The
following types of tokens are considered valid:

- Chinese words consisting only of Chinese characters.
- Numbers, e.g., 198, 2,200, 2.3, etc.
- Single Latin-script tokens, such as Google, &ab, etc.

2.3 Filtering

We attempted to filter out all tokens that do not belong in Chinese word
n-gram counts. This includes tokens with any of the following
characteristics:

- Tokens that are too long.
- Tokens containing ASCII control characters.
- Tokens made up of a combination of letters, punctuation, and/or digits that
  does not seem useful.

2.4 The Token "<UNK>"

All filtered tokens, as well as tokens that fell beneath the word frequency
cutoff (see 3.1 below), were mapped to the special token "<UNK>" (for
"unknown word").

2.5 Sentence Boundaries

Sentence boundaries were automatically detected. The beginning of a sentence
was marked with "<S>" and the end of a sentence with "</S>". The inserted
tokens "<S>" and "</S>" were counted like other words and appear in the
n-gram tables. So, for example, the unigram count for "<S>" is equal to the
number of sentences into which the training corpus was divided.

3. Frequency Cutoffs

3.1 Word Frequency Cutoff

All tokens (words, numbers, and punctuation) appearing 200 times or more were
kept and appear in the n-gram tables. Tokens with lower counts were mapped to
the special token "<UNK>".

3.2 N-gram Frequency Cutoff

N-grams appearing 40 times or more were kept and appear in the n-gram tables.
All n-grams with lower counts were discarded.

4. Data Format

4.1 Contents of Top-level Directory

Directory "doc": documentation (replicated on dvd1 - dvd7).
Directory "data": n-gram data.
Directory "segmenter": segmenter code (only on dvd1).

4.2 Contents of "doc" Directory

Contains the readme documents:

- readme_en: English version
- readme_zh: Chinese version

4.3 Contents of "data" Directory

There are 394 files in total, ngrams-[00000-00393]-of-00394.gz, in the "data"
directories of the DVDs. The files used to store the different orders of
n-grams are:

  unigrams:  ngrams-00000-of-00394.gz
  bigrams:   ngrams-[00001-00029]-of-00394.gz
  trigrams:  ngrams-[00030-00132]-of-00394.gz
  fourgrams: ngrams-[00133-00267]-of-00394.gz
  fivegrams: ngrams-[00268-00393]-of-00394.gz

4.3.1 Files ngrams-?????-of-00394.gz

Each ngrams-KKKKK-of-00394.gz is a gzip'ed file containing n-gram data, where
KKKKK is the zero-padded number of the file, ranging from 00000 to 00393.
Each file contains up to 10 million n-gram entries, and the n-grams within
each file are unix-sorted. Each n-gram occupies one line:

  WORD_1 WORD_2 ... WORD_N COUNT
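As an illustration of this format, here is a minimal Perl sketch (not part of
the released tools; the script name count_ngrams.pl is our own) that streams
one of the gzip'ed files and tallies its entries. It shells out to gzip for
decompression and assumes only what is stated above: one n-gram per line,
with the frequency count as the last whitespace-separated field.

  #!/usr/bin/perl
  # count_ngrams.pl -- illustrative sketch only, not part of this release.
  # Streams a gzip'ed n-gram file and reports how many entries it holds
  # and the sum of their frequency counts.
  use strict;
  use warnings;

  my $file = shift @ARGV or die "usage: $0 ngrams-KKKKK-of-00394.gz\n";

  # Decompress on the fly; avoids unpacking the ~30 GB of data on disk.
  open(my $fh, "-|", "gzip", "-cd", $file) or die "cannot open $file: $!\n";

  my ($entries, $total) = (0, 0);
  while (my $line = <$fh>) {
      chomp $line;
      my @fields = split ' ', $line;   # WORD_1 ... WORD_N COUNT
      my $count  = pop @fields;        # the count is the last field
      $entries++;
      $total += $count;
  }
  close($fh);

  # %.0f so very large totals print in full even if Perl stores them as floats.
  printf "%s: %d n-gram entries, total count %.0f\n", $file, $entries, $total;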
4.4 Contents of "segmenter" Directory

A simple segmenter written in Perl is included in this directory. It uses the
same algorithm and data as the one we used when generating this data set.

- segmenter.pl: the Perl script of the segmenter.
- vocab.txt: the word list and corresponding frequencies used by the
  segmenter. The words and frequencies were automatically mined from a
  separate corpus.

5. Data Sizes

File sizes:          approx. 30 GB compressed (gzip'ed) text files
Number of tokens:    882,996,532,572
Number of sentences: 102,048,435,515
Number of unigrams:  1,616,150
Number of bigrams:   281,107,315
Number of trigrams:  1,024,642,142
Number of fourgrams: 1,348,990,533
Number of fivegrams: 1,256,043,325

6. Acknowledging the Data

We are very pleased to be able to release this data, and we hope that many
groups find it useful in their work. If you use this data, we ask that you
acknowledge it in your presentations and publications. We are also interested
in the uses this data finds, so we would appreciate hearing from you about
how you have used it.

7. Contact Information

We welcome comments, suggestions, and questions about the contents of this
data, suggestions for possible future data sets, and any other feedback.
Please send email to:

  chinese-ngrams@google.com

Fang Liu, Meng Yang, Dekang Lin
Google Research
04-Jun-2008