Japanese Web N-gram Corpus Version 1

Copyright 2007 Google Inc. All Rights Reserved

1. Introduction

"Japanese Web N-gram Corpus" is a dataset of Japanese "word" n-grams and their observed frequency counts. This dataset should be useful for research in areas such as statistical machine translation, language modeling, and speech recognition.

1.1 Training Corpus

The n-grams were extracted from publicly accessible Web pages crawled by Google. We excluded pages requiring user authentication, pages containing "noarchive" or "noindex" meta tags, and pages under other special restrictions. We aimed to process only Japanese pages, but the corpus may contain some pages in other languages because of language detection errors.

1.2 Date of Data Collection

This dataset was created from Web pages that were crawled in July 2007.

2. Dataset Details

2.1 Preprocessing of Documents

Before collecting the n-grams, the Web pages were converted into UTF-8 encoding, normalized into Normalization Form KC (see below), and split into sentences. Ill-formed sentences were filtered out, and the remaining sentences were segmented into "words".

2.1.1 Encoding Conversion

All Web pages were converted into UTF-8 before processing. The n-grams in this dataset are encoded in UTF-8.

2.1.2 Normalization

All strings were normalized into Normalization Form KC (NFKC), which is described at http://www.unicode.org/unicode/reports/tr15/. The most important normalization rules for Japanese strings are:

- Full-width letters/digits were converted to ASCII letters/digits
- Half-width katakana were converted to full-width katakana
- Roman numeral glyphs were converted to sequences of ASCII characters (e.g. Ⅲ → III)
- Certain Japanese-specific symbols were converted (e.g. ㈱ → (株), ㌧ → トン)

We used ICU (see http://www-306.ibm.com/software/globalization/icu/index.jsp) to perform the normalization. Here is the code snippet that we used:

  #include <unicode/normlzr.h>  // Normalizer, UNORM_NFKC
  #include <unicode/unistr.h>   // UnicodeString
  #include <unicode/utypes.h>   // UErrorCode, U_SUCCESS

  // Illustrative wrapper: normalize a string into NFKC with ICU's Normalizer.
  UnicodeString NormalizeNFKC(const UnicodeString& src) {
    UnicodeString dst;
    UErrorCode status = U_ZERO_ERROR;
    Normalizer::normalize(src, UNORM_NFKC, 0, dst, status);
    if (U_SUCCESS(status)) {
      return dst;
    } else {
      return src;  // Use the original string when normalization failed
    }
  }

2.1.3 Sentence Splitting

We split text into sentences using the characters ".", "!", "?", "。" (full-width period), "！" (full-width exclamation mark), and "？" (full-width question mark) as delimiters. Note that this simple heuristic causes incorrect sentence breaks for sentences containing these characters in expressions like "モーニング娘。" or "Yahoo!".

2.1.4 Sentence Filtering

We filtered out sentences that meet any of the following conditions:

1. Shorter than 6 Unicode characters or longer than 1023 Unicode characters.
2. Hiragana ratio of less than 5%.
3. Japanese character ratio of less than 70%.

We regarded Unicode characters whose code point is in the range [U+3040, U+30FF] (Hiragana and Katakana), [U+31F0, U+31FF] (Katakana Phonetic Extensions), [U+3400, U+4DBF] (CJK Unified Ideographs Extension A), [U+4E00, U+9FFF] (CJK Unified Ideographs), or [U+F900, U+FAFF] (CJK Compatibility Ideographs) as "Japanese characters".

2.1.5 Segmentation

We segmented the preprocessed sentences into Japanese "words" using MeCab. Specifically, we used the mecab-0.96 and mecab-ipadic-2.7.0-20070801 packages, which were available at http://mecab.sourceforge.net. We did not perform any post-processing on MeCab's output, so any segmentation errors are reflected directly in the dataset.
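As an illustration only (this example is not part of the corpus documentation), a normalized sentence can be segmented into space-separated "words" with MeCab's command-line tool once MeCab and IPADIC are installed as described below. The output shown is what IPADIC typically produces and may differ slightly across dictionary versions:

  % echo "これはペンです。" | mecab -Owakati
  これ は ペン です 。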
Details on installing MeCab:

  % tar zxfv mecab-0.96.tar.gz
  % cd mecab-0.96
  % ./configure
  % make
  # make install

Details on installing IPADIC:

  % tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
  % cd mecab-ipadic-2.7.0-20070801
  % ./configure --with-charset=utf8
  % make
  # make install

2.2 Unknown Tokens

Any "words" that were not in the vocabulary due to the frequency cutoff (see Section 3.1) were considered unknown words and were replaced with the special token "<UNK>".

2.3 Sentence Boundary Markers

The special tokens "<S>" and "</S>" are used to represent the beginning and the end of a sentence, respectively. The unigram frequency of "<S>" is equal to the total number of processed sentences.

3. Frequency Cut-offs

3.1 Vocabulary

We restricted the vocabulary to "words" that appeared at least 50 times in the processed sentences. Less frequent words were replaced with the special token "<UNK>".

3.2 N-grams

This n-gram dataset contains only n-grams that appear at least 20 times in the processed sentences. Less frequent n-grams were simply discarded.

4. Data Directory/File Structure

4.1 Top-level Directories

Each DVD contains the top-level directories "doc/" and "data/". The "doc/" directory contains the documentation, and the "data/" directory contains the n-gram data. The n-gram data was split across DVDs based on disc capacity; the index files are included on every disc. (See Section 4.4.2 for details.)

4.2 "data/" Directory

Each "data/" directory contains subdirectories called "1gms", "2gms", ..., "7gms", which contain the unigram, bigram, ..., 7-gram data, respectively. The "1gms/" subdirectory also contains auxiliary files. (See Section 4.3.)

4.3 "data/1gms" Directory

4.3.1 vocab.gz

vocab.gz contains the vocabulary, compressed with gzip. Each line of the vocabulary file is a tab-separated pair of a "word" and its frequency. The vocabulary is sorted by the byte order of the words' UTF-8 representation.

4.3.2 vocab_cs.gz

This file contains the same information as vocab.gz, but it is sorted by frequency.

4.4 "1gms" ... "7gms" Directories

4.4.1 Ngm-KKKK.gz (n-gram files)

In the n-gram file names, "N" indicates the order of the n-grams, and "KKKK" is the sequence number of the Ngm-* files. Each file contains 10 million n-grams sorted by token. Each line of a file is a tab-separated pair of an n-gram and its frequency, and each n-gram is a space-separated sequence of "words":

  WORD_1 WORD_2 ... WORD_N FREQUENCY

4.4.2 Ngm.idx (index file)

The index file can be used to determine which file contains a given n-gram. It lists the first n-gram in each n-gram file. Each line is a tab-separated pair of an n-gram file name and the first n-gram in that file. (An illustrative way to inspect these files is shown in the appendix at the end of this document.)

5. Data Size

The total compressed data size is about 26GB.

  Number of tokens:          255,198,240,937
  Number of sentences:        20,036,793,177
  Number of unique unigrams:       2,565,424
  Number of unique bigrams:       80,513,289
  Number of unique trigrams:     394,482,216
  Number of unique 4-grams:      707,787,333
  Number of unique 5-grams:      776,378,943
  Number of unique 6-grams:      688,782,933
  Number of unique 7-grams:      570,204,252

Please note that "number of tokens" means the total number of "words" in the Web pages, counted before filtering out words due to the frequency cutoff.

6. How to Refer to this Dataset

If you publish results based on this dataset, please refer to it as:

  Taku Kudo and Hideto Kazawa, "Japanese Web N-gram Corpus Version 1".

7. Contact

If you have any questions, please contact japanese-corpus@google.com

Taku Kudo and Hideto Kazawa
Google Inc.
August 2007
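Appendix: Inspecting the Data Files (illustrative)

The commands below are not part of the dataset itself; they are only a sketch of one way to look at the files described in Section 4, assuming the current directory is a DVD's "data/" directory. vocab.gz and vocab_cs.gz are the file names given in Section 4.3; "2gm.idx" and "2gm-0000.gz" are hypothetical instances of the Ngm.idx and Ngm-KKKK.gz naming patterns from Section 4.4, and the actual sequence numbers on the discs may differ.

  % zcat 1gms/vocab_cs.gz | head -5                 # most frequent "words" first
  % zcat 1gms/vocab.gz | awk -F'\t' '$1 == "東京"'  # frequency of a single unigram
  % head -3 2gms/2gm.idx                            # first bigram of each bigram file
  % zcat 2gms/2gm-0000.gz | head -3                 # tab-separated bigram/frequency lines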