Japanese Web N-gram Corpus Version 1

Copyright 2007 Google Inc. All Rights Reserved

1. Introduction

"Japanese Web N-gram Corpus" is a dataset of Japanese "word" n-grams and their observed frequency counts. This dataset should be useful for research in areas such as statistical machine translation, language modeling, and speech recognition.

1.1 Training Corpus

The n-grams were extracted from publicly accessible Web pages crawled by Google. We excluded pages requiring user authentication, pages containing "noarchive" or "noindex" meta tags, and pages under other special restrictions. We aimed to process only Japanese pages, but the corpus may contain some pages in other languages because of language detection errors.

1.2 Date of Data Collection

This dataset was created from Web pages that were crawled in July 2007.

2. Dataset Details

2.1 Preprocessing of Documents

Before collecting the n-grams, the Web pages were converted into UTF-8 encoding, normalized into Normalization Form KC (see below), and split into sentences. Ill-formed sentences were filtered out, and the remaining sentences were segmented into "words".

2.1.1 Encoding Conversion

All Web pages were converted into UTF-8 before processing. The n-grams in this dataset are encoded in UTF-8.

2.1.2 Normalization

All strings were normalized into Normalization Form KC (NFKC), which is described at http://www.unicode.org/unicode/reports/tr15/. The most important normalization rules for Japanese strings are:

- Full-width letters/digits were converted to ASCII letters/digits
- Half-width katakana were converted to full-width katakana
- Roman numeral glyphs were converted to sequences of ASCII characters (e.g. Ⅲ → III)
- Certain Japanese-specific symbols were converted (e.g. ㈱ → (株), ㌧ → トン)

We used ICU (see http://www-306.ibm.com/software/globalization/icu/index.jsp) to perform the normalization. Here is the code snippet that we used:

  #include <unicode/normlzr.h>  // Normalizer, UNORM_NFKC
  #include <unicode/unistr.h>   // UnicodeString
  #include <unicode/utypes.h>   // UErrorCode, U_SUCCESS

  // Illustrative wrapper: normalize a string into NFKC with ICU's Normalizer.
  UnicodeString NormalizeNFKC(const UnicodeString& src) {
    UnicodeString dst;
    UErrorCode status = U_ZERO_ERROR;
    Normalizer::normalize(src, UNORM_NFKC, 0, dst, status);
    if (U_SUCCESS(status)) {
      return dst;
    } else {
      return src;  // Use the original string when normalization failed
    }
  }

2.1.3 Sentence Splitting

We split text into sentences using the characters ".", "!", "?", "。" (full-width period), "！" (full-width exclamation mark), and "？" (full-width question mark) as delimiters. Note that this simple heuristic causes incorrect sentence breaks for sentences containing these characters in expressions like "モーニング娘。" or "Yahoo!".

2.1.4 Sentence Filtering

We filtered out sentences that meet any of the following conditions:

1. Shorter than 6 Unicode characters or longer than 1023 Unicode characters.
2. Hiragana ratio of less than 5%.
3. Japanese character ratio of less than 70%.

We regarded Unicode characters whose code point is in the range [U+3040, U+30FF] (Hiragana and Katakana), [U+31F0, U+31FF] (Katakana Phonetic Extensions), [U+3400, U+4DBF] (CJK Unified Ideographs Extension A), [U+4E00, U+9FFF] (CJK Unified Ideographs), or [U+F900, U+FAFF] (CJK Compatibility Ideographs) as "Japanese characters".

2.1.5 Segmentation

We segmented the preprocessed sentences into Japanese "words" using MeCab. Specifically, we used the mecab-0.96 and mecab-ipadic-2.7.0-20070801 packages, which were available at http://mecab.sourceforge.net. We did not perform any post-processing on MeCab's output, so any segmentation errors are reflected directly in the dataset.
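As an illustration only (this example is not part of the corpus documentation), a normalized sentence can be segmented into space-separated "words" with MeCab's command-line tool once MeCab and IPADIC are installed as described below. The output shown is what IPADIC typically produces and may differ slightly across dictionary versions:

  % echo "これはペンです。" | mecab -Owakati
  これ は ペン です 。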
Details on installing MeCab:

  % tar zxfv mecab-0.96.tar.gz
  % cd mecab-0.96
  % ./configure
  % make
  # make install

Details on installing IPADIC:

  % tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
  % cd mecab-ipadic-2.7.0-20070801
  % ./configure --with-charset=utf8
  % make
  # make install

2.2 Unknown Tokens

Any "words" that were not in the vocabulary due to the frequency cutoff (see Section 3.1) were considered unknown words and were replaced with the special token "<UNK>".

2.3 Sentence Boundary Markers

The special tokens "<S>" and "</S>" are used to represent the beginning and the end of a sentence, respectively. The unigram frequency of "<S>" is equal to the total number of processed sentences.

3. Frequency Cut-offs

3.1 Vocabulary

We restricted the vocabulary to "words" that appeared at least 50 times in the processed sentences. Less frequent words were replaced with the special token "<UNK>".

3.2 N-grams

This n-gram dataset contains only n-grams that appear at least 20 times in the processed sentences. Less frequent n-grams were simply discarded.

4. Data Directory/File Structure

4.1 Top-level Directories

Each DVD contains the top-level directories "doc/" and "data/". The "doc/" directory contains the documentation, and the "data/" directory contains the n-gram data. The n-gram data was split across DVDs based on disc capacity; the index files are included on every disc. (See Section 4.4.2 for details.)

4.2 "data/" Directory

Each "data/" directory contains subdirectories called "1gms", "2gms", ..., "7gms", which contain the unigram, bigram, ..., 7-gram data, respectively. The "1gms/" subdirectory also contains auxiliary files. (See Section 4.3.)

4.3 "data/1gms" Directory

4.3.1 vocab.gz

vocab.gz contains the vocabulary, compressed with gzip. Each line of the vocabulary file is a tab-separated pair of a "word" and its frequency. The vocabulary is sorted by the byte order of the words' UTF-8 representation.

4.3.2 vocab_cs.gz

This file contains the same information as vocab.gz, but it is sorted by frequency.

4.4 "1gms" ... "7gms" Directories

4.4.1 Ngm-KKKK.gz (n-gram files)

In the n-gram file names, "N" indicates the order of the n-grams, and "KKKK" is the sequence number of the Ngm-* files. Each file contains 10 million n-grams sorted by token. Each line of a file is a tab-separated pair of an n-gram and its frequency, and each n-gram is a space-separated sequence of "words":

  WORD_1 WORD_2 ... WORD_N FREQUENCY

4.4.2 Ngm.idx (index file)

The index file can be used to determine which file contains a given n-gram. It lists the first n-gram in each n-gram file. Each line is a tab-separated pair of an n-gram file name and the first n-gram in that file. (An illustrative way to inspect these files is shown in the appendix at the end of this document.)

5. Data Size

The total compressed data size is about 26GB.

  Number of tokens:          255,198,240,937
  Number of sentences:        20,036,793,177
  Number of unique unigrams:       2,565,424
  Number of unique bigrams:       80,513,289
  Number of unique trigrams:     394,482,216
  Number of unique 4-grams:      707,787,333
  Number of unique 5-grams:      776,378,943
  Number of unique 6-grams:      688,782,933
  Number of unique 7-grams:      570,204,252

Please note that "number of tokens" means the total number of "words" in the Web pages, counted before filtering out words due to the frequency cutoff.

6. How to Refer to this Dataset

If you publish results based on this dataset, please refer to it as:

  Taku Kudo and Hideto Kazawa, "Japanese Web N-gram Corpus Version 1".

7. Contact

If you have any questions, please contact japanese-corpus@google.com

Taku Kudo and Hideto Kazawa
Google Inc.
August 2007
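Appendix: Inspecting the Data Files (illustrative)

The commands below are not part of the dataset itself; they are only a sketch of one way to look at the files described in Section 4, assuming the current directory is a DVD's "data/" directory. vocab.gz and vocab_cs.gz are the file names given in Section 4.3; "2gm.idx" and "2gm-0000.gz" are hypothetical instances of the Ngm.idx and Ngm-KKKK.gz naming patterns from Section 4.4, and the actual sequence numbers on the discs may differ.

  % zcat 1gms/vocab_cs.gz | head -5                 # most frequent "words" first
  % zcat 1gms/vocab.gz | awk -F'\t' '$1 == "東京"'  # frequency of a single unigram
  % head -3 2gms/2gm.idx                            # first bigram of each bigram file
  % zcat 2gms/2gm-0000.gz | head -3                 # tab-separated bigram/frequency lines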