Japanese Web N-gram Version 1


Item Name: Japanese Web N-gram Version 1
Authors: Taku Kudo and Hideto Kazawa
LDC Catalog No.: LDC2009T08
ISBN: 1-58563-510-3
Release Date: Apr 16, 2009
Data Type: text
Data Source(s): web collection
Application(s): language modeling
Language(s): Japanese
Language ID(s): jpn
Distribution: 6 DVD
Member fee: $0 for 2009 members
Non-member Fee: US $150.00
Reduced-License Fee: US $150.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Taku Kudo and Hideto Kazawa. 2009. Japanese Web N-gram Version 1. Linguistic Data Consortium, Philadelphia.

Introduction

Japanese Web N-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2009T08 and ISBN 1-58563-510-3, was created by Google Inc. It consists of Japanese "word" n-grams and their observed frequency counts, generated from over 255 billion tokens of text. The n-grams range in length from unigrams to seven-grams.

The n-grams were extracted from publicly accessible web pages that Google crawled in July 2007. The data set contains only n-grams that appear at least 20 times in the processed sentences; less frequent n-grams were discarded. Pages requiring user authentication, pages containing "noarchive" or "noindex" meta tags, and pages under other special restrictions were excluded from the final release. Although the aim was to process only Japanese pages, the corpus may contain some pages in other languages because of language detection errors. This data set should be useful for research in areas such as statistical machine translation, language modeling, and speech recognition.
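The frequency cutoff described above can be illustrated with a minimal Python sketch. It assumes the input is already a list of segmented sentences (lists of "word" strings); the function names `count_ngrams` and `prune` are hypothetical, and whether the original pipeline added sentence-boundary markers is not stated here, so none are added.

```python
from collections import Counter

MIN_NGRAM_COUNT = 20  # cutoff stated in the corpus description
MAX_ORDER = 7         # unigrams through seven-grams


def count_ngrams(sentences, max_order=MAX_ORDER):
    """Count every n-gram of length 1..max_order within each sentence."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count=MIN_NGRAM_COUNT):
    """Keep only n-grams observed at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}
```

At web scale the counting would have to be distributed; this in-memory version only shows the logic of the threshold.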

Data

Before the n-grams were collected, the web pages were converted into UTF-8 encoding, normalized into Unicode Normalization Form KC (see below), and split into sentences. Ill-formed sentences were filtered out, and the remaining sentences were segmented into "words".

All strings were normalized into Unicode Normalization Form KC (NFKC), which is described in http://www.unicode.org/unicode/reports/tr15/. Japanese strings were normalized according to the following rules:

  • Full-width letters/digits were converted to ASCII letters/digits
  • Half-width katakana were converted to full-width katakana
  • Roman numeral glyphs were converted to ASCII characters
  • Certain Japanese-specific symbols were converted
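The effect of these rules can be reproduced with the NFKC implementation in Python's standard `unicodedata` module. This is a sketch of NFKC itself, not of the original pipeline, which may have applied further processing. The sample strings are illustrative and not taken from the corpus, and the choice of "㈱" as a Japanese-specific symbol is an assumption.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Return text in Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


# Illustrative inputs (not from the corpus), one per rule listed above.
samples = [
    "ＡＢＣ１２３",  # full-width letters/digits
    "ｶﾀｶﾅ",  # half-width katakana
    "Ⅲ",  # Roman numeral glyph (U+2162)
    "㈱",  # parenthesized ideograph (U+3231); assumed example of a symbol
]

for s in samples:
    print(repr(s), "->", repr(normalize_nfkc(s)))
```

Applying the same normalization to query strings before looking them up in the n-gram tables avoids misses caused by width or glyph variants.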

The vocabulary was restricted to "words" that appeared at least 50 times in the processed sentences.
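A minimal Python sketch of such a vocabulary restriction follows. The description above does not say how out-of-vocabulary "words" were handled; replacing them with a placeholder token `<UNK>` is an assumption made here for illustration, and the function names are hypothetical.

```python
from collections import Counter

MIN_WORD_COUNT = 50  # vocabulary cutoff stated in the corpus description
UNK = "<UNK>"        # assumed placeholder for out-of-vocabulary words


def build_vocab(sentences, min_count=MIN_WORD_COUNT):
    """Return the set of words seen at least min_count times."""
    freq = Counter(tok for sent in sentences for tok in sent)
    return {word for word, c in freq.items() if c >= min_count}


def apply_vocab(sentence, vocab, unk=UNK):
    """Replace every word outside the vocabulary with the placeholder."""
    return [tok if tok in vocab else unk for tok in sentence]
```

Note that the word cutoff (50) is higher than the n-gram cutoff (20), so every unigram in the vocabulary also passes the n-gram threshold.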

Statistical information about the corpus is set forth in the following table:

Data size (total, compressed): about 26 GB
Number of tokens: 255,198,240,937
Number of sentences: 20,036,793,177
Number of unique unigrams: 2,565,424
Number of unique bigrams: 80,513,289
Number of unique trigrams: 394,482,216
Number of unique 4-grams: 707,787,333
Number of unique 5-grams: 776,378,943
Number of unique 6-grams: 688,782,933
Number of unique 7-grams: 570,204,252
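Two derived figures follow directly from these statistics: the average sentence length in tokens and the total number of unique n-grams across all orders. The short Python sketch below only does arithmetic on the numbers listed above; the variable names are illustrative.

```python
TOKENS = 255_198_240_937
SENTENCES = 20_036_793_177
UNIQUE_NGRAMS = {
    1: 2_565_424,
    2: 80_513_289,
    3: 394_482_216,
    4: 707_787_333,
    5: 776_378_943,
    6: 688_782_933,
    7: 570_204_252,
}

avg_tokens_per_sentence = TOKENS / SENTENCES
total_unique_ngrams = sum(UNIQUE_NGRAMS.values())

print(f"Average tokens per sentence: {avg_tokens_per_sentence:.2f}")  # 12.74
print(f"Total unique n-grams: {total_unique_ngrams:,}")  # 3,220,714,390
```

The number of unique n-grams peaks at 5-grams and then declines, which is what one would expect when longer n-grams increasingly fall below the frequency cutoff.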

Samples

  • Japanese Bigram
  • Japanese Trigram

Content Copyright

Portions © 2007 Google Inc., © 2009 Trustees of the University of Pennsylvania