Web 1T 5-gram Version 1


Item Name: Web 1T 5-gram Version 1
Authors: Thorsten Brants, Alex Franz
LDC Catalog No.: LDC2006T13
ISBN: 1-58563-397-6
Release Date: Sep 19, 2006
Data Type: text
Data Source(s): web collection
Application(s): language modeling
Language(s): English
Language ID(s): eng
Distribution: 6 DVD, or 2 BD
Member fee: $0 for 2006 members
Non-member Fee: US $150.00
Reduced-License Fee: US $150.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Thorsten Brants, Alex Franz. 2006. Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia.

Introduction

This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.

Source Data

The n-gram counts were generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.

Character Encoding

The input encoding of each document was detected automatically, and all text was converted to UTF-8.

Tokenization

The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:

  • Hyphenated words are usually separated, and hyphenated numbers usually form one token.
  • Sequences of numbers separated by slashes (e.g., in dates) form one token.
  • Sequences that look like URLs or email addresses form one token.
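
The exceptions listed above can be illustrated with a small regular-expression tokenizer. This is only a rough sketch and not the tokenizer used to build the corpus: the patterns, the names TOKEN_RE and tokenize, and the choice to emit the hyphen of a hyphenated word as its own token are all assumptions made for illustration. Penn Treebank conventions such as contraction splitting are not handled.

```python
import re

# Hypothetical patterns approximating the listed exceptions.
# Alternatives are tried in order, so the special cases come first.
TOKEN_RE = re.compile(r"""
    (?:https?://|www\.)\S+             # URL-like sequence -> one token
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+       # email-like sequence -> one token
  | \d+(?:/\d+)+                       # numbers separated by slashes
  | \d+(?:-\d+)+                       # hyphenated numbers
  | \w+                                # ordinary word
  | [^\w\s]                            # any other single symbol
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    """Split text into tokens using the illustrative patterns above."""
    return TOKEN_RE.findall(text)

tokens = tokenize("state-of-the-art 1-800 on 9/19/2006 at http://example.com or a@b.org")
print(tokens)
```

Under these assumptions the hyphenated word is split into its parts, while the hyphenated number, the date, the URL, and the email address each remain a single token.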

Data Sizes

File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
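
As a quick sanity check on these figures, the following sketch derives two summary numbers from the counts listed above: the average number of tokens per sentence and the total number of distinct n-gram entries across all five orders. The variable names are illustrative, and whether the token count includes sentence boundary markers is not stated here, so the average should be read as approximate.

```python
# Counts copied from the Data Sizes listing.
num_tokens = 1_024_908_267_229
num_sentences = 95_119_665_584
ngram_entries = {
    1: 13_588_391,
    2: 314_843_401,
    3: 977_069_902,
    4: 1_313_818_354,
    5: 1_176_470_663,
}

# Average sentence length in tokens (approximate, see the note above).
avg_tokens_per_sentence = num_tokens / num_sentences

# Total number of distinct n-gram entries over all orders.
total_entries = sum(ngram_entries.values())

print(f"{avg_tokens_per_sentence:.2f} tokens per sentence")
print(f"{total_entries:,} n-gram entries in total")
```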

Sample Data

The following is an example of the 3-gram data contained in this corpus:

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community , 109
ceramics community . 212
ceramics community for 61
ceramics companies . 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company , 133
ceramics company . 92
ceramics company 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85
ceramics composites ferrites 56
ceramics composition as 41
ceramics computer graphics 51
ceramics computer imaging 52
ceramics consist of 92

The following is an example of the 4-gram data in this corpus:

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the infrastructure 500
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the injector 56
serve as the inlet 41
serve as the inner 87
serve as the input 1323
serve as the inputs 189
serve as the insertion 49
serve as the insourced 67
serve as the inspection 43
serve as the inspector 66
serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
serve as the institution 279
serve as the institutional 461
serve as the instructional 173
serve as the instructor 286
serve as the instructors 161
serve as the instrument 614
serve as the instruments 193
serve as the insurance 52
serve as the insurer 82
serve as the intake 70
serve as the integral 68
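
Entries like those in the samples can be read with a few lines of Python. The sketch below assumes one n-gram per line, with the count as the last whitespace-separated field and everything before it as the n-gram's tokens. The function name parse_ngram_lines is hypothetical, and the sample list simply reuses three of the 4-gram entries shown above.

```python
from typing import Iterable, Iterator, Tuple

def parse_ngram_lines(lines: Iterable[str]) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield (tokens, count) pairs from n-gram lines.

    Assumption: the count is the last whitespace-separated field on each line.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        *tokens, count = fields
        yield tuple(tokens), int(count)

sample = [
    "serve as the initial 5331",
    "serve as the input 1323",
    "serve as the inspiration 1390",
]
parsed = list(parse_ngram_lines(sample))

# Rank the sample entries by observed frequency, most frequent first.
ranked = sorted(parsed, key=lambda item: item[1], reverse=True)
for tokens, count in ranked:
    print(" ".join(tokens), count)
```

To read a compressed file directly, the same function could be given a file object opened with gzip.open(path, "rt", encoding="utf-8"), where path is the location of one of the gzip'ed text files.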

Content Copyright

Portions © 2006 Google Inc., © 2006 Trustees of the University of Pennsylvania