Web 1T 5-gram Version 1
|Item Name:||Web 1T 5-gram Version 1|
|Author(s):||Thorsten Brants, Alex Franz|
|LDC Catalog No.:||LDC2006T13|
|Release Date:||September 19, 2006|
|Data Source(s):||web collection|
Web 1T 5-gram Version 1 Agreement
|Online Documentation:||LDC2006T13 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006.|
This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
The n-gram counts were generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.
The input encoding of documents was automatically detected, and all text was converted to UTF-8.
The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
- Hyphenated words are usually separated, and hyphenated numbers usually form one token.
- Sequences of numbers separated by slashes (e.g., in dates) form one token.
- Sequences that look like URLs or email addresses form one token.
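The exceptions listed above can be illustrated with a small regex-based sketch. This is not the tokenizer used to build the corpus: the patterns, the function name `tokenize`, and the choice to emit a separating hyphen as its own token are all assumptions made for illustration.

```python
import re

# Ordered alternatives: earlier patterns win, so URL-like, email-like and
# numeric sequences are matched before the generic word and punctuation rules.
# All patterns are illustrative assumptions, not the corpus's actual rules.
TOKEN_RE = re.compile(
    r"""
    (?:https?://|www\.)\S+          # sequences that look like URLs
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # sequences that look like email addresses
  | \d+(?:/\d+)+                    # numbers separated by slashes (e.g., dates)
  | \d+(?:-\d+)+                    # hyphenated numbers stay in one token
  | \w+                             # plain words and numbers
  | [^\w\s]                         # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("well-known site"))
print(tokenize("call 555-1234 on 12/25/2005"))
```

Under these assumed rules, "well-known" is split at the hyphen, while "555-1234" and "12/25/2005" each remain a single token. A URL pattern this simple also absorbs trailing punctuation, which a real tokenizer would likely handle separately.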
Data Sizes
- File sizes: approx. 24 GB compressed (gzip'ed) text files
- Number of tokens: 1,024,908,267,229
- Number of sentences: 95,119,665,584
- Number of unigrams: 13,588,391
- Number of bigrams: 314,843,401
- Number of trigrams: 977,069,902
- Number of fourgrams: 1,313,818,354
- Number of fivegrams: 1,176,470,663
The following is an example of the 3-gram data contained in this corpus:
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community , 109
ceramics community . 212
ceramics community for 61
ceramics companies . 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company , 133
ceramics company . 92
ceramics company </S> 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85
ceramics composites ferrites 56
ceramics composition as 41
ceramics computer graphics 51
ceramics computer imaging 52
ceramics consist of 92
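Entries like those in the 3-gram example can be read into (n-gram, count) pairs with a short sketch such as the following. It assumes one entry per line, with whitespace-separated tokens and the count as the last field, as the example suggests. The function names and the `path` argument are hypothetical and do not refer to actual file names in the release.

```python
import gzip


def parse_ngram_line(line):
    """Split one entry into (tuple_of_tokens, count); the count is the last field."""
    ngram, count = line.rstrip("\n").rsplit(None, 1)
    return tuple(ngram.split()), int(count)


def read_ngrams(path):
    """Yield parsed entries from a gzip-compressed UTF-8 text file.

    The path is a placeholder supplied by the caller (assumption).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_ngram_line(line)


# A few entries copied from the 3-gram example.
sample = [
    "ceramics collected by 52",
    "ceramics comes from 660",
    "ceramics company ! 4432",
]
counts = dict(parse_ngram_line(line) for line in sample)
print(counts[("ceramics", "comes", "from")])  # 660
```

Splitting from the right keeps the parser independent of n-gram length, so the same function works for unigrams through five-grams and for tokens that are punctuation marks.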
The following is an example of the 4-gram data in this corpus:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the infrastructure 500
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the injector 56
serve as the inlet 41
serve as the inner 87
serve as the input 1323
serve as the inputs 189
serve as the insertion 49
serve as the insourced 67
serve as the inspection 43
serve as the inspector 66
serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
serve as the institution 279
serve as the institutional 461
serve as the instructional 173
serve as the instructor 286
serve as the instructors 161
serve as the instrument 614
serve as the instruments 193
serve as the insurance 52
serve as the insurer 82
serve as the intake 70
serve as the integral 68
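As one illustration of the language-modeling use mentioned earlier, counts that share a context can be turned into relative frequencies of the final word. The sketch below is a plain count ratio, not a smoothed language model, and the name `relative_frequencies` is hypothetical. It normalizes only over the entries it is given, so on an excerpt like the one above the values are not true conditional probabilities: the full data contains many more continuations of "serve as the".

```python
from collections import defaultdict


def relative_frequencies(entries):
    """Map each context (all tokens but the last) to {last_word: share of count}."""
    totals = defaultdict(int)
    by_context = defaultdict(dict)
    for tokens, count in entries:
        context, word = tokens[:-1], tokens[-1]
        by_context[context][word] = by_context[context].get(word, 0) + count
        totals[context] += count
    return {
        context: {word: c / totals[context] for word, c in words.items()}
        for context, words in by_context.items()
    }


# Three entries copied from the 4-gram example.
entries = [
    (("serve", "as", "the", "initial"), 5331),
    (("serve", "as", "the", "input"), 1323),
    (("serve", "as", "the", "inspiration"), 1390),
]
freqs = relative_frequencies(entries)
context = ("serve", "as", "the")
best = max(freqs[context].items(), key=lambda kv: kv[1])
print(best[0])  # initial
```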