Web 1T 5-gram Version 1
| Item Name: | Web 1T 5-gram Version 1 |
| Author(s): | Thorsten Brants, Alex Franz |
| LDC Catalog No.: | LDC2006T13 |
| ISBN: | 1-58563-397-6 |
| ISLRN: | 831-344-220-094-6 |
| DOI: | https://doi.org/10.35111/cqpa-a498 |
| Release Date: | September 19, 2006 |
| Member Year(s): | 2006 |
| DCMI Type(s): | Text |
| Data Source(s): | web collection |
| Application(s): | language modeling |
| Language(s): | English |
| Language ID(s): | eng |
| License(s): | Web 1T 5-gram Version 1 Agreement |
| Online Documentation: | LDC2006T13 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006. |
| Related Works: | View |
Introduction
Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts for approximately 1 trillion tokens. The length of the n-grams ranges from unigrams (single words) to five-grams. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
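As an illustration of what "n-grams and their observed frequency counts" means, the following minimal sketch counts every n-gram from unigrams to five-grams in a short token list. The function name `count_ngrams` and the toy sentence are hypothetical, and the sketch ignores any sentence-boundary handling or frequency thresholds that the actual pipeline may have applied.

```python
from collections import Counter

def count_ngrams(tokens, max_order=5):
    """Return {order: Counter} mapping each n-gram tuple to its frequency."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        # Slide a window of width n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Toy input (hypothetical); the real corpus covers about 1 trillion tokens.
tokens = "the cat sat on the mat and the cat slept".split()
counts = count_ngrams(tokens)

print(counts[1][("the",)])        # frequency of the unigram "the"
print(counts[2][("the", "cat")])  # frequency of the bigram "the cat"
```

Counts of this kind are the raw material for statistical language models, which typically estimate the probability of a word from the ratio of an n-gram count to the count of its (n-1)-gram prefix.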
Data
The n-gram counts were generated from text taken from publicly accessible Web pages.
The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
- Hyphenated words are usually separated, and hyphenated numbers usually form one token.
- Sequences of numbers separated by slashes (e.g., in dates) form one token.
- Sequences that look like URLs or email addresses form one token.
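The exceptions listed above can be approximated with a small regular-expression tokenizer. This is only a rough sketch and not the tokenizer used to build the corpus: the pattern `TOKEN_RE`, the function `tokenize`, and the choice to keep the hyphen between separated words as its own token are all assumptions made for illustration.

```python
import re

# Alternatives are tried in order, so the "one token" cases come first.
TOKEN_RE = re.compile(r"""
    https?://\S+                     # URL-like sequence kept whole
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # email-like sequence kept whole
  | \d+(?:/\d+)+                     # numbers separated by slashes
  | \d+(?:-\d+)+                     # hyphenated numbers
  | \w+                              # ordinary run of word characters
  | \S                               # any other single symbol (assumed)
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens following the assumed rules above."""
    return TOKEN_RE.findall(text)

sample = ("Visit http://example.com/a-b or mail foo@example.com "
          "about well-known item 555-1234 on 09/19/2006")
print(tokenize(sample))
```

In this sketch "well-known" is split at the hyphen, while "555-1234", "09/19/2006", the URL, and the email address each remain a single token.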
The release consists of approximately 24 GB of compressed (gzip'ed) text files containing the following:
| Tokens | 1,024,908,267,229 |
| Sentences | 95,119,665,584 |
| Unigrams | 13,588,391 |
| Bigrams | 314,843,401 |
| Trigrams | 977,069,902 |
| Fourgrams | 1,313,818,354 |
| Fivegrams | 1,176,470,663 |
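A few derived figures can be computed directly from the table above. The short sketch below uses only the published counts; the variable names are illustrative, and the "average sentence length" is simply tokens divided by sentences, not a statistic stated in the corpus documentation.

```python
# Counts copied from the table above.
tokens = 1_024_908_267_229
sentences = 95_119_665_584
ngram_types = {
    1: 13_588_391,
    2: 314_843_401,
    3: 977_069_902,
    4: 1_313_818_354,
    5: 1_176_470_663,
}

# Mean number of tokens per sentence implied by the totals.
avg_sentence_length = tokens / sentences

# Total number of distinct n-gram entries across all five orders.
total_entries = sum(ngram_types.values())

print(f"average tokens per sentence: {avg_sentence_length:.2f}")
print(f"total n-gram entries: {total_entries:,}")
```

The totals imply just under 11 tokens per sentence on average and roughly 3.8 billion n-gram entries in all.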
Samples
For an example of the 3-gram data in this corpus, please review this text sample (TXT).
For an example of the 4-gram data in this corpus, please review this text sample (TXT).
Updates
None at this time.