Web 1T 5-gram Version 1
Item Name: | Web 1T 5-gram Version 1 |
Author(s): | Thorsten Brants, Alex Franz |
LDC Catalog No.: | LDC2006T13 |
ISBN: | 1-58563-397-6 |
ISLRN: | 831-344-220-094-6 |
DOI: | https://doi.org/10.35111/cqpa-a498 |
Release Date: | September 19, 2006 |
Member Year(s): | 2006 |
DCMI Type(s): | Text |
Data Source(s): | web collection |
Application(s): | language modeling |
Language(s): | English |
Language ID(s): | eng |
License(s): | Web 1T 5-gram Version 1 Agreement |
Online Documentation: | LDC2006T13 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006. |
Introduction
Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts for approximately 1 trillion tokens. The length of the n-grams ranges from unigrams (single words) to five-grams. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
Data
The n-gram counts were generated from text taken from publicly accessible Web pages.
The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
- Hyphenated words are usually separated, and hyphenated numbers usually form one token.
- Sequences of numbers separated by slashes (e.g. in dates) form one token.
- Sequences that look like URLs or email addresses form one token.
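The corpus's actual tokenizer is not distributed with the data, but the exceptions above can be approximated with a small regular-expression sketch. Everything here (the pattern names and the `tokenize` helper) is illustrative only, not the tool Google used:

```python
import re

# Illustrative patterns only; the corpus's real tokenizer is not distributed.
# Alternatives are tried in order, so multi-character tokens win over fallbacks.
TOKEN = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email addresses form one token
  | (?:https?://|www\.)\S+            # URL-like sequences form one token
  | \d+(?:[/-]\d+)+                   # slash/hyphen-joined numbers (e.g. dates)
  | \w+                               # plain words (hyphens split apart)
  | \S                                # any other single character
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens following the exceptions described above."""
    return TOKEN.findall(text)
```

For example, `tokenize("10/28/2006")` keeps the date as a single token, while a hyphenated word such as `state-of-the-art` is split into its parts and the hyphens.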
The corpus consists of approximately 24 GB of compressed (gzip'ed) text files, containing the following:
Tokens | 1,024,908,267,229 |
Sentences | 95,119,665,584 |
Unigrams | 13,588,391 |
Bigrams | 314,843,401 |
Trigrams | 977,069,902 |
Fourgrams | 1,313,818,354 |
Fivegrams | 1,176,470,663 |
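Count files of this kind commonly store one n-gram per line, with space-separated tokens followed by a tab and the frequency count. Assuming that layout (the file path and helper name below are hypothetical), a minimal reader might look like:

```python
import gzip

def read_ngrams(path):
    """Yield (tokens, count) pairs from a gzip'ed n-gram count file.

    Assumes one n-gram per line: space-separated tokens,
    then a tab, then the integer frequency count.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            ngram, _, count = line.rstrip("\n").rpartition("\t")
            yield ngram.split(" "), int(count)
```

Streaming the files line by line this way avoids decompressing any of the multi-gigabyte count files into memory at once.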
Samples
For an example of the 3-gram data in this corpus, please review this text sample (TXT).
For an example of the 4-gram data in this corpus, please review this text sample (TXT).
Updates
None at this time.