Web 1T 5-gram Version 1
|Item Name:||Web 1T 5-gram Version 1|
|Author(s):||Thorsten Brants, Alex Franz|
|LDC Catalog No.:||LDC2006T13|
|Release Date:||September 19, 2006|
|Data Source(s):||web collection|
Web 1T 5-gram Version 1 Agreement
|Online Documentation:||LDC2006T13 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006.|
Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts, computed from approximately 1 trillion tokens of text. The n-grams range in length from unigrams (single words) to five-grams. The data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
The n-gram counts were generated from text taken from publicly accessible Web pages.
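As a rough illustration of how n-gram counts feed a statistical language model, the sketch below computes an unsmoothed maximum-likelihood estimate of a word's probability given its preceding context. The function name `mle_conditional` and the toy counts are invented for illustration and are not taken from the corpus. A practical model would add smoothing or backoff for unseen n-grams.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]


def mle_conditional(counts: Dict[NGram, int], context: NGram, word: str) -> float:
    """Unsmoothed estimate of P(word | context) = count(context + word) / count(context)."""
    context_count = counts.get(context, 0)
    if context_count == 0:
        # The context was never observed, so the estimate is undefined;
        # this sketch simply returns 0.0.
        return 0.0
    return counts.get(context + (word,), 0) / context_count


# Toy counts, invented for illustration only (not real corpus values).
toy_counts = {
    ("serve", "as"): 100,
    ("serve", "as", "the"): 40,
    ("serve", "as", "a"): 25,
}

p_the = mle_conditional(toy_counts, ("serve", "as"), "the")
p_a = mle_conditional(toy_counts, ("serve", "as"), "a")
```

With these toy counts, "the" receives a higher estimate than "a" after the context "serve as", because its trigram count is larger relative to the same bigram count.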
The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
- Hyphenated words are usually separated, and hyphenated numbers usually form one token.
- Sequences of numbers separated by slashes (e.g., in dates) form one token.
- Sequences that look like URLs or email addresses form one token.
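The exceptions listed above can be approximated with a small regular-expression tokenizer. This is only a sketch of the described behavior, not the tokenizer used to build the corpus: the patterns and the name `tokenize` are assumptions, and the word "usually" in the list means the real rules have exceptions this sketch does not model.

```python
import re

# Hypothetical patterns approximating the listed exceptions.
# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(
    r"""
    (?:https?://|www\.)\S+              # URL-like sequence: one token
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+        # email-like sequence: one token
  | \d+(?:/\d+)+                        # numbers separated by slashes: one token
  | \d+(?:-\d+)+                        # hyphenated numbers: one token
  | \w+                                 # ordinary word or number
  | [^\w\s]                             # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list:
    """Split text into tokens following the (assumed) rules above."""
    return TOKEN_RE.findall(text)


hyphenated_word = tokenize("state-of-the-art")
date_like = tokenize("12/25/2005")
hyphenated_number = tokenize("555-1234")
```

Here the hyphenated word is split at each hyphen, while the slash-separated and hyphenated numbers each stay together as one token.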
The data comprises 24 GB of compressed (gzip'ed) text files.
For an example of the 3-gram data in this corpus, please review this text sample (TXT).
For an example of the 4-gram data in this corpus, please review this text sample (TXT).
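Because the data is distributed as gzip'ed text, it can be streamed line by line without unpacking it to disk first. The sketch below assumes each line holds a space-separated n-gram, a tab, and an integer count. That layout is an assumption and is not stated in this description, so check it against the text samples before relying on it. The function name `read_ngram_counts` and the file name in the usage comment are hypothetical.

```python
import gzip
from typing import Iterator, Tuple


def read_ngram_counts(path: str) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield (ngram, count) pairs from a gzip-compressed n-gram file.

    Assumed line layout: "w1 w2 ... wn<TAB>count", encoded as UTF-8.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the last tab so the n-gram text is kept intact.
            ngram, _, count = line.rpartition("\t")
            yield tuple(ngram.split(" ")), int(count)


# Usage (hypothetical file name):
# for ngram, count in read_ngram_counts("3gm-sample.gz"):
#     print(ngram, count)
```

Reading lazily with a generator keeps memory use flat, which matters when the compressed files total tens of gigabytes.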
None at this time.