Web 1T 5-gram Version 1

Item Name: Web 1T 5-gram Version 1
Author(s): Thorsten Brants, Alex Franz
LDC Catalog No.: LDC2006T13
ISBN: 1-58563-397-6
ISLRN: 831-344-220-094-6
DOI: https://doi.org/10.35111/cqpa-a498
Release Date: September 19, 2006
Member Year(s): 2006
DCMI Type(s): Text
Data Source(s): web collection
Application(s): language modeling
Language(s): English
Language ID(s): eng
License(s): Web 1T 5-gram Version 1 Agreement
Online Documentation: LDC2006T13 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006.

Introduction

Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts, computed over approximately 1 trillion word tokens of text. The n-grams range in length from unigrams (single words) to five-grams. The data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
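As an illustration of how n-gram counts feed a statistical language model, the following minimal sketch computes a maximum-likelihood bigram probability from counts. The counts and the function name mle_bigram_prob are invented for this example and are not taken from the corpus; a practical model would also apply smoothing, which is omitted here.

```python
from collections import Counter

# Toy counts, invented for illustration only; not values from the corpus.
unigram_counts = Counter({"the": 100})
bigram_counts = Counter({
    ("the", "cat"): 30,
    ("the", "dog"): 50,
    ("the", "end"): 20,
})


def mle_bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev).

    Computed as count(prev, word) / count(prev), with no smoothing.
    Returns 0.0 when the context word was never observed.
    """
    context_count = unigram_counts[prev]
    if context_count == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_count


if __name__ == "__main__":
    print(mle_bigram_prob("the", "dog"))
```

The same ratio extends to longer n-grams by dividing the count of an n-gram by the count of its (n-1)-gram prefix.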

Data

The n-gram counts were generated from text taken from publicly accessible Web pages.

The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:

  • Hyphenated words are usually separated, and hyphenated numbers usually form one token.
  • Sequences of numbers separated by slashes (e.g., in dates) form one token.
  • Sequences that look like URLs or email addresses form one token.
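The exceptions listed above can be approximated with a small regular-expression tokenizer. This is a rough sketch only, not the tokenizer used to build the corpus: the patterns, their ordering, and the choice to emit each hyphen in a hyphenated word as its own token are assumptions made for illustration.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
# All patterns here are illustrative assumptions, not the corpus's actual rules.
TOKEN_PATTERN = re.compile("|".join([
    r"https?://\S+|www\.\S+",          # URL-like sequences: one token
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",   # email-like sequences: one token
    r"\d+(?:/\d+)+",                   # numbers separated by slashes: one token
    r"\d+(?:-\d+)+",                   # hyphenated numbers: one token
    r"\w+",                            # plain words and numbers
    r"[^\w\s]",                        # any other single symbol, including "-"
]))


def tokenize(text):
    """Split text into tokens following the approximate rules above."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("state-of-the-art"))
    print(tokenize("call 1-800-555 on 12/31/2005"))
```

Because the word pattern does not cross hyphens, a hyphenated word is split into its parts, while the number patterns keep hyphen- or slash-joined digits together.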

The data consists of approximately 24 GB of compressed (gzip'ed) text files containing the following counts:

Tokens 1,024,908,267,229
Sentences 95,119,665,584
Unigrams 13,588,391
Bigrams 314,843,401
Trigrams 977,069,902
Fourgrams 1,313,818,354
Fivegrams 1,176,470,663
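Since the data is distributed as gzip'ed text, it can be streamed without decompressing to disk. The sketch below assumes each line holds a space-separated n-gram, a tab, and an integer count; that layout is an assumption to check against the text samples and the corpus documentation. The function names parse_ngram_line and iter_ngrams are hypothetical.

```python
import gzip


def parse_ngram_line(line):
    """Parse one line, assumed to be 'w1 w2 ... wn<TAB>count'."""
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    return tuple(ngram.split(" ")), int(count)


def iter_ngrams(source):
    """Yield (ngram_tuple, count) pairs from a gzip'ed UTF-8 text file.

    `source` may be a file path or a binary file object.
    Lines are read one at a time, so memory use stays small.
    """
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_ngram_line(line)
```

Streaming line by line matters at this scale, because the n-gram tables run to hundreds of millions of entries or more and will not fit comfortably in memory on most machines.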

Samples

For an example of the 3-gram data in this corpus, please review this text sample (TXT).

For an example of the 4-gram data in this corpus, please review this text sample (TXT).

Updates

None at this time.
