
Web 1T 5-gram Version 1

Item Name:	Web 1T 5-gram Version 1
Author(s):	Thorsten Brants, Alex Franz
LDC Catalog No.:	LDC2006T13
ISBN:	1-58563-397-6
ISLRN:	831-344-220-094-6
DOI:	https://doi.org/10.35111/cqpa-a498
Release Date:	September 19, 2006
Member Year(s):	2006
DCMI Type(s):	Text
Data Source(s):	web collection
Application(s):	language modeling
Language(s):	English
Language ID(s):	eng
License(s):	Web 1T 5-gram Version 1 Agreement
Online Documentation:	LDC2006T13 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006.
Related Works:	isSimilarWith: LDC2009T08 Japanese Web N-gram Version 1; LDC2009T25 Web 1T 5-gram, 10 European Languages Version 1; LDC2010T06 Chinese Web 5-gram Version 1

Introduction

Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts, computed from approximately 1 trillion tokens of text. The n-grams range in length from unigrams (single words) to five-grams. The data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
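
As a rough illustration of how n-gram counts feed a statistical language model, the sketch below estimates a conditional word probability by dividing an n-gram count by the count of its context. The counts are made-up toy values, not figures from the corpus, and the function name `mle_probability` is hypothetical. Practical systems usually add smoothing or backoff, which this sketch omits.

```python
from collections import Counter

# Hypothetical toy counts for illustration; these are NOT values from the corpus.
trigram_counts = Counter({
    ("the", "quick", "brown"): 30,
    ("the", "quick", "red"): 10,
})
bigram_counts = Counter({("the", "quick"): 50})


def mle_probability(context, word, higher_counts, lower_counts):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    context_count = lower_counts[context]
    if context_count == 0:
        # Unseen context: a real model would back off or smooth instead.
        return 0.0
    return higher_counts[context + (word,)] / context_count


p = mle_probability(("the", "quick"), "brown", trigram_counts, bigram_counts)
print(f"P(brown | the quick) = {p:.2f}")
```

With these toy counts the estimate is 30 / 50. The same function works for any order, as long as the two tables hold counts for n-grams and for their (n-1)-gram contexts.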

Data

The n-gram counts were generated from text taken from publicly accessible Web pages.

The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:

Hyphenated words are usually separated, and hyphenated numbers usually form one token.
Sequences of numbers separated by slashes (e.g., in dates) form one token.
Sequences that look like URLs or email addresses form one token.
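
The three exceptions above can be approximated with a small regular-expression tokenizer. This is only a sketch: the tokenizer actually used to build the corpus is not described on this page, so every pattern below is an assumption, including the choice to emit the hyphen between words as its own token.

```python
import re

# Illustrative patterns only; they approximate the listed exceptions and are
# not the tokenizer that produced the corpus.
TOKEN_PATTERN = re.compile(
    r"""
      (?:https?://|www\.)\S+           # URL-like sequence -> one token
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # email-like sequence -> one token
    | \d+(?:/\d+)+                     # numbers separated by slashes (dates)
    | \d+(?:-\d+)+                     # hyphenated numbers -> one token
    | \w+                              # a plain word or number
    | [^\w\s]                          # any other symbol, including a hyphen
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens using the approximate rules above."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("state-of-the-art"))
print(tokenize("call 555-1234 on 12/25/2005"))
print(tokenize("mail user@example.com or visit http://example.com/page"))
```

Alternatives are tried left to right, so the more specific patterns (URLs, email addresses, slash and hyphen number sequences) are listed before the generic word pattern.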

The release consists of 24 GB of compressed (gzip'ed) text files containing the following:

Tokens	1,024,908,267,229
Sentences	95,119,665,584
Unigrams	13,588,391
Bigrams	314,843,401
Trigrams	977,069,902
Fourgrams	1,313,818,354
Fivegrams	1,176,470,663
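
A quick arithmetic check on the figures above gives a feel for the scale of the data. The sketch below derives the average sentence length and the total number of n-gram entries across all five orders. The variable names are illustrative, and reading the per-order figures as numbers of n-gram entries is an interpretation of the table rather than something stated on this page.

```python
# Figures copied from the table above.
tokens = 1_024_908_267_229
sentences = 95_119_665_584
ngram_entries = {
    1: 13_588_391,
    2: 314_843_401,
    3: 977_069_902,
    4: 1_313_818_354,
    5: 1_176_470_663,
}

# Average number of tokens per sentence in the source text.
avg_tokens_per_sentence = tokens / sentences

# Total number of n-gram entries summed over orders 1 through 5.
total_ngram_entries = sum(ngram_entries.values())

print(f"{avg_tokens_per_sentence:.2f} tokens per sentence on average")
print(f"{total_ngram_entries:,} n-gram entries in total")
```

The average comes out a little under 11 tokens per sentence, and the five orders together hold roughly 3.8 billion entries.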

Samples

For an example of the 3-gram data in this corpus, please review this text sample (TXT).

For an example of the 4-gram data in this corpus, please review this text sample (TXT).
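
Files of this kind can be streamed without decompressing them to disk. The sketch below assumes each line holds the space-separated tokens of one n-gram, a tab, and then the count. That layout is an assumption and is not stated on this page, so check it against the samples and the online documentation. The helper name `iter_ngram_counts` and the demo file name are hypothetical, and the demo data is made up.

```python
import gzip
import os
import tempfile


def iter_ngram_counts(path):
    """Yield (tokens, count) pairs from a gzip-compressed n-gram file.

    Assumed layout (not confirmed by this page): one entry per line,
    space-separated tokens, a tab, then an integer count.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            ngram, _, count = line.rstrip("\n").rpartition("\t")
            if not ngram:
                continue  # skip lines that do not match the assumed layout
            yield tuple(ngram.split(" ")), int(count)


# Demonstration on a small made-up file, not on corpus data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-3gm.gz")  # hypothetical file name
    with gzip.open(demo_path, mode="wt", encoding="utf-8") as out:
        out.write("a toy example\t42\nanother toy line\t7\n")
    entries = list(iter_ngram_counts(demo_path))

for ngram_tokens, count in entries:
    print(" ".join(ngram_tokens), count)
```

Because the function is a generator, it processes one line at a time, which matters for files of this size.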

Updates

None at this time.
