Web 1T 5-gram, 10 European Languages Version 1

Item Name: Web 1T 5-gram, 10 European Languages Version 1
Author(s): Thorsten Brants, Alex Franz
LDC Catalog No.: LDC2009T25
ISBN: 1-58563-525-1
ISLRN: 930-499-840-946-0
DOI: https://doi.org/10.35111/mesn-fv79
Release Date: October 20, 2009
Member Year(s): 2009
DCMI Type(s): Text
Data Source(s): web collection
Application(s): language modeling
Language(s): Swedish, Spanish, Romanian, Portuguese, Polish, Dutch, Italian, French, German, Czech
Language ID(s): swe, spa, ron, por, pol, nld, ita, fra, deu, ces
License(s): Web 1T 5-gram, 10 European Languages Version 1 Agreement
Online Documentation: LDC2009T25 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Brants, Thorsten, and Alex Franz. Web 1T 5-gram, 10 European Languages Version 1 LDC2009T25. Web Download. Philadelphia: Linguistic Data Consortium, 2009.

Introduction

Web 1T 5-gram, 10 European Languages Version 1 was created by Google, Inc. It consists of word n-grams and their observed frequency counts for ten European languages: Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish. The length of the n-grams ranges from unigrams (single words) to five-grams. The n-gram counts were generated from approximately one hundred billion word tokens of text for each language, or approximately one trillion total tokens.

The n-grams were extracted from publicly accessible web pages collected from October 2008 to December 2008. This data set contains only n-grams that appeared at least 40 times in the processed sentences; less frequent n-grams were discarded. Although the aim was to identify and collect pages in the target languages only, the final data likely contains some text from other languages. This data set will be useful for statistical language modeling, including machine translation, speech recognition and other uses.
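
The frequency cutoff described above can be illustrated with a minimal Python sketch. This is not the pipeline used to build the corpus: the whitespace tokenization, the function name count_ngrams and the toy input are assumptions made for illustration, and sentence-boundary handling is not modeled. Only the n-gram orders (1 to 5) and the threshold of 40 come from the description.

```python
from collections import Counter

MAX_ORDER = 5    # unigrams through five-grams, as in the corpus
MIN_COUNT = 40   # cutoff stated for this release


def count_ngrams(sentences, max_order=MAX_ORDER, min_count=MIN_COUNT):
    """Count n-grams of order 1..max_order and drop those seen fewer than min_count times.

    Whitespace tokenization is an assumption; the corpus tokenizer is not described here.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_order + 1):
            # Slide a window of width n over the sentence; n-grams never cross sentences.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Toy input: one sentence repeated 40 times and another repeated 39 times.
sentences = ["das ist gut"] * 40 + ["das war gut"] * 39
kept = count_ngrams(sentences)
```

In this toy input, every n-gram containing "war" occurs only 39 times and is discarded, while the shared unigrams "das" and "gut" are kept with their combined counts.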

Data

The input encoding of documents was automatically detected, and all text was converted to UTF-8.
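
The detection method is not described. As a rough illustration of what such a conversion step might look like, the Python sketch below accepts input that is already valid UTF-8 and otherwise tries a list of fallback encodings. The function name to_utf8 and the fallback list are assumptions, and this trial-decoding approach cannot tell single-byte encodings apart, so a real detector would need statistical or metadata-based cues.

```python
def to_utf8(raw: bytes, fallbacks=("cp1252", "latin-1")) -> bytes:
    """Return raw re-encoded as UTF-8, guessing the input encoding.

    Strategy (an assumption, not the corpus method): keep the input if it is
    already valid UTF-8, otherwise try each fallback encoding in order.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode input with any candidate encoding")


# A Latin-1 encoded word is converted; UTF-8 input passes through unchanged.
converted = to_utf8("café".encode("latin-1"))
unchanged = to_utf8("žluťoučký".encode("utf-8"))
```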

The following statistics cover the entire release.

File sizes (entire corpus): approximately 27.9 GB compressed (bzip2) text files

Total number of tokens: 1,306,807,412,486
Total number of sentences: 150,727,365,731
Total number of unigrams: 95,998,281
Total number of bigrams: 646,439,858
Total number of trigrams: 1,312,972,925
Total number of four-grams: 1,396,154,236
Total number of five-grams: 1,149,361,413
Total number of n-grams: 4,600,926,713
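
As a quick consistency check on the figures above, the following Python sketch sums the per-order counts and compares the result with the reported total. It also derives the average sentence length in tokens, which is a computed figure and not one reported in the release; the variable names are chosen here for illustration.

```python
# Figures copied from the release statistics.
ngram_types = {
    1: 95_998_281,
    2: 646_439_858,
    3: 1_312_972_925,
    4: 1_396_154_236,
    5: 1_149_361_413,
}
total_ngrams = 4_600_926_713
total_tokens = 1_306_807_412_486
total_sentences = 150_727_365_731

# The per-order counts should add up to the reported total number of n-grams.
computed_total = sum(ngram_types.values())

# Average sentence length in tokens (derived, not part of the release statistics).
tokens_per_sentence = total_tokens / total_sentences

print(computed_total == total_ngrams)
print(round(tokens_per_sentence, 2))
```

The per-order counts do sum to the reported total, and the average works out to a little under nine tokens per sentence.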

Samples

For an example of the data in this corpus, please examine the sample file.
