French Gigaword First Edition

Item Name: French Gigaword First Edition
Author(s): David Graff
LDC Catalog No.: LDC2006T17
ISBN: 1-58563-405-0
ISLRN: 351-085-945-382-6
Release Date: November 17, 2006
Member Year(s): 2006
DCMI Type(s): Text
Application(s): natural language processing, language modeling, information retrieval
Language(s): French
Language ID(s): fra
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2006T17 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Graff, David. French Gigaword First Edition LDC2006T17. Web Download. Philadelphia: Linguistic Data Consortium, 2006.

French Gigaword First Edition is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania.

The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:

  • Agence France-Presse (afp_fre) May 1994 - July 2006
  • Associated Press French Service (apw_fre) Nov 1994 - July 2006

The seven-character codes in parentheses consist of a three-character source name abbreviation and the three-character language code ("fre"), separated by an underscore ("_"). The three-letter language code conforms to LDC's new internal convention based on the ISO 639-3 standard.
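As a minimal sketch of the naming convention just described, the following Python snippet splits a source code into its two parts. The function name `split_source_code` and the regular expression are illustrative assumptions, not part of the corpus documentation.

```python
import re

# Assumed pattern: three letters for the source, an underscore,
# then three letters for the language (e.g. "afp_fre").
SOURCE_CODE = re.compile(r"^([a-z]{3})_([a-z]{3})$")


def split_source_code(code):
    """Return (source, language) for a code such as 'afp_fre'.

    Hypothetical helper; upper-case forms such as 'AFP_FRE' are
    lower-cased before matching.
    """
    match = SOURCE_CODE.match(code.lower())
    if match is None:
        raise ValueError(f"not a source code: {code!r}")
    return match.group(1), match.group(2)


for code in ("afp_fre", "apw_fre"):
    source, language = split_source_code(code)
    print(code, "->", source, language)
```

Lower-casing first lets the same helper accept the upper-case form used in the summary tables.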

The overall totals for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are uncompressed (i.e., approximately 4.6 gigabytes in total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_FRE      147     1139     3445  482904  1797139
APW_FRE      141      389     1167  167405   622740
TOTAL        288     1528     4612  650309  2419879
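The "K-wrds" definition can be illustrated with a short Python sketch: strip anything that looks like an SGML tag, then count whitespace-separated tokens. This is only an approximation under stated assumptions. The tag pattern, the choice to replace each tag with a space, and the sample document (including its `id` value and element names) are all hypothetical and are not taken from the corpus or from LDC's own counting scripts.

```python
import re

# Assumption: anything between "<" and ">" is an SGML tag.
TAG = re.compile(r"<[^>]+>")


def count_tokens(sgml_text):
    """Count whitespace-separated tokens after removing SGML tags.

    Tags are replaced by a space so that words on either side of a
    tag are not glued together (an assumption about the counting).
    """
    text = TAG.sub(" ", sgml_text)
    return len(text.split())


# Hypothetical sample document, invented for illustration only.
sample = (
    '<DOC id="hypothetical-0001" type="story">\n'
    "<TEXT>\n"
    "<P>\n"
    "Le gouvernement a annoncé une réforme.\n"
    "</P>\n"
    "</TEXT>\n"
    "</DOC>\n"
)

tokens = count_tokens(sample)
print(tokens, "tokens =", tokens / 1000, "K-wrds")
```

Punctuation attached to a word counts as part of that token, which matches the "tokens of all types" wording above.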

The following tables present "Text-MB", "K-wrds" and "#DOCs" broken down by source and DOC type; "Text-MB" represents the total number of characters (including whitespace) after SGML tags are eliminated.

Source    Text-MB  K-wrds    #DOCs
type="advis":
AFP_FRE        79   10924    47044
APW_FRE         8    1381     6291
TOTAL          87   12305    53335
type="multi":
AFP_FRE        40    5964     6828
APW_FRE       118   18527    29797
TOTAL         158   24491    36625
type="other":
AFP_FRE       169   23723   155571
APW_FRE        72   11006    68429
TOTAL         241   34729   224000
type="story":
AFP_FRE      2848  442284  1587696
APW_FRE       866  136481   518223
TOTAL        3715  578765  2105919
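As a quick arithmetic check, the per-type "#DOCs" figures can be summed and compared with the overall totals given earlier. The sketch below copies the document counts from the tables; the variable names are illustrative. Only "#DOCs" is checked, because the "Text-MB" and "K-wrds" columns are rounded and need not add up exactly.

```python
# #DOCs per DOC type and source, copied from the breakdown tables.
doc_counts = {
    "advis": {"AFP_FRE": 47044, "APW_FRE": 6291},
    "multi": {"AFP_FRE": 6828, "APW_FRE": 29797},
    "other": {"AFP_FRE": 155571, "APW_FRE": 68429},
    "story": {"AFP_FRE": 1587696, "APW_FRE": 518223},
}

# Overall #DOCs per source, copied from the summary table.
overall = {"AFP_FRE": 1797139, "APW_FRE": 622740}

# Sum the four DOC types for each source.
per_source = {
    source: sum(row[source] for row in doc_counts.values())
    for source in overall
}
grand_total = sum(per_source.values())

for source, total in per_source.items():
    status = "matches" if total == overall[source] else "differs from"
    print(f"{source}: {total} {status} the summary table")

# Share of each DOC type in the whole collection.
for doc_type, row in doc_counts.items():
    type_total = sum(row.values())
    print(f"{doc_type}: {type_total} documents ({type_total / grand_total:.1%})")
```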

Samples

For an example of the data in this corpus, please view this image of the text of French Gigaword.
