French Gigaword First Edition


Item Name: French Gigaword First Edition
Authors: David Graff
LDC Catalog No.: LDC2006T17
ISBN: 1-58563-405-0
Release Date: Nov 17, 2006
Data Type: text
Application(s): information retrieval, language modeling, natural language processing
Language(s): French
Language ID(s): fra
Distribution: 1 DVD
Member fee: $0 for 2006 members
Non-member Fee: US $3500.00
Reduced-License Fee: US $1750.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: David Graff. 2006. French Gigaword First Edition. Linguistic Data Consortium, Philadelphia.

French Gigaword First Edition is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania.

The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:

  • Agence France-Presse (afp_fre) May 1994 - July 2006
  • Associated Press French Service (apw_fre) Nov 1994 - July 2006

The seven-character codes in parentheses consist of a three-character source name abbreviation and the three-character language code ("fre"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's new internal convention based on the ISO 639-3 standard.
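
As a minimal sketch of the naming convention just described, the Python snippet below splits a code into its source and language parts. The function name `split_source_code` is hypothetical and used only for illustration; it is not part of any LDC tooling.

```python
def split_source_code(code):
    """Split a code such as 'afp_fre' into (source, language).

    Assumes the convention described above: three characters,
    an underscore, then three characters.
    """
    source, sep, language = code.partition("_")
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code: {code!r}")
    return source, language


for code in ("afp_fre", "apw_fre"):
    print(code, "->", split_source_code(code))
```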

The overall totals for each source are summarized below. The "Totl-MB" numbers show the amount of data you get when the files are uncompressed (approximately 4.6 gigabytes in total); the "Gzip-MB" column shows the compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are the number of whitespace-separated tokens (of all types), in thousands, after all SGML tags are eliminated.

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_FRE      147     1139     3445  482904  1797139
APW_FRE      141      389     1167  167405   622740
TOTAL        288     1528     4612  650309  2419879
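
The token count described above (whitespace-separated tokens after SGML tags are eliminated) might be approximated as in the following Python sketch. The regular expression, the function name `count_tokens`, and the sample markup are illustrative assumptions, not the procedure LDC actually used; in particular, tags are simply deleted here rather than replaced by whitespace.

```python
import re

# Assumption: any "<...>" span is treated as an SGML tag.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text):
    """Count whitespace-separated tokens after removing SGML tags."""
    text = TAG_RE.sub("", sgml_text)
    return len(text.split())


# Made-up sample; the tag layout is only a guess at the corpus markup.
sample = (
    '<DOC type="story">\n'
    "<TEXT>\n"
    "<P>\n"
    "Le ministre a parlé hier.\n"
    "</P>\n"
    "</TEXT>\n"
    "</DOC>\n"
)
print(count_tokens(sample))
```

Dividing such a count by 1000 gives a figure comparable to the "K-wrds" column.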

The following tables present "Text-MB", "K-wrds" and "#DOCs" broken down by source and DOC type; "Text-MB" represents the total number of characters (including whitespace) after SGML tags are eliminated.

Source    Text-MB  K-wrds    #DOCs
type="advis":
AFP_FRE        79   10924    47044
APW_FRE         8    1381     6291
TOTAL          87   12305    53335
type="multi":
AFP_FRE        40    5964     6828
APW_FRE       118   18527    29797
TOTAL         158   24491    36625
type="other":
AFP_FRE       169   23723   155571
APW_FRE        72   11006    68429
TOTAL         241   34729   224000
type="story":
AFP_FRE      2848  442284  1587696
APW_FRE       866  136481   518223
TOTAL        3715  578765  2105919
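
As a quick consistency check, the per-type document counts can be summed and compared with the overall "#DOCs" figures given earlier. The Python sketch below copies the numbers from the two tables; the names `DOCS_BY_TYPE`, `OVERALL_DOCS` and `docs_match` are hypothetical. Only "#DOCs" is checked, because the "Text-MB" and "K-wrds" columns appear to be rounded and need not add up exactly.

```python
# #DOCs per source and DOC type, copied from the breakdown table.
# The "multi" row with 29797 documents is read as APW_FRE, which is
# what the TOTAL row for that type implies.
DOCS_BY_TYPE = {
    "AFP_FRE": {"advis": 47044, "multi": 6828, "other": 155571, "story": 1587696},
    "APW_FRE": {"advis": 6291, "multi": 29797, "other": 68429, "story": 518223},
}

# Overall #DOCs per source, copied from the summary table.
OVERALL_DOCS = {"AFP_FRE": 1797139, "APW_FRE": 622740}


def docs_match(by_type, overall):
    """Return, per source, whether the per-type counts sum to the overall count."""
    return {src: sum(by_type[src].values()) == overall[src] for src in overall}


print(docs_match(DOCS_BY_TYPE, OVERALL_DOCS))
```

For the figures above, both sources match their overall document counts.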

Samples

For an example of the data in this corpus, please view this image of the text of French Gigaword.

Content Copyright

Portions © 1994-2006 Agence France-Presse, © 1994-2006 The Associated Press, © 2006 Trustees of the University of Pennsylvania