French Gigaword First Edition
Item Name: | French Gigaword First Edition |
Author(s): | David Graff |
LDC Catalog No.: | LDC2006T17 |
ISBN: | 1-58563-405-0 |
ISLRN: | 351-085-945-382-6 |
DOI: | https://doi.org/10.35111/n8na-xw24 |
Release Date: | November 17, 2006 |
Member Year(s): | 2006 |
DCMI Type(s): | Text |
Application(s): | natural language processing, language modeling, information retrieval |
Language(s): | French |
Language ID(s): | fra |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2006T17 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Graff, David. French Gigaword First Edition LDC2006T17. Web Download. Philadelphia: Linguistic Data Consortium, 2006. |
French Gigaword First Edition is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania.
The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:
- Agence France-Presse (afp_fre) May 1994 - July 2006
- Associated Press French Service (apw_fre) November 1994 - July 2006
The seven-character codes in parentheses consist of a three-character source name abbreviation and the three-character language code ("fre"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's new internal convention based on the ISO 639-3 standard.
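The naming convention described above can be illustrated with a small parser. This is only a sketch: the names `SourceCode` and `parse_source_code` are hypothetical and are not part of any LDC tooling.

```python
from typing import NamedTuple


class SourceCode(NamedTuple):
    """The two parts of a code such as 'afp_fre' (hypothetical helper type)."""
    source: str    # three-character source abbreviation, e.g. "afp"
    language: str  # three-character language code, e.g. "fre"


def parse_source_code(code: str) -> SourceCode:
    """Split a source code at the underscore and check the part lengths."""
    source, sep, language = code.lower().partition("_")
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code: {code!r}")
    return SourceCode(source, language)


print(parse_source_code("afp_fre"))
print(parse_source_code("APW_FRE"))
```

Lowercasing the input lets the same function accept both the lowercase form used in the source list and the uppercase form used in the tables.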
The overall totals for each source are summarized below. The "Totl-MB" column shows the amount of data you get when the files are uncompressed (approximately 4.6 gigabytes in total); the "Gzip-MB" column shows totals for compressed file sizes; the "K-wrds" column gives the number of whitespace-separated tokens (of all types), in thousands, after all SGML tags are eliminated.
Source | #Files | Gzip-MB | Totl-MB | K-wrds | #DOCs |
AFP_FRE | 147 | 1139 | 3445 | 482904 | 1797139 |
APW_FRE | 141 | 389 | 1167 | 167405 | 622740 |
TOTAL | 288 | 1528 | 4612 | 650309 | 2419879 |
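The "K-wrds" definition above (whitespace-separated tokens after SGML tags are eliminated) can be approximated as follows. This is a minimal sketch, not the procedure LDC used: the regular expression treats anything between "<" and ">" as a tag, and the sample markup is a made-up illustration rather than an excerpt from the corpus.

```python
import re

# Assumption: every tag is a run of characters enclosed in "<" and ">".
TAG_RE = re.compile(r"<[^>]*>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens once SGML tags are removed."""
    # Replace each tag with a space so that words on either side of a tag
    # are not fused into a single token.
    return len(TAG_RE.sub(" ", sgml_text).split())


# Hypothetical sample document, for illustration only.
sample = (
    '<DOC id="example" type="story">\n'
    "<TEXT>\n"
    "<P>\n"
    "Le premier ministre a annoncé une réforme.\n"
    "</P>\n"
    "</TEXT>\n"
    "</DOC>\n"
)

print(count_tokens(sample))  # 7 tokens in the sample sentence
```

Dividing such a count by 1,000 gives a figure comparable to the "K-wrds" column.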
The following tables present "Text-MB", "K-wrds" and "#DOCs" broken down by source and DOC type; "Text-MB" represents the total number of characters (including whitespace) after SGML tags are eliminated.
Source | Text-MB | K-wrds | #DOCs |
type="advis": | | | |
AFP_FRE | 79 | 10924 | 47044 |
APW_FRE | 8 | 1381 | 6291 |
TOTAL | 87 | 12305 | 53335 |
type="multi": | | | |
AFP_FRE | 40 | 5964 | 6828 |
APW_FRE | 118 | 18527 | 29797 |
TOTAL | 158 | 24491 | 36625 |
type="other": | | | |
AFP_FRE | 169 | 23723 | 155571 |
APW_FRE | 72 | 11006 | 68429 |
TOTAL | 241 | 34729 | 224000 |
type="story": | | | |
AFP_FRE | 2848 | 442284 | 1587696 |
APW_FRE | 866 | 136481 | 518223 |
TOTAL | 3715 | 578765 | 2105919 |
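As a quick consistency check, the per-type "#DOCs" figures for each source should add up to that source's overall "#DOCs" in the summary table. The sketch below copies the numbers from the tables above; the names `DOCS_BY_TYPE`, `OVERALL_DOCS` and `docs_mismatch` are hypothetical. The rounded columns ("Text-MB" and "K-wrds") are left out because rounded values need not sum exactly.

```python
# #DOCs per DOC type, copied from the per-type tables.
DOCS_BY_TYPE = {
    "AFP_FRE": {"advis": 47044, "multi": 6828, "other": 155571, "story": 1587696},
    "APW_FRE": {"advis": 6291, "multi": 29797, "other": 68429, "story": 518223},
}

# Overall #DOCs per source, copied from the summary table.
OVERALL_DOCS = {"AFP_FRE": 1797139, "APW_FRE": 622740}


def docs_mismatch(source: str) -> int:
    """Return the per-type sum minus the overall count (0 means consistent)."""
    return sum(DOCS_BY_TYPE[source].values()) - OVERALL_DOCS[source]


for source in DOCS_BY_TYPE:
    print(source, docs_mismatch(source))  # prints 0 for both sources
```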
Samples
For an example of the data in this corpus, please view this image of the text of French Gigaword.