
French Gigaword Second Edition

Item Name:	French Gigaword Second Edition
Author(s):	Ângelo Mendonça, David Graff, Denise DiPersio
LDC Catalog No.:	LDC2009T28
ISBN:	1-58563-528-6
ISLRN:	739-169-067-045-4
DOI:	https://doi.org/10.35111/5s4k-q428
Release Date:	November 20, 2009
Member Year(s):	2009
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	natural language processing, language modeling, information retrieval
Language(s):	French
Language ID(s):	fra
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2009T28 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Mendonça, Ângelo, David Graff, and Denise DiPersio. French Gigaword Second Edition LDC2009T28. Web Download. Philadelphia: Linguistic Data Consortium, 2009.
Related Works:	isVersionOf LDC2006T17 French Gigaword First Edition; hasVersion LDC2011T10 French Gigaword Third Edition

Introduction

French Gigaword Second Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC. This second edition updates French Gigaword First Edition (LDC2006T17) and adds material collected from August 1, 2006 through December 31, 2008.

The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:

Agence France-Presse (afp_fre) May 1994 - Dec 2008
Associated Press Worldstream, French (apw_fre) Nov 1994 - Dec 2008

The seven-character codes in parentheses consist of a three-character source name abbreviation and the three-character language code (fre), separated by an underscore (_) character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard. These codes are used in the names of the directories where the data files are found and in the prefix that appears at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC id strings that uniquely identify each news story.
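The naming convention described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sample DOC id in the usage note is an assumed format used only for illustration; the description above states only that DOC ids begin with the upper-case source code.

```python
from typing import List, Optional, Tuple


def parse_source_code(code: str) -> Tuple[str, str]:
    """Split a code such as 'afp_fre' into (source, language)."""
    source, sep, language = code.partition("_")
    # Both parts are three characters long, joined by one underscore.
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code: {code!r}")
    return source, language


def doc_id_prefix(code: str) -> str:
    """Return the upper-case form that starts the DOC id strings."""
    parse_source_code(code)  # validate the shape first
    return code.upper()


def source_of_doc_id(doc_id: str, known_codes: List[str]) -> Optional[str]:
    """Return the lower-case code whose upper-case form starts doc_id."""
    for code in known_codes:
        if doc_id.startswith(doc_id_prefix(code)):
            return code
    return None
```

For example, `parse_source_code("afp_fre")` yields `("afp", "fre")`, and `source_of_doc_id("AFP_FRE_19940512.0001", ["afp_fre", "apw_fre"])` yields `"afp_fre"` for that assumed id.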

Data

The overall totals for each source are summarized below. The Totl-MB numbers show the amount of data obtained when the files are uncompressed (approximately 5.8 gigabytes in total); the Gzip-MB column shows totals for compressed file sizes; and the K-wrds numbers are the number of whitespace-separated tokens (of all types), in thousands, after all SGML tags are eliminated.

Source	#Files	Gzip-MB	Totl-MB	K-wrds	#DOCs
AFP_FRE	172	2408	4079	560000	2060803
APW_FRE	171	2280	1719	241324	872573
TOTAL	343	4688	5789	801324	2933376
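As a rough illustration of how a K-wrds figure could be derived, the sketch below removes SGML tags and counts whitespace-separated tokens. It is not the procedure LDC used; the regular expression, the function names, and the sample markup (element names and DOC id) are assumptions made for the example.

```python
import re

# Assumption: a tag is anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def strip_sgml_tags(text: str) -> str:
    """Remove SGML tags, leaving only the text content."""
    return TAG_RE.sub("", text)


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens of all types after tag removal."""
    return len(strip_sgml_tags(text).split())


# Hypothetical sample document; the real markup may differ.
sample = (
    '<DOC id="AFP_FRE_19940512.0001" type="story">\n'
    "<HEADLINE>\nExemple de titre\n</HEADLINE>\n"
    "<TEXT>\n<P>\nCeci est un exemple.\n</P>\n</TEXT>\n"
    "</DOC>\n"
)

tokens = count_tokens(sample)
k_words = tokens / 1000  # K-wrds is expressed in thousands of tokens
```

Because tags are deleted rather than replaced by a space, this sketch relies on tags being set off from the text by whitespace or line breaks, as they are in the sample.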

The following tables present Text-MB, K-wrds and #DOCs broken down by source and DOC type. Text-MB represents the total number of characters (including whitespace) after SGML tags are eliminated.

Source	Text-MB	K-wrds	#DOCs
type=advis:
AFP_FRE	88	11788	48712
APW_FRE	14	2303	9235
TOTAL	103	14091	57947
type=multi:
AFP_FRE	59	8411	10269
APW_FRE	194	29828	52240
TOTAL	253	38239	62509
type=other:
AFP_FRE	178	58514	8411
APW_FRE	82	193981	29828
TOTAL	260	38239	38239
type=story:
AFP_FRE	1824	198440	27216
APW_FRE	729	87662	13006
TOTAL	2553	286102	40222
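A per-type breakdown like the one above could be tallied along the following lines. This is a sketch under stated assumptions: it supposes each document is wrapped in a DOC element carrying a `type` attribute, and the sample ids and markup are hypothetical. Characters are counted after tag removal, matching the Text-MB definition; converting the character total to megabytes is left out because the divisor used for the published figures is not stated.

```python
import re
from collections import defaultdict
from typing import Dict

# Assumption: documents look like <DOC id="..." type="...">...</DOC>.
DOC_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"[^>]*>(.*?)</DOC>', re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def tally_by_type(sgml: str) -> Dict[str, Dict[str, int]]:
    """Sum characters, tokens and document counts for each DOC type."""
    totals = defaultdict(lambda: {"chars": 0, "tokens": 0, "docs": 0})
    for doc_type, body in DOC_RE.findall(sgml):
        text = TAG_RE.sub("", body)  # eliminate SGML tags
        entry = totals[doc_type]
        entry["chars"] += len(text)           # includes whitespace
        entry["tokens"] += len(text.split())  # whitespace-separated
        entry["docs"] += 1
    return dict(totals)


# Hypothetical two-document sample.
sample_docs = (
    '<DOC id="AFP_FRE_1" type="story">\n<TEXT>\nun deux trois\n</TEXT>\n</DOC>\n'
    '<DOC id="AFP_FRE_2" type="advis">\n<TEXT>\nquatre cinq\n</TEXT>\n</DOC>\n'
)

result = tally_by_type(sample_docs)
```

Here `result["story"]` reports one document with three tokens, and `result["advis"]` one document with two tokens.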

The data has undergone a consistent level of quality control to eliminate out-of-band content and other obvious forms of corruption. Since the source data is generated manually on a daily basis, a small percentage of human errors is common to all sources: missing whitespace, incorrect or variant spellings, badly formed sentences, and so on, as are normally seen in newspapers. No attempt has been made to correct these errors.

Samples

For an example of the data in this corpus, please view this image of the text of French Gigaword.
