Spanish Gigaword Second Edition

Item Name:	Spanish Gigaword Second Edition
Author(s):	Ângelo Mendonça, David Graff, Denise DiPersio
LDC Catalog No.:	LDC2009T21
ISBN:	1-58563-518-9
ISLRN:	202-219-770-615-1
DOI:	https://doi.org/10.35111/hwap-pf44
Release Date:	July 17, 2009
Member Year(s):	2009
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	TIDES, GALE, EARS
Application(s):	natural language processing, language modeling, information retrieval
Language(s):	Spanish
Language ID(s):	spa
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2009T21 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Mendonça, Ângelo, David Graff, and Denise DiPersio. Spanish Gigaword Second Edition LDC2009T21. Web Download. Philadelphia: Linguistic Data Consortium, 2009.
Related Works:	isVersionOf LDC2006T12 Spanish Gigaword First Edition; hasVersion LDC2011T12 Spanish Gigaword Third Edition; isSimilarWith LDC95T9 Spanish News Text, LDC99T41 Spanish Newswire Text, Volume 2

Introduction

Spanish Gigaword Second Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC. This second edition updates Spanish Gigaword First Edition (LDC2006T12) and adds data collected from January 1, 2006 through December 31, 2008.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008

The seven-character codes in the parentheses above consist of a three-character source name abbreviation and the three-character language code (spa), separated by an underscore (_) character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard. These codes are used in the directory names where the data files are found and in the prefix that appears at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC id strings that uniquely identify each news story.
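The naming convention described above can be illustrated with a minimal Python sketch. The function names and the sample DOC id are hypothetical and chosen only for illustration; the description above specifies only that the upper-case code forms the initial portion of each DOC id, so nothing after that prefix is assumed here.

```python
KNOWN_CODES = ("afp_spa", "apw_spa", "xin_spa")


def parse_source_code(code):
    """Split a code such as 'afp_spa' into its source and language parts."""
    source, sep, language = code.partition("_")
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code: {code!r}")
    return {
        "source": source,
        "language": language,
        # DOC ids start with the same code in upper case.
        "doc_id_prefix": code.upper(),
    }


def source_of_doc_id(doc_id, known_codes=KNOWN_CODES):
    """Return the lower-case code whose upper-case form starts the DOC id."""
    for code in known_codes:
        if doc_id.startswith(code.upper()):
            return code
    raise ValueError(f"no known source code at the start of {doc_id!r}")


# The id below is a made-up example, not taken from the corpus.
print(parse_source_code("afp_spa"))
print(source_of_doc_id("XIN_SPA_EXAMPLE_ID"))
```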

Data

The overall totals for each source are summarized below. The Totl-MB numbers show the amount of data obtained when the files are uncompressed (approximately 7 gigabytes in total). The Gzip-MB column shows the totals for compressed file sizes. The K-wrds numbers are the number of whitespace-separated tokens (of all types), in thousands, after all SGML tags are eliminated.

Source	#Files	Gzip-MB	Totl-MB	K-wrds	#DOCs
AFP_SPA	175	1182	3512	506562	1748787
APW_SPA	180	886	2721	402718	1244811
XIN_SPA	88	405	1238	182543	734356
TOTAL	443	2453	7471	1091823	3727954
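The token count behind the K-wrds column can be approximated with a short Python sketch: remove the SGML tags, then count whitespace-separated tokens. The regular expression, the function name, and the sample markup are assumptions made for illustration; the exact procedure LDC used is not given here. Tags are replaced with a space rather than deleted outright so that words on either side of a tag are not merged.

```python
import re

# Assumption: a tag is anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text):
    """Count whitespace-separated tokens after removing SGML tags."""
    text = TAG_RE.sub(" ", sgml_text)
    return len(text.split())


# Hypothetical sample markup, not copied from the corpus.
sample = "<DOC><TEXT>\n<P>\nHola mundo , esto es una prueba .\n</P>\n</TEXT></DOC>"
tokens = count_tokens(sample)
print(tokens)          # number of tokens
print(tokens / 1000)   # the same count in thousands, as in the K-wrds column
```

Punctuation that stands alone between spaces counts as a token in this sketch, in line with "tokens (of all types)".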

The following tables present Text-MB, K-wrds and #DOCs broken down by source and DOC type. Text-MB represents the total number of characters (including whitespace) after SGML tags are eliminated.

Source	Text-MB	K-wrds	#DOCs
type=advis:
AFP_SPA	144	20520	45446
APW_SPA	41	6173	11112
XIN_SPA	0	0	0
TOTAL	185	26693	56558
type=multi:
AFP_SPA	84	12711	15346
APW_SPA	351	55758	107224
XIN_SPA	189	29970	56372
TOTAL	624	98439	178942
type=other:
AFP_SPA	275	38665	160815
APW_SPA	296	40517	162448
XIN_SPA	44	6376	50168
TOTAL	615	85558	373431
type=story:
AFP_SPA	2771	434677	1527180
APW_SPA	1875	300274	964027
XIN_SPA	911	146199	627816
TOTAL	5557	881150	3119023
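As a quick arithmetic check, the per-type #DOCs figures can be summed for each source and compared with the overall #DOCs column in the first table. The sketch below copies the numbers from the tables; the variable and function names are illustrative only.

```python
# #DOCs per DOC type and source, copied from the breakdown tables.
DOCS_BY_TYPE = {
    "advis": {"AFP_SPA": 45446, "APW_SPA": 11112, "XIN_SPA": 0},
    "multi": {"AFP_SPA": 15346, "APW_SPA": 107224, "XIN_SPA": 56372},
    "other": {"AFP_SPA": 160815, "APW_SPA": 162448, "XIN_SPA": 50168},
    "story": {"AFP_SPA": 1527180, "APW_SPA": 964027, "XIN_SPA": 627816},
}

# #DOCs per source, copied from the overall totals table.
OVERALL_DOCS = {"AFP_SPA": 1748787, "APW_SPA": 1244811, "XIN_SPA": 734356}


def docs_per_source(by_type):
    """Sum the per-type document counts for each source."""
    totals = {}
    for counts in by_type.values():
        for source, n in counts.items():
            totals[source] = totals.get(source, 0) + n
    return totals


per_source = docs_per_source(DOCS_BY_TYPE)
print(per_source == OVERALL_DOCS)   # per-source sums match the overall table
print(sum(per_source.values()))     # grand total of documents
```

Both checks agree with the tables: each per-source sum equals the overall #DOCs value, and the grand total is 3727954.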

Samples

Please view this sample.
