Spanish Gigaword First Edition

Item Name:	Spanish Gigaword First Edition
Author(s):	David Graff
LDC Catalog No.:	LDC2006T12
ISBN:	1-58563-393-3
ISLRN:	683-827-849-463-2
DOI:	https://doi.org/10.35111/4kh9-er55
Release Date:	June 15, 2006
Member Year(s):	2006
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	EARS, GALE, TIDES
Application(s):	information retrieval, language modeling, natural language processing
Language(s):	Spanish
Language ID(s):	spa
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2006T12 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Graff, David. Spanish Gigaword First Edition LDC2006T12. Web Download. Philadelphia: Linguistic Data Consortium, 2006.
Related Works:	hasVersion: LDC2009T21 Spanish Gigaword Second Edition; LDC2011T12 Spanish Gigaword Third Edition. isSimilarWith: LDC95T9 Spanish News Text; LDC99T41 Spanish Newswire Text, Volume 2.

Introduction

Spanish Gigaword First Edition is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) and contains over 750 million tokens spanning approximately 2.7 million documents. Although this is the first edition of the Spanish Gigaword Corpus, some of the data included here has been released previously in other LDC corpora.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2005
Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2005
Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2005

Data

The overall totals for each source are summarized below. The "K-wrds" figures give the number of whitespace-separated tokens of all types, in thousands, counted after all SGML tags are removed.

Source	K-wrds	#DOCs
AFP_SPA	393354	1382679
APW_SPA	263225	886998
XIN_SPA	94459	388561
TOTAL	751038	2658238
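
The token definition behind the "K-wrds" column can be illustrated with a short Python sketch. This is not the script LDC used to produce the table; it only assumes that a tag is anything between "<" and ">" and that tokens are split on whitespace. The sample document, including its id and tag names, is hypothetical and is not taken from the corpus.

```python
import re

# Assumption: an SGML tag is any run of text enclosed in angle brackets.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so that text on either side of a tag
    # is not fused into a single token.
    without_tags = TAG_RE.sub(" ", sgml_text)
    return len(without_tags.split())


# Hypothetical sample document (id and content invented for illustration).
sample = (
    '<DOC id="EXAMPLE_0001" type="story">\n'
    "<HEADLINE>\nTitular de ejemplo\n</HEADLINE>\n"
    "<TEXT>\n<P>\nEsta es una frase de ejemplo.\n</P>\n</TEXT>\n"
    "</DOC>\n"
)

print(count_tokens(sample))
```

Punctuation attached to a word, as in "ejemplo.", stays part of that token, which is consistent with counting whitespace-separated tokens "of all types".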

Most of the text data (all of AFP_SPA, most of APW_SPA) were received at LDC via dedicated, 24-hour/day electronic feeds (leased phone lines in the case of APW_SPA, a local satellite dish for AFP_SPA). These 24-hour transmission services were all susceptible to "line noise" (occasional corruption of text content), as well as service outages both at the data source and at our receiving computers. Usually, the various disruptions of a newswire data stream would leave tell-tale evidence in the form of byte values falling outside the range of printable ASCII characters, or recognizable patterns of anomalous ASCII strings.
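
The kind of tell-tale evidence described above can be sketched as a simple byte scan. This is a rough illustration rather than LDC's actual cleanup procedure: the function name is hypothetical, and the choice to accept tab, line feed, and carriage return is an assumption. Accented Spanish letters stored as bytes above 0x7E would also be reported, so a practical filter would need to allow for the text encoding in use.

```python
from typing import List, Tuple

# Assumption: these control characters are normal in text files.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def find_suspect_bytes(raw: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) pairs for bytes outside printable ASCII."""
    return [
        (offset, value)
        for offset, value in enumerate(raw)
        # Printable ASCII runs from 0x20 (space) to 0x7E (tilde).
        if not (0x20 <= value <= 0x7E) and value not in ALLOWED_CONTROLS
    ]


# A NUL byte in the middle of a line is the sort of residue line noise leaves.
print(find_suspect_bytes(b"hola\x00mundo\n"))
```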

All the XIN_SPA data, and the portion of APW_SPA data beginning with June 2004 (200406), were received as bulk electronic text archives via internet retrieval. As such, they were not susceptible to modem line noise or related disruptions, though this does not guarantee that the source data are free of mishaps. More detailed information can be found in the included documentation.

Samples

For an example of the data in this publication, please examine this sample file.

Updates

None at this time.
