Spanish Gigaword First Edition

Item Name: Spanish Gigaword First Edition
Author(s): David Graff
LDC Catalog No.: LDC2006T12
ISBN: 1-58563-393-3
ISLRN: 683-827-849-463-2
Release Date: June 15, 2006
Member Year(s): 2006
DCMI Type(s): Text
Data Source(s): newswire
Project(s): TIDES, GALE, EARS
Application(s): natural language processing, language modeling, information retrieval
Language(s): Spanish
Language ID(s): spa
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2006T12 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Graff, David. Spanish Gigaword First Edition LDC2006T12. Web Download. Philadelphia: Linguistic Data Consortium, 2006.


This file contains documentation on the Spanish Gigaword First Edition, Linguistic Data Consortium (LDC) catalog number LDC2006T12 and ISBN 1-58563-393-3.

The Spanish Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the first edition of the Spanish Gigaword Corpus, though some of the data included here has been released previously in other LDC corpora.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

  • Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2005
  • Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2005
  • Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2005

The seven-character codes in parentheses above consist of a three-character source name abbreviation and the three-character language code ("spa"), separated by an underscore ("_") character. The language code conforms to LDC's new internal convention, which is based on the new ISO 639-3 standard.

These codes are used both in the names of the directories where the data files are found and as the prefix at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC "id" strings that uniquely identify each news story.
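
The naming convention described above can be sketched in a few lines of Python. This is only an illustration: the helper names (split_code, source_of_doc_id) and the sample DOC id are hypothetical, and nothing beyond the upper-case prefix of a DOC id is assumed here.

```python
import re

# The three source codes listed in this documentation.
SOURCE_CODES = ("afp_spa", "apw_spa", "xin_spa")


def split_code(code):
    """Split a code such as 'afp_spa' into (source, language)."""
    match = re.fullmatch(r"([a-z]{3})_([a-z]{3})", code)
    if match is None:
        raise ValueError(f"not a source code: {code!r}")
    return match.group(1), match.group(2)


def source_of_doc_id(doc_id):
    """Return the lower-case source code that a DOC id starts with, or None."""
    for code in SOURCE_CODES:
        if doc_id.startswith(code.upper()):
            return code
    return None


print(split_code("afp_spa"))
# The id below is made up for illustration; only its prefix follows the text.
print(source_of_doc_id("XIN_SPA_EXAMPLE"))
```

Matching on the upper-case prefix alone keeps the sketch independent of whatever follows the code in real DOC ids or file names.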


The overall totals for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are uncompressed (i.e., approximately 5 gigabytes in total); the "Gzip-MB" column shows totals for compressed file sizes; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types), in thousands, after all SGML tags are eliminated.

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_SPA      139      926     2731  393354  1382679
APW_SPA      144      600     1806  263225   886998
XIN_SPA       52      212      648   94459   388561
TOTAL        335     1738     5185  751038  2658238
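
As a quick sanity check on the summary table, the per-source rows can be added up and compared with the TOTAL row. The sketch below uses only the figures printed above; the derived quantities (expansion ratio and tokens per DOC) are illustrative calculations, not values stated in this documentation.

```python
# Rows copied from the summary table:
# (#Files, Gzip-MB, Totl-MB, K-wrds, #DOCs)
ROWS = {
    "AFP_SPA": (139, 926, 2731, 393354, 1382679),
    "APW_SPA": (144, 600, 1806, 263225, 886998),
    "XIN_SPA": (52, 212, 648, 94459, 388561),
}
TOTAL = (335, 1738, 5185, 751038, 2658238)

# Sum each column across the three sources and compare with the TOTAL row.
column_sums = tuple(sum(column) for column in zip(*ROWS.values()))
totals_match = column_sums == TOTAL
print("column sums match TOTAL row:", totals_match)

for name, (files, gzip_mb, total_mb, k_words, docs) in ROWS.items():
    expansion = total_mb / gzip_mb          # uncompressed size vs. gzip size
    tokens_per_doc = k_words * 1000 / docs  # K-wrds is in thousands of tokens
    print(f"{name}: {expansion:.2f}x expansion, "
          f"about {tokens_per_doc:.0f} tokens per DOC")
```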

The following tables present "Text-MB", "K-wrds" and "#DOCs" broken down by source and DOC type; "Text-MB" represents the total number of characters (including whitespace) after SGML tags are eliminated.

Source    Text-MB  K-wrds    #DOCs
AFP_SPA        40   15505    40580
APW_SPA        11    6173    11112
XIN_SPA         0       0        0
TOTAL          51   21678    51692

AFP_SPA        12   10282    12514
APW_SPA        30   12519    30892
XIN_SPA        32   17773    32463
TOTAL          74   40574    75869

AFP_SPA       126   28305   126530
APW_SPA       153   39038   153932
XIN_SPA        26    3325    26828
TOTAL         305   70668   307290

AFP_SPA      2166  339271  1202785
APW_SPA      1287  205501   691062
XIN_SPA       463   73360   329270
TOTAL        3916  618132  2223117
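
The "K-wrds" and "Text-MB" measures defined above can be approximated with a short Python sketch. It rests on assumptions: tags are removed with a simple regular expression, character entities are left untouched, and the sample markup (including its id and type values) is invented for illustration rather than taken from the corpus. The exact procedure used to produce the published figures is not described here.

```python
import re

# Assumption: anything between "<" and ">" is treated as an SGML tag.
TAG = re.compile(r"<[^>]+>")


def strip_tags(sgml):
    """Remove SGML tags, keeping all other characters (including whitespace)."""
    return TAG.sub("", sgml)


def count_tokens(sgml):
    """Number of whitespace-separated tokens once tags are removed."""
    return len(strip_tags(sgml).split())


def count_chars(sgml):
    """Number of characters, whitespace included, once tags are removed."""
    return len(strip_tags(sgml))


# Invented sample; it only mimics the general shape of tagged newswire text.
SAMPLE = (
    '<DOC id="XIN_SPA_EXAMPLE" type="example">\n'
    "<TEXT>\n<P>\nHola mundo, esto es una prueba.\n</P>\n</TEXT>\n"
    "</DOC>\n"
)

print("tokens:", count_tokens(SAMPLE))
print("characters:", count_chars(SAMPLE))
```

Dividing the token count by 1,000 and the character count by the number of characters per megabyte would give figures on the same scale as the "K-wrds" and "Text-MB" columns.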


For an example of the data in this publication, please examine this sample file.
