Spanish Gigaword Second Edition


Item Name: Spanish Gigaword Second Edition
Authors: Ângelo Mendonça, David Graff, Denise DiPersio
LDC Catalog No.: LDC2009T21
ISBN: 1-58563-518-9
Release Date: Jul 17, 2009
Data Type: text
Data Source(s): newswire
Project(s): EARS, GALE, TIDES
Application(s): information retrieval, language modeling, natural language processing
Language(s): Spanish
Language ID(s): spa
Distribution: 1 DVD
Member fee: $0 for 2009 members
Non-member Fee: US $4000.00
Reduced-License Fee: US $2000.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Ângelo Mendonça, David Graff, Denise DiPersio. 2009. Spanish Gigaword Second Edition. Linguistic Data Consortium, Philadelphia.

Introduction

Spanish Gigaword Second Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC. This second edition updates Spanish Gigaword First Edition (LDC2006T12) and adds data collected from January 1, 2006 through December 31, 2008.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

  • Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
  • Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
  • Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008

The seven-character codes in parentheses above consist of a three-character source name abbreviation and the three-character language code (spa), separated by an underscore (_) character. The three-letter language code conforms to LDC's internal convention, which is based on the ISO 639-3 standard. These codes are used in the directory names where the data files are found and in the prefix that appears at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC id strings that uniquely identify each news story.
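As an illustration of the naming convention described above, the following minimal Python sketch splits a code into its source and language parts and derives the upper-case form used at the start of DOC id strings. The function names are hypothetical and chosen only for this example; the remainder of a DOC id string is not described here and is not modeled.

```python
def split_source_code(code: str) -> tuple[str, str]:
    """Split a code such as 'afp_spa' into (source, language)."""
    source, language = code.split("_")
    if len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected code: {code!r}")
    return source, language


def doc_id_prefix(code: str) -> str:
    """Return the upper-case form that begins each DOC id string."""
    return code.upper()


# The three source codes listed in this edition.
for code in ("afp_spa", "apw_spa", "xin_spa"):
    print(code, split_source_code(code), doc_id_prefix(code))
```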

Data

The overall totals for each source are summarized below. The Totl-MB numbers show the amount of data obtained when the files are uncompressed (approximately 7 gigabytes in total); the Gzip-MB column shows totals for compressed file sizes as stored on the DVD-ROM; and the K-wrds numbers are simply the number of whitespace-separated tokens (of all types), in thousands, after all SGML tags are eliminated.

Source    #Files  Gzip-MB  Totl-MB    K-wrds    #DOCs
AFP_SPA      175     1182     3512    506562  1748787
APW_SPA      180      886     2721    402718  1244811
XIN_SPA       88      405     1238    182543   734356
TOTAL        443     2473     7471   1091823  3727954
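The K-wrds measure (whitespace-separated tokens after SGML tags are eliminated) can be approximated with a short Python sketch. This is not the tooling LDC used: the regular expression is a simplification that treats anything between `<` and `>` as a tag, and the sample markup, including the `TEXT` and `P` element names and the placeholder id, is an assumption made only for illustration.

```python
import re

# Simplified tag pattern: anything between '<' and '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so adjacent words are not merged.
    text = TAG_RE.sub(" ", sgml_text)
    return len(text.split())


# Hypothetical sample document; the markup is illustrative only.
sample = (
    '<DOC id="EXAMPLE" type="story">\n'
    "<TEXT>\n<P>\nHola mundo , esto es una prueba .\n</P>\n</TEXT>\n"
    "</DOC>"
)
print(count_tokens(sample))  # prints 8
```

Dividing the resulting count by 1000 gives a figure on the same scale as the K-wrds column.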

The following tables present Text-MB, K-wrds and #DOCs broken down by source and DOC type. Text-MB represents the total number of characters (including whitespace) after SGML tags are eliminated.

          Text-MB   K-wrds    #DOCs
type=advis:
AFP_SPA       144    20520    45446
APW_SPA        41     6173    11112
XIN_SPA         0        0        0
TOTAL         185    26693    56558
type=multi:
AFP_SPA        84    12711    15346
APW_SPA       351    55758   107224
XIN_SPA       189    29970    56372
TOTAL         624    98439   178942
type=other:
AFP_SPA       275    38665   160815
APW_SPA       296    40517   162448
XIN_SPA        44     6376    50168
TOTAL         615    85558   373431
type=story:
AFP_SPA      2771   434677  1527180
APW_SPA      1875   300274   964027
XIN_SPA       911   146199   627816
TOTAL        5557   881150  3119023
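The per-type #DOCs figures can be cross-checked against the overall #DOCs totals per source. The sketch below copies the document counts from the tables above into Python dictionaries (the variable names are chosen for this example only) and confirms that the four DOC types add up to each source's overall count.

```python
# Overall #DOCs per source, from the summary table.
overall_docs = {"AFP_SPA": 1748787, "APW_SPA": 1244811, "XIN_SPA": 734356}

# #DOCs per DOC type and source, from the breakdown tables.
docs_by_type = {
    "advis": {"AFP_SPA": 45446, "APW_SPA": 11112, "XIN_SPA": 0},
    "multi": {"AFP_SPA": 15346, "APW_SPA": 107224, "XIN_SPA": 56372},
    "other": {"AFP_SPA": 160815, "APW_SPA": 162448, "XIN_SPA": 50168},
    "story": {"AFP_SPA": 1527180, "APW_SPA": 964027, "XIN_SPA": 627816},
}

# Sum the four DOC types for each source.
summed = {
    source: sum(counts[source] for counts in docs_by_type.values())
    for source in overall_docs
}

for source, total in overall_docs.items():
    print(source, summed[source], summed[source] == total)
```

The K-wrds and Text-MB columns are rounded figures, so their per-type sums may differ slightly from the overall totals; the #DOCs counts are exact and match.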

Samples

Please view this sample.

Content Copyright

Portions © 1994-2008 Agence France Presse, © 1993-2008 The Associated Press, © 2001-2008 Xinhua News Agency, © 2006, 2009 Trustees of the University of Pennsylvania