Spanish Gigaword First Edition


Item Name: Spanish Gigaword First Edition
Authors: David Graff
LDC Catalog No.: LDC2006T12
ISBN: 1-58563-393-3
Release Date: Jun 15, 2006
Data Type: text
Data Source(s): newswire
Project(s): EARS, GALE, TIDES
Application(s): information retrieval, language modeling, natural language processing
Language(s): Spanish
Distribution: 1 DVD
Member fee: $0 for 2006 members
Non-member Fee: US $3500.00
Reduced-License Fee: US $1750.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: David Graff. 2006. Spanish Gigaword First Edition. Linguistic Data Consortium, Philadelphia.

Introduction

This file contains documentation on the Spanish Gigaword First Edition, Linguistic Data Consortium (LDC) catalog number LDC2006T12 and ISBN 1-58563-393-3.

The Spanish Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the first edition of the Spanish Gigaword Corpus, though some of the data included here has been released previously in other LDC corpora.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

  • Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2005
  • Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2005
  • Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2005

The seven-character codes in parentheses above consist of a three-character source name abbreviation and the three-character language code ("spa"), separated by an underscore ("_"). The language code conforms to LDC's new internal convention, which is based on the new ISO 639-3 standard.

The seven-character codes are used both as the names of the directories where the data files are found and as the prefix that begins every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC "id" strings that uniquely identify each news story.
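
The naming convention described above can be illustrated with a minimal Python sketch. The helper names (split_source_code, source_code_of) and the sample names in the usage comments are hypothetical; only the three-plus-three-character shape, the underscore separator, and the upper-case use in DOC "id" strings come from this documentation.

```python
def split_source_code(code: str) -> tuple[str, str]:
    """Split a code such as 'afp_spa' into (source, language)."""
    source, sep, language = code.partition("_")
    if sep != "_" or len(source) != 3 or len(language) != 3:
        raise ValueError(f"not a seven-character source code: {code!r}")
    return source, language


def source_code_of(name: str) -> str:
    """Return the lower-case source code that begins a file name or DOC id.

    Assumes only that the first seven characters of `name` are the code;
    whatever follows the code is not interpreted here.
    """
    code = name[:7].lower()
    split_source_code(code)  # raises ValueError if the shape is wrong
    return code


# Usage (the names after the seven-character code are made up):
#   split_source_code("apw_spa")            gives ("apw", "spa")
#   source_code_of("XIN_SPA_example-id")    gives "xin_spa"
```

Lower-casing the prefix lets the same helper handle directory names, file names, and the upper-case DOC "id" strings.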

Data

The overall totals for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are uncompressed (i.e. approximately 5 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_SPA      139      926     2731  393354  1382679
APW_SPA      144      600     1806  263225   886998
XIN_SPA       52      212      648   94459   388561
TOTAL        335     1738     5185  751038  2658238
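
As a rough illustration of how a "K-wrds" style figure could be reproduced, the sketch below removes SGML tags with a regular expression and counts whitespace-separated tokens. It is an approximation under stated assumptions, not the procedure LDC used: the tag pattern is a simple heuristic, the function names are hypothetical, and the character encoding of the files is not stated in this documentation, so it must be supplied by the caller.

```python
import gzip
import re

# Heuristic: anything between '<' and the next '>' is treated as an SGML tag.
TAG = re.compile(r"<[^>]*>")


def strip_tags(sgml_text):
    """Remove every SGML tag, leaving the text and whitespace untouched."""
    return TAG.sub("", sgml_text)


def count_tokens(sgml_text):
    """Count whitespace-separated tokens (of all types) after tag removal."""
    return len(strip_tags(sgml_text).split())


def count_tokens_in_gzip(path, encoding):
    """Count tokens in one gzip-compressed data file.

    `encoding` is an assumption the caller has to make; it is not
    specified in this documentation.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return count_tokens(handle.read())
```

Dividing the token count by 1000 gives a value comparable to the "K-wrds" column. The whole file is read into memory here for simplicity.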

The following tables present "Text-MB", "K-wrds" and "#DOCs" broken down by source and DOC type; "Text-MB" represents the total number of characters (including whitespace) after SGML tags are eliminated.

          Text-MB  K-wrds    #DOCs
type="advis":
AFP_SPA        40   15505    40580
APW_SPA        11    6173    11112
XIN_SPA         0       0        0
TOTAL          51   21678    51692
type="multi":
AFP_SPA        12   10282    12514
APW_SPA        30   12519    30892
XIN_SPA        32   17773    32463
TOTAL          74   40574    75869
type="other":
AFP_SPA       126   28305   126530
APW_SPA       153   39038   153932
XIN_SPA        26    3325    26828
TOTAL         305   70668   307290
type="story":
AFP_SPA      2166  339271  1202785
APW_SPA      1287  205501   691062
XIN_SPA       463   73360   329270
TOTAL        3916  618132  2223117
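
A per-type breakdown like the one above could be approximated with the following Python sketch. It assumes that each news story is wrapped in a DOC element whose opening tag carries a type="..." attribute next to its "id"; the exact markup, the tally_by_type name, and the sample ids in the usage comment are assumptions for illustration, not details confirmed by this documentation.

```python
import re
from collections import defaultdict

# Assumed markup: <DOC ... type="..."> body </DOC>
DOC = re.compile(r"<DOC\b([^>]*)>(.*?)</DOC>", re.DOTALL)
TYPE = re.compile(r'type="([^"]*)"')
TAG = re.compile(r"<[^>]*>")


def tally_by_type(sgml_text):
    """Return {doc_type: {"docs", "words", "chars"}} for one SGML string."""
    totals = defaultdict(lambda: {"docs": 0, "words": 0, "chars": 0})
    for attrs, body in DOC.findall(sgml_text):
        match = TYPE.search(attrs)
        doc_type = match.group(1) if match else "unknown"
        text = TAG.sub("", body)           # eliminate the SGML tags
        entry = totals[doc_type]
        entry["docs"] += 1                 # contributes to "#DOCs"
        entry["words"] += len(text.split())  # whitespace-separated tokens
        entry["chars"] += len(text)        # characters, whitespace included
    return dict(totals)


# Usage with made-up content:
#   tally_by_type('<DOC id="AFP_SPA_x" type="story">\n<P>\nUno dos\n</P>\n</DOC>')
#   yields one "story" entry with docs == 1 and words == 2.
```

Summing the "chars" values over all files of a source and scaling to megabytes would give a figure comparable to "Text-MB", and "words" divided by 1000 corresponds to "K-wrds".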

Samples

For an example of the data in this publication, please examine this sample file.

Content Copyright

Portions © 1994-2005 Agence France Presse, © 1993-2005 The Associated Press, © 2001-2005 Xinhua News Agency, © 2006 Trustees of the University of Pennsylvania