Spanish Gigaword First Edition
|Item Name:||Spanish Gigaword First Edition|
|LDC Catalog No.:||LDC2006T12|
|Release Date:||June 15, 2006|
|Project(s):||TIDES, GALE, EARS|
|Application(s):||natural language processing, language modeling, information retrieval|
LDC User Agreement for Non-Members
|Online Documentation:||LDC2006T12 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Graff, David. Spanish Gigaword First Edition LDC2006T12. Web Download. Philadelphia: Linguistic Data Consortium, 2006.|
This file contains documentation on the Spanish Gigaword First Edition, Linguistic Data Consortium (LDC) catalog number LDC2006T12 and ISBN 1-58563-393-3.
The Spanish Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the first edition of the Spanish Gigaword Corpus, though some of the data included here has been released previously in other LDC corpora.
The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:
- Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2005
- Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2005
- Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2005
The seven-character codes in parentheses above consist of a three-character source name abbreviation and the three-character language code ("spa"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's new internal convention based on the new ISO 639-3 standard.
The seven-character codes are used both in the names of the directories where the data files are found and as the prefix at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC "id" strings that uniquely identify each news story.
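The naming convention above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sample DOC "id" used in the usage note is made up, since only its initial upper-case portion is described here.

```python
# Source codes and names as listed in this documentation.
KNOWN_SOURCES = {
    "afp_spa": "Agence France-Presse, Spanish Service",
    "apw_spa": "Associated Press Worldstream, Spanish",
    "xin_spa": "Xinhua News Agency, Spanish Service",
}


def split_source_code(code):
    """Split a code such as 'afp_spa' into (source, language)."""
    source, sep, language = code.lower().partition("_")
    if sep != "_" or len(source) != 3 or len(language) != 3:
        raise ValueError("not a three-plus-three character code: %r" % code)
    return source, language


def doc_id_prefix(code):
    """Return the upper-case form that begins every DOC "id" for a source."""
    return code.upper()


def source_of_doc_id(doc_id):
    """Return the known source code that a DOC "id" starts with, or None."""
    for code in KNOWN_SOURCES:
        if doc_id.startswith(doc_id_prefix(code)):
            return code
    return None
```

For example, `split_source_code("apw_spa")` returns `("apw", "spa")`, and `source_of_doc_id("XIN_SPA_example")` returns `"xin_spa"`; the text after the prefix in that second call is a placeholder, not a real identifier.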
The overall totals for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are uncompressed (i.e. approximately 5 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.
The following tables present "Text-MB", "K-wrds" and "#DOCS" broken down by source and DOC type; "Text-MB" represents the total number of characters (including whitespace) after SGML tags are eliminated.
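As a rough illustration of how the "K-wrds" and "Text-MB" quantities are defined, the sketch below strips SGML tags and then counts whitespace-separated tokens and characters. It rests on assumptions: the exact procedure used to produce the published figures is not specified here, the regular expression treats anything between "<" and ">" as a tag, and the function names are hypothetical.

```python
import re

# Assumption: a tag is any run of characters enclosed in angle brackets.
TAG_RE = re.compile(r"<[^>]+>")


def strip_sgml_tags(text):
    """Remove SGML tags, leaving the text content and its whitespace."""
    return TAG_RE.sub("", text)


def count_tokens(text):
    """Count whitespace-separated tokens of all types after tag removal."""
    return len(strip_sgml_tags(text).split())


def count_text_chars(text):
    """Count characters, including whitespace, after tag removal."""
    return len(strip_sgml_tags(text))


# A made-up story for illustration; the id value and inner tags are invented.
sample = '<DOC id="EXAMPLE">\n<TEXT>\nHola mundo.\n</TEXT>\n</DOC>\n'
print(count_tokens(sample), count_text_chars(sample))
```

Dividing the token count by 1,000 and the character count by the number of bytes per megabyte would give figures in the same units as the tables, though whether the published totals count characters or bytes for accented letters is not stated.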
For an example of the data in this publication, please examine this sample file.