README File for the SPANISH GIGAWORD TEXT
================================================
Third Edition
=============

INTRODUCTION
------------

Spanish Gigaword is a comprehensive archive of newswire text data that
has been acquired over several years by the Linguistic Data Consortium
(LDC) at the University of Pennsylvania. This is the third edition of
the Spanish Gigaword Corpus, though some of the data included here has
been released previously in the first two editions and in other LDC
corpora.

The three distinct international sources of Spanish newswire in this
edition, and the time spans of collection covered for each, are as
follows:

  - Agence France-Presse, Spanish Service (afp_spa)  May 1994 - Dec 2010
  - Associated Press Worldstream, Spanish (apw_spa)  Nov 1993 - Dec 2010
  - Xinhua News Agency, Spanish Service   (xin_spa)  Sep 2001 - Dec 2010

The seven-letter codes in the parentheses above include the
three-character source name abbreviations and the three-character
language code ("spa"), separated by an underscore ("_") character. The
three-letter language code conforms to LDC's internal convention based
on the ISO 639-3 standard. The seven-letter codes are used in the
directory names where the data files are found, in the prefix that
appears at the beginning of every data file name, and (in all UPPER
CASE) as the initial portion of the DOC "id" strings that uniquely
identify each news story.

DATA FORMAT AND SGML MARKUP
---------------------------

Each data file name consists of the 7-letter prefix plus another
underscore character, followed by a 6-digit date (representing the year
and month during which the file contents were generated by the
respective news source), followed by a ".gz" file extension, indicating
that the file contents have been compressed using the GNU "gzip"
compression utility (RFC 1952). So, each file contains all the usable
data received by LDC for the given month from the given news source.

All text data are presented in SGML/XML form, using a very simple,
minimal markup structure; all text consists of printable ASCII,
whitespace, and printable code points in the "Latin1 Supplement"
character table, as defined by both ISO-8859-1 and the Unicode Standard
(ISO 10646), for the "accented" characters used in Spanish. The
Supplement/accented characters are rendered using UTF-8 encoding.

The file "gigaword_s.dtd" in the "dtd" directory provides the formal
"Document Type Declaration" for parsing the SGML content. The corpus
has been fully validated by a standard SGML parser utility (nsgmls),
using this DTD file.

The markup structure, common to all data files, can be summarized as
follows:

  <DOC id="AFP_SPA_19951231.0001" type="story" >
  <HEADLINE>
  The Headline Element is Optional -- not all DOCs have one
  </HEADLINE>
  <DATELINE>
  The Dateline Element is Optional -- not all DOCs have one
  </DATELINE>
  <TEXT>
  <P>
  Paragraph tags are only used if the 'type' attribute of the DOC
  happens to be "story" -- more on the 'type' attribute below...
  </P>
  </TEXT>
  </DOC>
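
Given this uniform structure, a monthly data file can be processed with
a simple streaming reader. The following minimal Python sketch is not
part of the corpus distribution, and the file name "afp_spa_199405.gz"
is merely an illustration of the naming convention described above:

  import gzip

  def iter_docs(path):
      """Yield one complete DOC element at a time from a data file."""
      doc_lines = []
      with gzip.open(path, mode="rt", encoding="utf-8") as f:
          for line in f:
              doc_lines.append(line)
              if line.startswith("</DOC>"):   # closing tag ends the DOC
                  yield "".join(doc_lines)
                  doc_lines = []

  for doc in iter_docs("afp_spa_199405.gz"):
      print(doc.splitlines()[0])              # e.g. <DOC id="..." ...>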

Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.
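
When running text is needed (e.g. for sentence processing), the wrapped
lines within a single paragraph can simply be rejoined. A small sketch,
assuming the lines between one <P> and </P> pair have been collected:

  def unwrap(paragraph_lines):
      """Rejoin ~80-column wrapped lines into one running-text string."""
      return " ".join(line.strip() for line in paragraph_lines)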

For every "opening" tag (DOC, HEADLINE, DATELINE, TEXT, P), there is a
corresponding "closing" tag -- always. The attribute values in the DOC
tag are always presented within double-quotes; the "id=" attribute of
DOC consists of the 7-letter source/language abbreviation (in CAPS), an
underscore, an 8-digit date string representing the date of the story
(YYYYMMDD), a period, and a 4-digit sequence number starting at "0001"
for each date (e.g. "AFP_SPA_19951231.0001"); in this way, every DOC in
the corpus is uniquely identifiable by its id string. (Note that there
are a few DOC tags in this release where the final sequence number in
the "id" value is 5 digits -- e.g. APW_SPA_20080607.18063.)

In some cases, we assigned a sequence number to a document and later
found that the document was empty or very noisy. In such cases, we
removed the document from the collection but did not reassign sequence
numbers for the remaining documents of the same day. As a result, there
may be some gaps in the sequence numbers.

Every SGML tag is presented alone on one line, separate from other tags
and from the text content (so a simple process like the UNIX
"grep -v '<'" will eliminate all tags and retain all the text content).
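
As an illustration (not part of the corpus tools), both conventions are
easy to exploit in Python; the regular expression below simply encodes
the id format just described, including the occasional 5-digit sequence
number, and strip_tags() is the equivalent of "grep -v '<'":

  import re

  # 7-letter source/language code in CAPS, underscore, YYYYMMDD date,
  # period, 4-digit (occasionally 5-digit) sequence number.
  DOC_ID = re.compile(r"^([A-Z]{3}_SPA)_(\d{8})\.(\d{4,5})$")

  m = DOC_ID.match("AFP_SPA_19951231.0001")
  source, date, seq = m.group(1), m.group(2), m.group(3)

  # Keep only text content: drop every line that carries a tag.
  def strip_tags(doc):
      return [ln for ln in doc.splitlines() if not ln.startswith("<")]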

The structure shown above represents some notable differences relative
to the markup strategy used in previous LDC publications of Spanish
newswire data; these are intended to facilitate bulk processing of the
present corpus. The major differences are:

- Earlier corpora usually organized the data as one file per day, or
  limited the average file size to one megabyte (MB). Typical
  compressed file sizes in the current corpus range from about 0.1 MB
  to about 10 MB; this equates to a range of about 0.5 to 30 MB per
  file when the data are uncompressed. In general, these files are not
  intended for use with interactive text editors or word processing
  software (though many such programs are likely to work reasonably
  well with these files). Rather, it's expected that the files will be
  used as input to programs that are geared to dealing with data in
  such quantities, for filtering, conditioning, indexing, statistical
  summary, etc.

- Earlier corpora tended to use different markup outlines (different
  tag sets) depending on the source of the data, because different
  sources came to us with different structural properties, and we had
  chosen to preserve these as much as possible (even though many
  elements of the delivered structure may have been meaningless for
  research use). The present corpus uses only the information structure
  that is common to all sources and serves a clear function: headline,
  dateline, and core news content (usually containing paragraphs). The
  "dateline" is a brief string typically found at the beginning of the
  first paragraph in each news story, giving the location the report is
  coming from, and sometimes the news service and/or date; since this
  content is not part of the initial sentence, we separate it from the
  first paragraph (this was not done in previous corpora).

- Earlier corpora tended to include "custom" SGML entity references,
  which were intended to preserve things like special punctuation or
  typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc). The
  present corpus uses only one SGML entity reference: "&amp;" (only the
  lower-case form is used), which represents the literal ampersand "&"
  character. All other specialized control characters have been
  filtered out, and unusual punctuation (such as the underscore
  character, used in APW_SPA to represent an "em-dash" character) has
  been converted to simple equivalents (e.g. hyphens).

- In earlier corpora, newswire data were presented as streams of
  undifferentiated "DOC" units; depending on the source and corpus,
  varying amounts of quality checking and filtering were done to
  eliminate noisy or unsuitable content (e.g. test messages). The
  preparation of text data for release in Gigaword still involves a
  number of source-specific checks and operations, owing to intrinsic
  differences among the sources. In fact, even for a single source, we
  tend to experience significant transitions in the nature of incoming
  data over the span of the collection; for example, APW_SPA switched
  from transmission via dedicated serial modem to a database-driven,
  XML-formatted internet delivery protocol in June 2004. However, now
  that we have standardized on a uniform markup format for all sources
  (in fact, the same format used for all Gigaword releases regardless
  of language), we are better able to provide a more uniform degree of
  quality control than in earlier corpora.
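
Since "&amp;" is the only entity reference used anywhere in the corpus,
restoring the literal text is a single substitution; a one-line sketch:

  # "&amp;" is the corpus's only SGML entity, so unescaping the text
  # content requires just one replacement.
  def unescape(text):
      return text.replace("&amp;", "&")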

For all of the documents in this corpus, we have applied a rudimentary
(and _approximate_) categorization of DOC units into four distinct
"types". The classification is indicated by the type="string" attribute
that is included in each opening DOC tag. The four types are:

* story : This is by far the most frequent type, and it represents the
  most typical newswire item: a coherent report on a particular topic
  or event, consisting of paragraphs and full sentences. As indicated
  above, the paragraph tag "<P>" is found only in DOCs of this type; in
  the other types described below, the text content is rendered with no
  additional tags or special characters -- just lines of tokens
  (strings of letters, numbers and punctuation) separated by
  whitespace.

* multi : This type of DOC contains a series of unrelated "blurbs",
  each of which briefly describes a particular topic or event; this is
  typically applied to DOCs that contain "summaries of today's news",
  "news briefs in ... (some general area like finance or sports)", and
  so on. Each paragraph-like blurb by itself is coherent and composed
  of sentences, but it does not bear any necessary relation of
  topicality or continuity relative to neighboring paragraphs.

* advis : (short for "advisory") These are DOCs which the news service
  addresses to news editors -- they are not intended for publication to
  the "end users" (the populations who read the news); as a result,
  DOCs of this type tend to contain obscure abbreviations and phrases,
  which are familiar to news editors but may be meaningless to the
  general public. We also find a lot of formulaic, repetitive content
  in DOCs of this type (contact phone numbers, etc).

* other : This represents DOCs that clearly do not fall into any of the
  above types -- in general, items of this type are intended for broad
  circulation (they are not advisories), they may be topically coherent
  (unlike "multi" type DOCs), and they typically do not contain
  paragraphs or sentences (they aren't really "stories"); these are
  things like lists of sports scores, stock prices, temperatures around
  the world, and so on.

The general strategy for categorizing DOCs into these four classes was,
for each source, to discover the most common and frequent clues in the
text stream that correlated with the three "non-story" types, and to
apply the appropriate label for the type="..." attribute whenever the
DOC displayed one of these specific clues. When none of the known clues
was in evidence, the DOC was classified as a "story". This means that
the most frequent classification error will tend to be the use of
type="story" on DOCs that are actually some other type. But the number
of such errors should be fairly small, compared to the number of
"non-story" DOCs that are correctly tagged as such.

Note that the markup was applied algorithmically, using logic that was
based on less-than-complete knowledge of the data. For the most part,
the HEADLINE, DATELINE and TEXT tags have their intended content; but
due to the inherent variability (and the inevitable source errors) in
the data, users may find occasional mishaps where the headline and/or
dateline were not successfully identified (hence show up within TEXT),
or where an initial sentence or paragraph has been mistakenly tagged as
the headline or dateline.

DATA QUANTITIES
---------------

The "docs" directory contains a set of plain-text tables (datastats_*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the four "type" categories. The overall
totals for each source are summarized below. Note that the "Totl-MB"
numbers show the amount of data you get when the files are uncompressed
(i.e. approximately 8 gigabytes, total); the "Gzip-MB" column shows
totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds"
numbers are simply the number of whitespace-separated tokens (of all
types) after all SGML tags are eliminated.

  Source    #Files  Gzip-MB  Totl-MB    K-wrds      #DOCs
  AFP_SPA      199     1414     4104    578585    2006668
  APW_SPA      204      836     2566    370227    1161271
  XIN_SPA      112      463     1392    199614     795912
  TOTAL        515     2713     8062   1148426    3963851

The following tables present "#DOCs", "K-wrds" and "TextKB" broken down
by source and DOC type:

  type   source      #DOCs    K-wrds    TextKB

  advis  afp_spa     45205     20394    143579
         apw_spa     10973      6089     41122
         xin_spa         0         0         0
         TOTAL       56178     26483    184701

  multi  afp_spa     15487     12761     84848
         apw_spa    102500     53564    337337
         xin_spa     54574     29357    187356
         TOTAL      172561     95682    609541

  other  afp_spa    178487     40322    289703
         apw_spa    153662     37816    279124
         xin_spa     50036      6271     43903
         TOTAL      382185     84409    612730

  story  afp_spa   1767489    505108   3215123
         apw_spa    894136    272758   1704721
         xin_spa    691302    163986   1028091
         TOTAL     3352927    941852   5947935
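
As a rough cross-check, the "K-wrds" figures can be recomputed directly
from the data files. A hedged Python sketch, reusing the iter_docs()
helper sketched earlier and reading the 'type' attribute from the
opening DOC line:

  import re
  from collections import Counter

  TYPE_ATTR = re.compile(r'type="(\w+)"')

  def kwords_by_type(path):
      """Tally whitespace-separated tokens per DOC type, tags excluded."""
      totals = Counter()
      for doc in iter_docs(path):
          lines = doc.splitlines()
          doc_type = TYPE_ATTR.search(lines[0]).group(1)
          totals[doc_type] += sum(len(ln.split()) for ln in lines
                                  if not ln.startswith("<"))
      return totals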

GENERAL AND SOURCE-SPECIFIC PROPERTIES OF THE DATA
--------------------------------------------------

Much of the text data (all of AFP_SPA, most of APW_SPA) were received
at LDC via dedicated, 24-hour/day electronic feeds (leased phone lines
in the case of APW_SPA, a local satellite dish for AFP_SPA). These
24-hour transmission services were all susceptible to "line noise"
(occasional corruption of text content), as well as service outages
both at the data source and at our receiving computers. Usually, the
various disruptions of a newswire data stream would leave tell-tale
evidence, in the form of byte values falling outside the range of
printable ASCII characters, or recognizable patterns of anomalous
strings.

All XIN_SPA data, and the portion of APW_SPA data beginning with
200406, were received as bulk electronic text archives via internet
retrieval. As such, they were not susceptible to modem line-noise or
related disruptions, though this does not guarantee that the source
data are free of mishaps.

The more recent APW_SPA data have been received via internet-based
subscription systems, whereby first-issue stories and editing updates
are sent throughout the day to a dedicated client process running at
the LDC; this process maintains a local database and story cache that
holds the latest version of each distinct story for a limited number of
days (in contrast to the older modem-based service, where updated
versions and editing directives simply accumulated in an ever-growing
data stream). In the new setup, the harvesting of stories into the
growing archive is simply a matter of taking a daily snapshot of the
client-program's story cache, removing stories from the snapshot if
they had been captured on a previous day, and adding the remainder to
the archive. As a result, the data collected in this manner tend to
include less duplication of story content (because repeated
transmissions of a given story, with or without minor edits, are
generally not retained in the final archive).

All the data have undergone a consistent degree of quality control, to
eliminate out-of-band content and other obvious forms of corruption.
However, in the collection of some of the stories, line noise caused
corrupted characters to be recorded. We have included listings --
"docs/*_line_noise.txt" -- that give each affected document id followed
by the affected lines.

Naturally, since the source data are all generated manually on a daily
basis, there will be a small percentage of human errors common to all
sources: missing whitespace, incorrect or variant spellings, badly
formed sentences, and so on, as are normally seen in newspapers. No
attempt has been made to address this property of the data.

As indicated above, a common feature of the modem-based archives is
that stories may be repeated in the course of daily transmissions (or
daily archiving). Sometimes a later transmission of a story comes with
minor alterations (fixed spelling, one or more paragraphs added or
removed); but just as often, the collection ends up with two or more
DOCs that are fully identical. In general, though, this practice
affects a relatively small minority of the overall content.

Some of the internet-based delivery methods involved a more significant
amount of duplicate content being received, e.g. because a given story
was delivered with identical content under two or more news categories.
The current release eliminates these duplications, including many that
had been present in the previous release. As an aid to users who worked
with the previous release, we have included listings --
"docs/*_spa_dups_removed.txt" -- which contain two tab-delimited
columns: the first column gives the doc-id of the story that was
retained for a given set of identical stories, and the second column
gives a space-separated list of one or more doc-ids that were removed
from the corpus. In effect, if users of previous releases have any sort
of citation or logging of doc-ids that were removed as duplicates, the
"dups_removed" table will show which doc-id, still present in the
corpus, contains the same story text.

In addition to the duplicated content, we also detected some
duplication in the document ids of Agence France-Presse and Xinhua News
Agency. Changes were made to the affected documents to ensure that all
document ids are unique. We have included a space-separated table --
"docs/*_spa_ids_replaced.txt" -- with the following column contents,
from left to right: new doc id, file name, old doc id, dateline,
headline, and the first part of the story's content.
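
For users reconciling doc-ids against the previous release, the mapping
is straightforward to load. A minimal sketch (the concrete file name is
illustrative; one listing exists per source):

  # Map each removed doc-id to the doc-id retained in its place.
  removed_to_kept = {}
  with open("docs/afp_spa_dups_removed.txt", encoding="utf-8") as f:
      for line in f:
          kept, removed = line.rstrip("\n").split("\t")
          for doc_id in removed.split():
              removed_to_kept[doc_id] = kept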

Finally, some of the modem services typically show a practice of
breaking long stories into chunks, and sending the chunks as separate
DOC units, with each unit having the normal structural features of a
full story. Normally, when this sort of splitting is done, cues are
provided in the text of each chunk that allow editors to reconstruct
the full report; but these cues tend to rely heavily on editorial
skills -- it is taken for granted by each news service that the stories
will be reassembled manually as needed -- so the process of combining
the pieces into a full story is not amenable to an algorithmic
solution, and no attempt has been made to do this. On the whole,
however, this practice is quite rare in the Spanish newswire modem
services.

The following sections explain further data properties that are
particular to each source.

AFP_SPA: Intermittent noise in the satellite/modem receiving system was
particularly pervasive in the late summer of 2002 and again in the late
summer of 2003; the resulting loss of usable stories is reflected in
the relatively small sizes of the monthly files for 200208, 200209,
200308 and 200309. Symptoms of the reception problems were typically
obvious enough to be caught by algorithms, but it's possible that a
noticeable number of stories in these files contain more subtle
problems, in the form of occasional "word" tokens having garbled
orthography or inappropriate punctuation characters within or around
the word. Apart from these, the AFP_SPA content shows a high degree of
internal consistency, in terms of day-to-day content and typographic
conventions.

APW_SPA: During the period collected via modem, this service provided
up to six other languages besides Spanish via the same modem
connection, with DOCs in all languages interleaved at random. Of
course, we have extracted just the Spanish content for publication
here, based on flags provided in the header of each story; however, we
did notice a handful of cases where the header information was
incorrect, so the modem portion as a whole may contain a small number
of articles in English, German or French.

The service draws news from quasi-independent offices around the world,
so there tends to be more variability here in terms of typographic
conventions; there is also a noticeably higher percentage of non-story
content, especially in the "other" category: tables of sports results,
stocks, weather, etc.

Among the data added in this edition, the files from January 2004 were
particularly noisy due to transmission errors. We have removed the
documents containing obvious noise from this month.

XIN_SPA: The Xinhua Spanish news archive provided fairly consistent
formatting and coverage from the beginning of the collection, making it
fairly easy to prepare for research use. Many stories have the distinct
flavor of an official government information source, in contrast to the
other news services represented here.

ADDITIONAL INFORMATION AND UPDATES
----------------------------------

Additional information, updates, and bug fixes may be available in the
LDC catalog entry for this corpus (LDC2009T21) at:

  http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T21

--------------------------------------------------------------------------
Original README file written November 2006 by David Graff; updated June
2011 by Angelo Mendonca and October 2011 by Daniel Jaquette, Linguistic
Data Consortium.