Spanish Newswire Text, Volume 2
Item Name: | Spanish Newswire Text, Volume 2 |
Author(s): | David Graff, Gustavo Gallegos |
LDC Catalog No.: | LDC99T41 |
ISBN: | 1-58563-162-0 |
ISLRN: | 581-480-117-182-1 |
DOI: | https://doi.org/10.35111/q9gq-nf98 |
Member Year(s): | 1999 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Project(s): | TIDES, GALE |
Application(s): | information retrieval, language modeling |
Language(s): | Spanish |
Language ID(s): | spa |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC99T41 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Graff, David, and Gustavo Gallegos. Spanish Newswire Text, Volume 2 LDC99T41. Web Download. Philadelphia: Linguistic Data Consortium, 1999. |
Introduction
This release of Spanish newswire contains data from the following sources:
Data
The consistent format chosen for release consists of SGML tagging and the ISO-8859-1 (Latin1) 8-bit character set. Our general strategy for SGML tagging is as follows:
All document units (articles) are bounded by the tags <DOC> and </DOC>, and within these units, the text content of each article is bounded by <TEXT> and </TEXT>. Following each <DOC> tag is a <DOCID> tag that provides a unique identifying string for that article. Other tags within the DOC unit (but external to TEXT) provide additional information that was received with the article (e.g. headline, dateline, byline, keywords, etc.). The inventory and nature of this additional information varies from one source to the next (and in some cases, from one article to the next), and this variability is reflected in the SGML tags that are used to preserve the information. Within the TEXT units, tagging is kept to a minimum, typically consisting only of paragraph tags.
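The tagging strategy described above can be read with a few regular expressions. The following is a minimal sketch in Python, not an official LDC tool. It assumes that the DOCID element has a closing tag, that paragraphs are marked with a bare `<p>` tag, and that tag names are upper case as written above; none of these details are confirmed by this description, so check them against the actual files. The function name `parse_documents` and the sample article are hypothetical.

```python
import re

# Assumed tag shapes: <DOC>...</DOC>, <DOCID>...</DOCID>, <TEXT>...</TEXT>.
# The paragraph tag name <p> is an assumption; the description only says
# "paragraph tags".
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCID_RE = re.compile(r"<DOCID>(.*?)</DOCID>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
PARA_RE = re.compile(r"</?p>", re.IGNORECASE)


def parse_documents(raw: bytes) -> list[dict]:
    """Split raw corpus bytes into articles with an ID and paragraphs."""
    # The release uses the ISO-8859-1 (Latin1) 8-bit character set.
    data = raw.decode("iso-8859-1")
    docs = []
    for match in DOC_RE.finditer(data):
        body = match.group(1)
        docid = DOCID_RE.search(body)
        text = TEXT_RE.search(body)
        paragraphs = []
        if text:
            for chunk in PARA_RE.split(text.group(1)):
                # Collapse line breaks and runs of spaces inside a paragraph.
                chunk = " ".join(chunk.split())
                if chunk:
                    paragraphs.append(chunk)
        docs.append({
            "docid": docid.group(1).strip() if docid else None,
            "paragraphs": paragraphs,
        })
    return docs


# Hypothetical sample article, invented for illustration only.
sample = (
    "<DOC>\n"
    "<DOCID> EXAMPLE-0001 </DOCID>\n"
    "<HEADLINE> Titular de ejemplo </HEADLINE>\n"
    "<TEXT>\n"
    "<p>\nPrimer párrafo de la noticia.\n"
    "<p>\nSegundo párrafo.\n"
    "</TEXT>\n"
    "</DOC>\n"
).encode("iso-8859-1")

docs = parse_documents(sample)
```

Because the additional tags outside TEXT vary by source, the sketch ignores them rather than guessing at a fixed inventory; a per-source list of tag names could be added once the files have been inspected.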