The DOC directory contains samples of document units (news articles) from each of the sources in the corpus, with brief explanations of the SGML markup employed in each collection of texts.
Four of the five sources provided here have been collected by means of continuous feeds from the news providers over modem connections. Incoming data from each modem was spooled directly to a "raw collection" file on a daily basis, and the raw files were then processed to produce a consistent format for release by the LDC.
We have taken a variety of steps to remove articles that were corrupted by failures or noise in modem transmission. The kinds of corruption that we were able to eliminate include truncated articles (where a valid end-of-article sequence is not observed before the next valid start-of-article sequence) and invalid character codes within the text segment of articles. Some corruptions may have occurred that did not produce these symptoms (e.g. service interruptions that might cause partial loss of data within or across articles, or corruptions that garble the content but happen not to produce any invalid character codes). At present we have no means of detecting these more subtle problems in the data, but we expect that they are relatively infrequent.
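The two checks described above can be illustrated with a minimal Python sketch. It is not the processing that was actually applied to the raw files. The START and END byte markers are hypothetical stand-ins for whatever start-of-article and end-of-article sequences a given feed uses, and the definition of an "invalid character code" (control bytes other than tab, line feed and carriage return, plus the 0x7F-0x9F range) is an assumption made for this example only.

```python
import re

# Hypothetical markers: real start/end-of-article sequences vary by feed.
START = b"\x01"
END = b"\x04"

# Assumption for illustration: bytes treated as "invalid" are control codes
# other than tab, LF and CR, plus DEL and the 0x80-0x9F range.
INVALID = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def clean_articles(raw: bytes) -> list:
    """Return bodies of articles that are terminated and contain no invalid bytes."""
    kept = []
    # Each chunk after the first begins immediately after a START marker.
    for chunk in raw.split(START)[1:]:
        body, sep, _ = chunk.partition(END)
        if not sep:
            # Truncated: another START (or end of data) arrived before END.
            continue
        if INVALID.search(body):
            # Contains at least one byte from the assumed invalid set.
            continue
        kept.append(body)
    return kept
```

As the paragraph above notes, checks of this kind cannot catch data that was lost or garbled without producing either symptom.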
The consistent format chosen for release consists of SGML tagging (since this gives a fairly simple and self-explanatory presentation of the data), and the ISO-8859-1 (Latin1) 8-bit character set. Our general strategy for SGML tagging is as follows:
All document units (articles) are bounded by the tags <DOC> and </DOC>, and within these units, the text content of each article is bounded by <TEXT> and </TEXT>. Following each <DOC> tag is a <DOCID> tag that provides a unique identifying string for that article. Other tags within the <DOC> unit (but external to <TEXT>) provide additional information that was received with the article (e.g. headline, dateline, byline, keywords, etc.). The inventory and nature of this additional information varies from one source to the next (and in some cases, from one article to the next), and this variability is reflected in the SGML tags that are used to preserve the information. Within the <TEXT> units, tagging is kept to a minimum, typically consisting only of <p> to mark paragraph boundaries.
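As an illustration of this tagging scheme, the following Python sketch pulls the identifier and paragraphs out of each document unit. It is a rough example and not an LDC-supplied tool. It assumes that the identifier is closed by a </DOCID> tag and that <p> is an unclosed paragraph marker. The SAMPLE text, including its <HEADLINE> tag and identifier, is made up for the example and is not taken from the corpus.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
# Assumption: the identifier is written as <DOCID> ... </DOCID>.
DOCID_RE = re.compile(r"<DOCID>\s*(.*?)\s*</DOCID>")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)

# Made-up sample in the general shape described above (not real corpus data).
SAMPLE = """<DOC>
<DOCID> XYZ-0001 </DOCID>
<HEADLINE> Made-up headline </HEADLINE>
<TEXT>
<p>
First paragraph of
the article.
<p>
Second paragraph.
</TEXT>
</DOC>
"""


def parse_docs(sgml: str) -> list:
    """Return one dict per <DOC> unit with its DOCID and list of paragraphs."""
    docs = []
    for doc in DOC_RE.finditer(sgml):
        unit = doc.group(1)
        docid = DOCID_RE.search(unit)
        text = TEXT_RE.search(unit)
        paragraphs = []
        if text:
            # Treat each <p> as a paragraph boundary; collapse line breaks.
            for piece in text.group(1).split("<p>"):
                cleaned = " ".join(piece.split())
                if cleaned:
                    paragraphs.append(cleaned)
        docs.append({
            "docid": docid.group(1) if docid else None,
            "paragraphs": paragraphs,
        })
    return docs
```

When reading a released file from disk, opening it with encoding="latin-1" matches the ISO-8859-1 character set described above. Source-specific tags outside <TEXT> are ignored by this sketch and would need to be handled per collection.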
The samples from the corpora are as follows: