File: README.txt
------------------
This single CD contains the Korean Newswire Text Corpus, produced by the
Linguistic Data Consortium (LDC); catalog number LDC2000T45, ISBN
1-58563-168-X. This corpus is a collection of Korean Press Agency news
articles from June 2, 1994, to March 20, 2000. Additional information is
available at the LDC web site, http://www.ldc.upenn.edu/Catalog, under
by year | 2000 | LDC2000T45.
The corpus includes articles from the date ranges listed below; however, not
all dates in each interval are represented by files or articles:
    1994    Jun.  2 to Dec. 31     87 files
    1995    Jan.  1 to Dec. 31    179 files
    1996    Jan.  1 to Mar. 29     83 files
    1997    Jul. 28 to Dec. 31    245 files
    1998    Jan.  2 to Dec. 31    285 files
    1999    Jan.  3 to Dec. 31    216 files
    2000    Jan.  3 to Mar. 20     56 files
The data files are in the /data subdirectory and are collected by date in files
named yyyymmdd.sgm, where yyyy = year, mm = month, and dd = day. There are
1,151 files containing 143,137 articles. The corpus probably contains some
duplicate articles.
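
As an illustration of the file naming scheme above, the following Python
sketch recovers the calendar date from a data file name. The function name
date_from_filename is hypothetical and is not part of the corpus distribution.

```python
import re
from datetime import date

def date_from_filename(name: str) -> date:
    """Return the date encoded in a yyyymmdd.sgm file name.

    Hypothetical helper for illustration; it only assumes the naming
    scheme stated in this README.
    """
    match = re.fullmatch(r"(\d{4})(\d{2})(\d{2})\.sgm", name)
    if match is None:
        raise ValueError(f"not a yyyymmdd.sgm file name: {name!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() itself raises ValueError for impossible dates such as month 13.
    return date(year, month, day)
```

For example, date_from_filename("19940602.sgm") corresponds to June 2, 1994,
the first date covered by the corpus.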
The example.doc file contains a sample news article from the corpus, to
demonstrate the SGML markup employed in the collection of text.
The articles provided here have been collected by means of a continuous feed
from the news provider over a modem connection. Incoming data from the modem
was spooled directly to a "raw collection" file on a daily basis, and the raw
files were then processed to produce the following format for release by the
LDC.
We have taken steps to remove articles that were corrupted by failures or noise
in modem transmission. The kinds of corruption that we were able to eliminate
include truncated articles (a valid end-of-article sequence is not observed
before a valid start-of-article), and invalid character codes within the text
segment of articles. Some corruption may have occurred that did not produce
these symptoms (e.g. service interruptions that might cause partial loss of
data within or across articles, or corruptions that garble the content but
happen not to produce any invalid character codes). At present we have no
means for detecting these more subtle problems in the data, but we expect that
they are relatively infrequent.
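
The truncation check described above can be sketched roughly as follows. This
is not the LDC's actual processing code: the START and END markers are
hypothetical stand-ins, since this README does not document the real
start-of-article and end-of-article sequences used in the raw feed.

```python
from typing import Iterable, List

# Hypothetical markers; the real sequences in the raw feed are not
# documented in this README.
START = "<<START>>"
END = "<<END>>"

def complete_articles(lines: Iterable[str]) -> List[List[str]]:
    """Keep only articles whose START is followed by END before the next START."""
    kept: List[List[str]] = []
    current = None  # lines of the article being read, or None if outside one
    for line in lines:
        if line == START:
            # A new START while an article is still open means the open
            # article was truncated; starting over discards it.
            current = []
        elif line == END:
            if current is not None:
                kept.append(current)
            current = None
        elif current is not None:
            current.append(line)
    # An article still open at the end of the input is also discarded.
    return kept
```

Lines that arrive outside any START/END pair are ignored in this sketch.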
The format chosen for release consists of SGML tagging (since this gives a
fairly simple and self-explanatory presentation of the data), and the KSC-5601
Korean character encoding. The SGML tagging is as follows:
    yyyymmdd.nnnn
    yyyymmdd
    Korean
    Korean Press Agency:
    # there may be several paragraphs within each article.
    Korean Text
    Time
Not all articles have information in all the tag fields. Within the article
text, tagging is kept to a minimum, typically consisting only of tags to mark
paragraph boundaries. The data files were validated using nsgmls and the
kn.dtd.
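
To work with the Korean text, the bytes must first be decoded. A minimal
Python sketch follows, assuming the files store KSC-5601 text in the common
EUC-KR byte layout, which Python exposes as the euc_kr codec (ksc5601 is one
of its aliases). The function name decode_ksc5601 is hypothetical.

```python
from typing import Optional

def decode_ksc5601(raw: bytes) -> Optional[str]:
    """Decode KSC-5601 bytes, or return None if they contain invalid codes.

    Assumption: the data uses the EUC-KR byte layout of KSC-5601.
    """
    try:
        return raw.decode("euc_kr")
    except UnicodeDecodeError:
        # Invalid character codes are one of the corruption symptoms
        # mentioned in this README.
        return None
```

A caller could apply this to the bytes of each article and set aside any
article for which the result is None.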
The unique docid strings have the format yyyymmdd.nnnn where:
yyyy = Year
mm = Month
dd = Day
nnnn = Sequence Number
For all articles that share the same yyyymmdd docid string, the nnnn substring
ensures that the docid is unique in the corpus.
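
The docid format can be split into its parts with a short Python sketch like
the one below. The names DocId and parse_docid are hypothetical, and the
sketch assumes that nnnn is always exactly four digits, which the format
string suggests but this README does not state outright.

```python
import re
from typing import NamedTuple

class DocId(NamedTuple):
    year: int
    month: int
    day: int
    sequence: int

def parse_docid(docid: str) -> DocId:
    """Split a yyyymmdd.nnnn docid string into its numeric fields."""
    # Assumption: the sequence number is exactly four digits.
    match = re.fullmatch(r"(\d{4})(\d{2})(\d{2})\.(\d{4})", docid.strip())
    if match is None:
        raise ValueError(f"not a yyyymmdd.nnnn docid: {docid!r}")
    return DocId(*(int(part) for part in match.groups()))
```

Because the result is a tuple ordered by year, month, day, and sequence
number, sorting a list of parsed docids puts the articles in date order.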
Material in this corpus is covered by the following copyrights:
Copyright 1994-2000, Korean Press Agency, All Rights Reserved