README File for the Korean Newswire Second Edition
================================================
Second Edition
=============
Title: Korean Newswire Second Edition
CatalogID: LDC2010T19
Release date: October 15, 2010
Linguistic Data Consortium
Authors: Angelo Mendonca, Andy Cole, Kevin Walker, Denise DiPersio
INTRODUCTION:
------------
This CD contains the Korean Newswire Second Edition Corpus, produced by the
Linguistic Data Consortium (LDC); catalog number LDC2010T19, ISBN
1-58563-564-2. This corpus consists of all the contents of the previous
version, LDC2000T45 (articles from June 2, 1994, to March 20, 2000), in addition to
the data collected from April 2000 to December 2009. However, there were
periods during 2000-2002 in which no data was collected.
More information regarding the corpus is available at the following
link: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T19
The corpus articles have been classified, based on their content, into
three different types: story, multi and other.
multi: Article contains more than one story referring to different topics.
other: Article holds lists, tables or scores.
story: Article contains information pertaining to a particular event on a day.
             # DOCS    Kwords    TextKB
story
  kpa_kor    217070     37546    371722
  TOTAL      217070     37546    371722
multi
  kpa_kor        31        21       239
  TOTAL          31        21       239
other
  kpa_kor      7318      1034      8375
  TOTAL        7318      1034      8375
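The table gives totals per document type only. As a quick cross-check, the
following minimal Python sketch sums the three columns across all types. The
figures are copied from the table above; the names COUNTS and corpus_totals
are illustrative and are not part of the release.

```python
# Per-type counts copied from the table above: (documents, Kwords, text KB).
COUNTS = {
    "story": (217070, 37546, 371722),
    "multi": (31, 21, 239),
    "other": (7318, 1034, 8375),
}


def corpus_totals(counts):
    """Sum each column (docs, Kwords, TextKB) across all document types."""
    docs = sum(row[0] for row in counts.values())
    kwords = sum(row[1] for row in counts.values())
    text_kb = sum(row[2] for row in counts.values())
    return docs, kwords, text_kb


if __name__ == "__main__":
    # Prints the corpus-wide totals as a (docs, Kwords, TextKB) tuple.
    print(corpus_totals(COUNTS))
```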
PACKAGE STRUCTURE:
-----------------
.
|-common_files
|---docs
|---dtd
|-kor_nw_p1v2
|---data
data: This directory contains the actual files of the corpus. All the data has been
converted to UTF-8.
Each file contains data collected during the course of one month. For example,
the file kpa_kor_199406 contains data that was collected in June 1994.
Each of the documents contained in the files has a fixed SGML structure
governed by a DTD. The articles provided here were collected by means of
a continuous feed from the news provider over a modem connection.
Incoming data from the modem was spooled directly to a "raw collection" file
on a daily basis, and the raw files were then processed to produce the following
format for release by the LDC.
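The monthly file naming can be decoded mechanically. The sketch below is an
assumption based only on the single example kpa_kor_199406 (source prefix,
language code, then YYYYMM); the function name parse_data_filename is
hypothetical, and any file extension after the date is simply ignored.

```python
import re
from typing import Tuple

# Pattern assumed from the one example "kpa_kor_199406":
# <source>_<language>_<YYYYMM>. Anything after the date is not examined.
_NAME_RE = re.compile(
    r"^(?P<src>[a-z]+)_(?P<lang>[a-z]+)_(?P<year>\d{4})(?P<month>\d{2})"
)


def parse_data_filename(name: str) -> Tuple[str, str, int, int]:
    """Return (source, language, year, month) for a monthly data file name."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized data file name: {name!r}")
    year = int(match["year"])
    month = int(match["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in data file name: {name!r}")
    return match["src"], match["lang"], year, month


if __name__ == "__main__":
    print(parse_data_filename("kpa_kor_199406"))
```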
All material, including that from the first release, has been converted to UTF-8
(except for more recent data already in UTF-8 format) and processed in LDC's gigaword
format. The gigaword format classifies newswire content into three types: story,
multi and other where "story" refers to an article containing information pertaining
to a particular event on a day; "multi" refers to an article that contains more than
one story relating to different topics; and "other" refers to articles containing
lists, tables or numerical data, such as sports scores.
A word break error in the original release and in data collected from January 2002
through February 2005 has been corrected in the second edition with the result that
all Korean text should display correctly. The error involved a line break in the
middle of a word with the result that an affected word appeared in segments in two
lines. This problem was resolved using word histograms and a few common rules
based on heuristics from the data, yielding a word break correction rate of
90-95%. Further information about the word break correction procedure is available in
Word_Break_Correction_Procedure.txt.
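The actual procedure is the one described in
Word_Break_Correction_Procedure.txt. Purely as an illustration of the general
idea of histogram-based rejoining, the toy Python sketch below merges the last
token of a line with the first token of the next line whenever the merged form
is more frequent in a reference histogram than the rarer of the two fragments.
The decision rule, the function name and the sample counts are all assumptions
made for this example and are not LDC's method.

```python
from collections import Counter
from typing import List


def rejoin_broken_words(lines: List[str], hist: Counter) -> List[str]:
    """Rejoin words split across a line break, using a word histogram.

    Assumed rule (illustrative only): join the fragments when the joined
    form is attested more often than the rarer of the two fragments.
    """
    token_lines = [line.split() for line in lines]
    for cur, nxt in zip(token_lines, token_lines[1:]):
        if not cur or not nxt:
            continue  # nothing to join across an empty line
        head, tail = cur[-1], nxt[0]
        joined = head + tail
        if hist[joined] > min(hist[head], hist[tail]):
            cur[-1] = joined  # keep the whole word on the earlier line
            del nxt[0]        # and drop the fragment from the later line
    return [" ".join(tokens) for tokens in token_lines]


if __name__ == "__main__":
    # Hypothetical reference counts, invented for this example.
    reference = Counter(
        {"조선중앙통신": 40, "조선중앙": 3, "통신": 25, "평양": 50, "보도": 30}
    )
    for fixed in rejoin_broken_words(["평양 조선중앙", "통신 보도"], reference):
        print(fixed)
```

A real histogram would need to be built from text known to be free of the
word break error, since counts taken from affected text would include the
fragments themselves.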
We have taken steps to remove articles that were corrupted by failures or noise
in data transmission. The kinds of corruption that we were able to eliminate
include truncated articles (a valid end-of-article sequence is not observed
before a valid start-of-article), and invalid character codes within the text
segment of articles. Some corruption may have occurred that did not produce
these symptoms (e.g. service interruptions that might cause partial loss of
data within or across articles, or corruptions that garble the content but
happen not to produce any invalid character codes). At present we have no
means for detecting these more subtle problems in the data, but we expect that
they are relatively infrequent.
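The two detectable kinds of corruption described above can be sketched as a
simple filter. The raw feed format is not documented here, so the START and
END marker values below are hypothetical placeholders, as is the function
name; the sketch only shows the logic of dropping an article when a new start
is seen before an end, or when its bytes do not decode cleanly.

```python
from typing import Iterable, List

# Hypothetical markers: the real start/end-of-article sequences of the raw
# feed are not given in this README.
START = b"@@START"
END = b"@@END"


def extract_clean_articles(
    raw_lines: Iterable[bytes], encoding: str = "utf-8"
) -> List[str]:
    """Keep only complete articles whose bytes decode without errors."""
    articles: List[str] = []
    body = None  # None means we are currently outside any article
    for line in raw_lines:
        marker = line.strip()
        if marker == START:
            # A start while an article is still open means the open one
            # was truncated, so it is silently discarded.
            body = []
        elif marker == END:
            if body is not None:
                try:
                    articles.append(b"".join(body).decode(encoding, errors="strict"))
                except UnicodeDecodeError:
                    pass  # invalid character codes: drop the article
            body = None
        elif body is not None:
            body.append(line)
    return articles


if __name__ == "__main__":
    sample = [START, b"good\n", END, START, b"cut off\n", START, b"ok\n", END]
    print(extract_clean_articles(sample))
```

As noted above, a filter of this kind cannot detect corruption that leaves
both the markers and the character codes valid.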
The format chosen for release consists of SGML tagging, since this gives a
fairly simple and self-explanatory presentation of the data.
The SGML tagging is as follows:
# there may be several paragraphs within each article.
Korean Text.