README File for the Korean Newswire Second Edition
==================================================

Title: Korean Newswire Second Edition
CatalogID: LDC2010T19
Release date: October 15, 2010
Linguistic Data Consortium
Authors: Angelo Mendonca, Andy Cole, Kevin Walker, Denise DiPersio

									
INTRODUCTION:
------------

This CD contains the Korean Newswire Second Edition Corpus, produced by the
Linguistic Data Consortium (LDC); catalog number LDC2010T19, ISBN
1-58563-564-2.  This corpus consists of all the contents of the previous
version, LDC2000T45 (articles from June 2, 1994, to March 20, 2000), plus
the data collected from April 2000 to December 2009. However, there are
periods during 2000-2002 in which no data was collected.

More information regarding the corpus is available at the following
link: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T19

The corpus articles have been classified by content into three types:
story, multi and other.

multi: Article contains more than one story referring to different topics

other: Article holds lists, tables or scores.

story: Article contains information pertaining to a particular event on a day.
 
          #   DOCS   Kwords   TextKB
story
  kpa_kor   217070    37546   371722
  TOTAL     217070    37546   371722

multi
  kpa_kor       31       21      239
  TOTAL         31       21      239

other
  kpa_kor     7318     1034     8375
  TOTAL       7318     1034     8375


PACKAGE STRUCTURE:
-----------------

   .
   |-common_files
   |---docs
   |---dtd
   |-kor_nw_p1v2
   |---data


data: This directory contains the actual files of the corpus. All the data has been
converted to UTF-8.

Each file contains the data collected during one month. For example, the file
kpa_kor_199406 contains data collected in June 1994. Each document in a file
has a fixed SGML structure governed by a DTD. The articles provided here were
collected by means of a continuous feed from the news provider over a modem
connection. Incoming data from the modem was spooled directly to a "raw
collection" file on a daily basis, and the raw files were then processed to
produce the following format for release by the LDC.
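
As an illustration of the file naming convention described above, the
following Python sketch recovers the month covered by a data file from its
name. It assumes that every file name starts with "kpa_kor_" followed by a
six-digit yyyymm stamp, as in the kpa_kor_199406 example; the function name
month_of_file is hypothetical and not part of the corpus distribution.

```python
import re
from datetime import date

# Assumed pattern: "kpa_kor_" followed by a 6-digit yyyymm stamp, as in the
# example kpa_kor_199406. Anything after the stamp (such as a file
# extension) is ignored.
FILENAME_RE = re.compile(r"^kpa_kor_(\d{4})(\d{2})")


def month_of_file(filename):
    """Return the first day of the month covered by a data file name."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    year, month = int(match.group(1)), int(match.group(2))
    # date() raises ValueError if the month is not in the range 1-12.
    return date(year, month, 1)


print(month_of_file("kpa_kor_199406"))  # 1994-06-01
```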

All material, including that from the first release, has been converted to UTF-8
(except for more recent data already in UTF-8 format) and processed into LDC's
Gigaword format. The Gigaword format classifies newswire content into three types: story,
multi and other where "story" refers to an article containing information pertaining
to a particular event on a day; "multi" refers to an article that contains more than
one story relating to different topics; and "other" refers to articles containing
lists, tables or numerical data, such as sports scores. 

A word break error in the original release and in data collected from January 2002
through February 2005 has been corrected in the second edition, so all Korean
text should display correctly. The error was a line break inserted in the middle
of a word, which split the affected word into segments on two lines. The problem
was resolved using word histograms and a few common rules based on heuristics
from the data, yielding a word break correction rate of 90% to 95%. Further
information about the word break correction procedure is available in
Word_Break_Correction_Procedure.txt.
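
To give a rough idea of how a word histogram can guide this kind of repair,
here is a minimal Python sketch. It is NOT the procedure documented in
Word_Break_Correction_Procedure.txt; it is only an illustration under stated
assumptions: the histogram is built from text believed to be free of the
error, and a line-final fragment is joined to the following line-initial
fragment when the joined form is frequent and more frequent than either
fragment. The function names, the min_count threshold and the sample text
are all hypothetical.

```python
from collections import Counter


def build_histogram(lines):
    """Count whitespace-separated words in text assumed to be clean."""
    return Counter(word for line in lines for word in line.split())


def rejoin_broken_words(lines, histogram, min_count=2):
    """Join fragments split across a line break when the joined form is a
    frequent word and more frequent than either fragment on its own."""
    tokens = [line.split() for line in lines]
    for current, following in zip(tokens, tokens[1:]):
        if not current or not following:
            continue
        head, tail = current[-1], following[0]
        joined = head + tail
        if (histogram[joined] >= min_count
                and histogram[joined] > max(histogram[head], histogram[tail])):
            current[-1] = joined   # keep the repaired word on the first line
            del following[0]       # drop the fragment from the next line
    return [" ".join(line_tokens) for line_tokens in tokens]


# Invented sample data, not taken from the corpus.
clean = ["오늘 뉴스 통신", "뉴스 통신 오늘", "뉴스"]
broken = ["오늘 뉴", "스 통신"]
print(rejoin_broken_words(broken, build_histogram(clean)))
```

A real correction pass would also need the additional rules mentioned above,
since a frequency test alone cannot decide cases where both the fragments
and the joined form are valid words.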

We have taken steps to remove articles that were corrupted by failures or noise
in data transmission.  The kinds of corruption that we were able to eliminate
include truncated articles (a valid end-of-article sequence is not observed
before a valid start-of-article), and invalid character codes within the text
segment of articles.  Some corruption may have occurred that did not produce
these symptoms (e.g. service interruptions that might cause partial loss of
data within or across articles, or corruptions that garble the content but
happen not to produce any invalid character codes).  At present we have no
means for detecting these more subtle problems in the data, but we expect that
they are relatively infrequent.

The format chosen for release consists of SGML tagging, since this gives a
fairly simple and self-explanatory presentation of the data.

The SGML tagging is as follows:

<DOC id="KPA_KOR_yyyymmdd.nnnn" type="story|multi|other">
<HEADLINE></HEADLINE>
<DATELINE></DATELINE>
<BODY>
  <TEXT>
    <P> # there may be several paragraphs within each article.
      Korean Text.
    </P>
  </TEXT>
</BODY>
</DOC>

Consult the dtd for more information regarding the structure.

Not all articles have information in all the tag fields.  Within the <TEXT>
units, tagging is kept to a minimum, typically consisting only of <P></P> tags
to mark paragraph boundaries. The data files were validated using nsgmls and
the kn.dtd.
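
Because the tagging is this simple, documents can be pulled out of a data
file with a few regular expressions. The Python sketch below is one possible
approach, not an official reader: it assumes the attributes always appear as
id="..." followed by type="...", exactly as in the template above, and the
sample document is invented for illustration. The names iter_docs, DOC_RE,
HEADLINE_RE and PARA_RE are hypothetical.

```python
import re

# Assumes the attribute order and quoting shown in the template above.
DOC_RE = re.compile(
    r'<DOC id="(?P<id>[^"]+)" type="(?P<type>story|multi|other)">'
    r"(?P<body>.*?)</DOC>",
    re.DOTALL,
)
HEADLINE_RE = re.compile(r"<HEADLINE>(.*?)</HEADLINE>", re.DOTALL)
PARA_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def iter_docs(sgml_text):
    """Yield one dict per <DOC> element found in the given text."""
    for match in DOC_RE.finditer(sgml_text):
        body = match.group("body")
        headline = HEADLINE_RE.search(body)
        yield {
            "id": match.group("id"),
            "type": match.group("type"),
            # Not all articles have a headline, so fall back to "".
            "headline": headline.group(1).strip() if headline else "",
            "paragraphs": [p.strip() for p in PARA_RE.findall(body)],
        }


# Invented sample document, not taken from the corpus.
sample = """<DOC id="KPA_KOR_19940602.0001" type="story">
<HEADLINE>
예시 제목
</HEADLINE>
<BODY>
<TEXT>
<P>
첫 번째 문단.
</P>
<P>
두 번째 문단.
</P>
</TEXT>
</BODY>
</DOC>
"""

for doc in iter_docs(sample):
    print(doc["id"], doc["type"], len(doc["paragraphs"]))
```

When reading a real data file, open it with encoding="utf-8", since all data
in the corpus is UTF-8.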

The unique KPA_KOR_yyyymmdd.nnnn string in the DOC tag
<DOC id="KPA_KOR_yyyymmdd.nnnn" type="story|multi|other">
is interpreted as follows:

 yyyy = Year
 mm   = Month
 dd   = Day
 nnnn = Sequence Number
 
For all articles that share the same yyyymmdd docid string, the nnnn substring
ensures that the docid is unique in the corpus.
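
The docid layout can be decoded mechanically. The following Python sketch
splits a docid into its date and sequence number; it assumes the sequence
number is always exactly four digits, as the nnnn placeholder suggests, and
the function name parse_docid is hypothetical.

```python
import re
from datetime import date

# Assumes a 4-digit sequence number, following the nnnn placeholder.
DOCID_RE = re.compile(r"^KPA_KOR_(\d{4})(\d{2})(\d{2})\.(\d{4})$")


def parse_docid(docid):
    """Return (date, sequence_number) for a KPA_KOR_yyyymmdd.nnnn docid."""
    match = DOCID_RE.match(docid)
    if match is None:
        raise ValueError(f"unexpected docid: {docid!r}")
    year, month, day, sequence = (int(group) for group in match.groups())
    # date() raises ValueError for impossible dates such as month 13.
    return date(year, month, day), sequence


doc_date, sequence = parse_docid("KPA_KOR_19940602.0001")
print(doc_date.isoformat(), sequence)  # 1994-06-02 1
```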


docs: Contains documentation including information pertaining to the corpus.

dtd: You will find the dtd for the corpus in this location.

Acknowledgements:

The contributions of the following are gratefully acknowledged:

Programmer and native Korean speaker Yeo Ho Yoon for his valuable work in resolving the word break error problem referred to above.

Programmers David Graff and Robert Parker for their expert advice and review throughout the corpus development process.

Material in this corpus is covered by the following copyrights:

Portions © 1994-2009 Korean Press Agency, © 2000, 2010 Trustees of the University of Pennsylvania. All Rights Reserved.