README File for the Korean Newswire Second Edition
================================================

Second Edition
=============

Title: Korean Newswire Second Edition
CatalogID: LDC2010T19
Release date: October 15, 2010
Linguistic Data Consortium
Authors: Angelo Mendonca, Andy Cole, Kevin Walker, Denise DiPersio

INTRODUCTION:
------------

This CD contains the Korean Newswire Second Edition Corpus, produced by the
Linguistic Data Consortium (LDC), catalog number LDC2010T19,
ISBN 1-58563-564-2.

This corpus consists of all the contents of the previous version, LDC2000T45
(articles from June 2, 1994, to March 20, 2000), in addition to the data
collected from April 2000 to December 2009. However, there are periods
between 2000 and 2002 during which no data was collected.

More information regarding the corpus is available at the following link:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC200T19

The corpus articles have been classified, based on their content, into three
types: story, multi and other.

multi: Article contains more than one story referring to different topics.
other: Article holds lists, tables or scores.
story: Article contains information pertaining to a particular event on a day.

                 # DOCS    Kwords    TextKB
story
    kpa_kor      217070     37546    371722
    TOTAL        217070     37546    371722
multi
    kpa_kor          31        21       239
    TOTAL            31        21       239
other
    kpa_kor        7318      1034      8375
    TOTAL          7318      1034      8375

PACKAGE STRUCTURE:
-----------------

.
|-common_files
|---docs
|---dtd
|-kor_nw_p1v2
|---data

data: This directory contains the actual files of the corpus. All the data
has been converted to UTF-8. Each file contains the data collected during
one month. For example, the file kpa_kor_199406 contains the data collected
in June 1994. Each document contained in the files has a fixed SGML
structure governed by a DTD.
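As an illustration of the monthly file naming described above, the following
is a minimal Python sketch, not part of the corpus tooling, that splits a name
such as kpa_kor_199406 into its source prefix, year and month. The function
name parse_month_file, the MonthFile type, and the tolerance for a trailing
file extension are assumptions made for this example only.

```python
import re
from typing import NamedTuple


class MonthFile(NamedTuple):
    """Parts of a monthly data file name (hypothetical helper type)."""
    source: str  # e.g. "kpa_kor"
    year: int    # four-digit year, e.g. 1994
    month: int   # month number, 1-12


# <source>_<yyyymm>, optionally followed by an extension (assumed, not
# stated in this README).
_NAME = re.compile(
    r"^(?P<source>[a-z]+_[a-z]+)_(?P<year>\d{4})(?P<month>\d{2})(?:\..*)?$"
)


def parse_month_file(name: str) -> MonthFile:
    """Split a monthly file name into source, year and month."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"not a monthly data file name: {name!r}")
    month = int(m["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return MonthFile(m["source"], int(m["year"]), month)
```

For instance, parse_month_file("kpa_kor_199406") yields the source "kpa_kor",
the year 1994 and the month 6, which is convenient when selecting files for a
given date range.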
The articles provided here were collected by means of a continuous feed from
the news provider over a modem connection. Incoming data from the modem was
spooled directly to a "raw collection" file on a daily basis, and the raw
files were then processed to produce the following format for release by the
LDC.

All material, including that from the first release, has been converted to
UTF-8 (except for more recent data already in UTF-8 format) and processed in
LDC's gigaword format. The gigaword format classifies newswire content into
three types: story, multi and other, where "story" refers to an article
containing information pertaining to a particular event on a day; "multi"
refers to an article that contains more than one story relating to different
topics; and "other" refers to articles containing lists, tables or numerical
data, such as sports scores.

A word break error in the original release and in data collected from
January 2002 through February 2005 has been corrected in the second edition,
with the result that all Korean text should display correctly. The error
involved a line break in the middle of a word, so that an affected word
appeared in segments on two lines. This problem was resolved using word
histograms and a few common rules based on heuristics from the data, which
yielded a word break correction rate of 90% to 95%. Further information
about the word break correction procedure is available in
Word_Break_Correction_Procedure.txt.

We have taken steps to remove articles that were corrupted by failures or
noise in data transmission. The kinds of corruption that we were able to
eliminate include truncated articles (a valid end-of-article sequence is not
observed before a valid start-of-article) and invalid character codes within
the text segment of articles. Some corruption may have occurred that did not
produce these symptoms (e.g. service interruptions that might cause partial
loss of data within or across articles, or corruptions that garble the
content but happen not to produce any invalid character codes). At present
we have no means of detecting these more subtle problems in the data, but we
expect that they are relatively infrequent.

The format chosen for release consists of SGML tagging, since this gives a
fairly simple and self-explanatory presentation of the data. The SGML
tagging is as follows:

# there may be several paragraphs within each article. Korean Text.

Consult the DTD for more information regarding the structure. Not all
articles have information in all the tag fields. Within the units, tagging
is kept to a minimum, typically consisting only of tags that mark paragraph
boundaries. The data files were validated using nsgmls and the kn.dtd.

The unique KPA_KOR_yyyymmdd.nnnn string in the DOC tag is interpreted as
follows:

yyyy = Year
mm   = Month
dd   = Day
nnnn = Sequence Number

For all articles that share the same yyyymmdd docid string, the nnnn
substring ensures that the docid is unique in the corpus.

docs: Contains documentation, including information pertaining to the
corpus.

dtd: Contains the DTD for the corpus.

Acknowledgements:

The contributions of the following are gratefully acknowledged:

Programmer and native Korean speaker Yeo Ho Yoon, for his valuable work in
resolving the word break error problem referred to above.

Programmers David Graff and Robert Parker, for their expert advice and
review throughout the corpus development process.

Material in this corpus is covered by the following copyrights:

Portions © 1994-2009 Korean Press Agency,
© 2000, 2010 Trustees of the University of Pennsylvania
All Rights Reserved