Korean Newswire Second Edition, Linguistic Data Consortium (LDC) catalog number LDC2010T19 and ISBN 1-58563-564-2, is an archive of Korean newswire text that LDC acquired from the Korean Press Agency over several years (1994-2009). This release includes all of the content of Korean Newswire LDC2000T45 (June 1994-March 2000) as well as newly collected data.
New in the Second Edition
The second edition contains all data collected by LDC from April 2000 through December 2009.
All material, including that from the first release, has been converted to UTF-8 (except for more recent data already in UTF-8 format) and processed into LDC's Gigaword format. The Gigaword format classifies newswire content into three types: story, multi and other. Here story refers to an article containing information pertaining to a particular event on a day; multi refers to an article that contains more than one story relating to different topics; and other refers to an article containing lists, tables or numerical data, such as sports scores.
A word break error in the original release and in data collected from January 2002 through February 2005 has been corrected in the second edition, with the result that all Korean text should display correctly. The error placed a line break in the middle of a word, so that the affected word appeared as segments on two lines. The problem was resolved using word histograms and a few common rules based on heuristics from the data, yielding a word break correction rate of 90%-95%. Further information about the word break correction procedure is available in Word_Break_Correction_Procedure.txt.
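As a rough illustration of how a word histogram can guide this kind of repair, the sketch below rejoins a line-final fragment with the following line-initial fragment when the joined form is attested more often in a reference histogram than either fragment on its own. It is a toy under stated assumptions, not the procedure documented in Word_Break_Correction_Procedure.txt: the function names, the decision rule and the sample sentences are all hypothetical.

```python
from collections import Counter


def build_histogram(lines):
    """Count space-separated tokens across the given lines of reference text."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def rejoin_broken_words(lines, histogram):
    """Merge the last token of a line with the first token of the next line
    when the joined form is more frequent in the histogram than either
    fragment (hypothetical rule, chosen only for illustration)."""
    tokens_per_line = [line.split() for line in lines]
    for i in range(len(tokens_per_line) - 1):
        cur, nxt = tokens_per_line[i], tokens_per_line[i + 1]
        if not cur or not nxt:
            continue  # nothing to join across an empty line
        head, tail = cur[-1], nxt[0]
        joined = head + tail
        # Counter returns 0 for unseen tokens, so unknown fragments lose.
        if histogram[joined] > max(histogram[head], histogram[tail]):
            cur[-1] = joined
            del nxt[0]
    return [" ".join(tokens) for tokens in tokens_per_line]


# Hypothetical reference text with intact words.
reference = ["오늘 서울에서 회의가 열렸다", "서울에서 행사가 열렸다"]
histogram = build_histogram(reference)

# A word split across two lines, as in the error described above.
broken = ["오늘 서울", "에서 회의가 열렸다"]
print(rejoin_broken_words(broken, histogram))
# ['오늘 서울에서', '회의가 열렸다']
```

A single frequency comparison like this will both miss some breaks and join some legitimate word pairs, which is consistent with a correction rate below 100% for any purely heuristic approach.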
The following table shows, for each Gigaword classification, the number of documents in the classification (# DOCS), the number of space-separated word tokens in the text, in thousands (K-WORDS), and the uncompressed file size in kilobytes (TextKB):
| Type || # DOCS || K-WORDS || TextKB |
| story || 217052 || 37546 || 371722 |
| multi || 31 || 21 || 239 |
| other || 7318 || 1034 || 8375 |
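For readers who want corpus-wide totals or a sense of typical document length, the figures in the table above can be combined as in the following sketch. It assumes that K-WORDS counts thousands of tokens, which is what the column name suggests; the names ROWS, totals and mean_tokens_per_doc are illustrative only.

```python
# Rows copied from the table above: (type, documents, K-WORDS, TextKB).
ROWS = [
    ("story", 217052, 37546, 371722),
    ("multi", 31, 21, 239),
    ("other", 7318, 1034, 8375),
]


def totals(rows):
    """Return the column sums as (documents, K-WORDS, TextKB)."""
    docs = sum(row[1] for row in rows)
    kwords = sum(row[2] for row in rows)
    text_kb = sum(row[3] for row in rows)
    return docs, kwords, text_kb


def mean_tokens_per_doc(docs, kwords):
    """Average tokens per document, assuming K-WORDS is in thousands."""
    return kwords * 1000 / docs


for name, docs, kwords, _ in ROWS:
    print(f"{name}: {mean_tokens_per_doc(docs, kwords):.1f} tokens per document")

print("totals (docs, K-WORDS, TextKB):", totals(ROWS))
```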
The directory structure of the corpus is as follows:
.
|-common_files
|---docs
|---dtd
|-kor_nw_p1v2
|---data
data: This directory contains the corpus files. Each file contains data collected during the course of a month. For example, the file kpa_kor_199406 contains data collected in June 1994. Each document in a file has a fixed SGML structure governed by a DTD. Consult the DTD for more information regarding the SGML structure of a single article. Not all articles have information in all the tag fields. The DTD mandates that every article must have a DOC tag and a BODY tag. The HEADLINE, DATELINE and P tags are optional. Within the units, tagging is kept to a minimum, typically consisting only of tags to mark paragraph boundaries.
The unique KPA_KOR_yyyymmdd.nnnn string in the DOC tag is interpreted as follows:
yyyy = Year
mm = Month
dd = Day
nnnn = Sequence Number
For all articles that share the same yyyymmdd docid string, the nnnn substring ensures that the docid is unique in the corpus.
docs: Contains corpus documentation.
dtd: Contains the DTD for the corpus.
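The docid convention described for the data directory can be unpacked with a short script such as the sketch below. It assumes that nnnn is always exactly four digits and that a document is stored in the monthly file matching its docid date; neither point is stated outright in this description, so check both against the DTD and the data. The names parse_docid and monthly_filename, and the example docid, are hypothetical.

```python
import re
from datetime import date
from typing import NamedTuple

# KPA_KOR_yyyymmdd.nnnn, with nnnn assumed to be exactly four digits.
DOCID_RE = re.compile(r"^KPA_KOR_(\d{4})(\d{2})(\d{2})\.(\d{4})$")


class DocId(NamedTuple):
    day: date      # yyyy, mm and dd combined
    sequence: int  # nnnn, unique among documents sharing the same date


def parse_docid(docid: str) -> DocId:
    """Split a docid into its date and sequence number."""
    match = DOCID_RE.match(docid)
    if match is None:
        raise ValueError(f"not a KPA_KOR docid: {docid!r}")
    year, month, day_of_month, sequence = (int(g) for g in match.groups())
    # date() raises ValueError for impossible dates such as month 13.
    return DocId(date(year, month, day_of_month), sequence)


def monthly_filename(docid: str) -> str:
    """Name of the monthly file expected to hold the document (assumption:
    the file month equals the month in the docid)."""
    day = parse_docid(docid).day
    return f"kpa_kor_{day.year:04d}{day.month:02d}"


example = "KPA_KOR_19940601.0001"  # hypothetical docid
print(parse_docid(example))
print(monthly_filename(example))
```

Returning a date object rather than three integers means malformed dates are rejected at parse time and documents can be sorted or grouped by day directly.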
For an example of the data in this corpus, please review this sample file.
Additional information, updates and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T19.
Portions © 1994-2009 Korean Press Agency, © 2000, 2010 Trustees of the University of Pennsylvania