Word Break Correction Methodology
=================================

The basic idea behind the word break correction methodology is to build a histogram of known-good words and use it to test suspect words. We extracted every word from our data set except those at the beginning or end of a line and counted them in the histogram. We then read the words at the end and beginning of each line, since these are the suspect words that a line break may have split. For every word at the end of a line, we concatenate it with the first word of the next line and check whether the concatenated word exists in the histogram. If it does, the concatenation is a correct Korean word, so we treat the line break as an error and correct it.

The problem is that almost every Korean word ends with a postscript, so two occurrences of the same word can be spelled differently even though everything except the postscript is identical. A histogram built only from the words as they appear in the text may therefore fail to recognize a line-broken word when the histogram holds the same basic form with a different postscript.

To resolve this postscript issue, we store the basic form of Korean words in the histogram. First we made a list of Korean postscripts (126 entries) and sorted it in decreasing order of length. The sorting matters because many longer postscripts contain shorter postscripts from the same list. For example, a three-letter postscript can contain a separate two-letter postscript and a separate one-letter postscript. By searching from longer to shorter postscripts, we remove exactly the postscript that the word actually has. Whenever we find a postscript at the end of a word, we save the word without the postscript in the histogram along with the original word. We keep the original form because some words that are already in basic form happen to end with the same letters as a postscript.
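The histogram construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the five-entry `POSTSCRIPTS` list is a tiny stand-in for the 126-entry list, and the guard that refuses to strip a postscript when nothing would remain is an added assumption.

```python
from collections import Counter

# Illustrative subset only; the real list has 126 entries.
POSTSCRIPTS = ["에서", "으로", "로", "의", "을"]


def sort_postscripts(postscripts):
    """Order postscripts longest first so the longest match is tried first."""
    return sorted(postscripts, key=len, reverse=True)


def strip_postscript(word, sorted_postscripts):
    """Return the word without its first matching postscript, or None."""
    for ps in sorted_postscripts:
        # Assumption: never strip the whole word down to an empty string.
        if len(word) > len(ps) and word.endswith(ps):
            return word[: -len(ps)]
    return None


def build_histogram(lines, postscripts):
    """Count interior words of each line, plus their stripped basic forms."""
    sorted_ps = sort_postscripts(postscripts)
    histogram = Counter()
    for line in lines:
        words = line.split()
        # Skip the first and last word of the line: those are the suspects.
        for word in words[1:-1]:
            histogram[word] += 1  # original form
            base = strip_postscript(word, sorted_ps)
            if base is not None:
                histogram[base] += 1  # basic form
    return histogram
```

Trying the longer postscript first is what keeps a word such as "집으로" from being reduced to "집으" by the shorter postscript "로".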
For each text file, we take a line that does not start with "<" together with the next non-empty line. If that next line starts with "<", we skip the pair. Given two consecutive text lines, we take the last word of the upper line, word1, and the first word of the lower line, word2.

If word1 consists only of digits, dots, or commas, the line is most likely broken, because in Korean sentences most numbers are followed by a trailing word, so we mark this case as "line broken". Conversely, if word1 contains letters but word2 consists only of digits, dots, or commas, the line break is correct, because when a number is joined to letters in Korean it always appears to the left of the letters. Finally, if word1 contains only English characters and word2 starts with an English letter, an English acronym has been split by the line break, so we mark this case as line broken as well.

After handling these special cases, we use the histogram to decide whether word1 concatenated with word2 is a correct Korean word. First we check whether the concatenated word itself exists in the histogram; if it does, we mark the pair as line broken. Otherwise we search the concatenated word for a postscript, going from longer to shorter postscripts. If a postscript matches the end of the word, we check whether the basic form of the concatenated word exists in the histogram, and we mark the pair as line broken if it does.

Finally, we correct every line-broken case by replacing word1 with the concatenated word and removing word2 from the next line. We repeat this process until we reach the end of the file.

Authors: Angelo Mendonca, Yeo Ho Yoon
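The decision rules and the correction pass described above might look like the sketch below. It is an illustration under stated assumptions rather than the original implementation: `is_line_broken` and `correct_line_breaks` are hypothetical names, `known_words` stands for the set of histogram keys, `sorted_postscripts` is assumed to be ordered longest first, and rebuilding lines with single spaces is a simplification.

```python
import re

NUMERIC = re.compile(r"[0-9.,]+")   # digits, dots, commas only
ENGLISH = re.compile(r"[A-Za-z]+")  # English letters only


def is_line_broken(word1, word2, known_words, sorted_postscripts):
    """Decide whether word1 + word2 was one word split by a line break."""
    if NUMERIC.fullmatch(word1):
        return True   # a number usually has a trailing word
    if NUMERIC.fullmatch(word2):
        return False  # numbers sit to the left of letters, so no join
    if ENGLISH.fullmatch(word1) and ENGLISH.match(word2):
        return True   # split English acronym
    joined = word1 + word2
    if joined in known_words:
        return True
    for ps in sorted_postscripts:
        if len(joined) > len(ps) and joined.endswith(ps):
            # Only the first (longest) matching postscript is checked.
            return joined[: -len(ps)] in known_words
    return False


def correct_line_breaks(lines, known_words, sorted_postscripts):
    """Return a copy of lines with detected word breaks joined."""
    lines = list(lines)
    i = 0
    while i < len(lines):
        if not lines[i].strip() or lines[i].lstrip().startswith("<"):
            i += 1
            continue
        j = i + 1
        while j < len(lines) and not lines[j].strip():
            j += 1  # find the next non-empty line
        if j >= len(lines):
            break
        if lines[j].lstrip().startswith("<"):
            i = j + 1  # skip this pair
            continue
        upper, lower = lines[i].split(), lines[j].split()
        if is_line_broken(upper[-1], lower[0], known_words, sorted_postscripts):
            upper[-1] += lower[0]
            # Assumption: whitespace is normalized to single spaces.
            lines[i] = " ".join(upper)
            lines[j] = " ".join(lower[1:])
        i = j
    return lines
```

For example, with `known_words = {"학교"}` and `sorted_postscripts = ["에서", "의"]`, the pair of lines "우리 학" and "교에서 공부" becomes "우리 학교에서" and "공부".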