Word Break Correction Methodology
=================================

The basic idea behind the word break correction methodology is to
build a histogram from every word in our data set except those at the
beginning or end of a line. The words at the end and beginning of each
line are the suspect words, since a line break may have split a single
word in two. For every word at the end of a line, we concatenate it
with the first word of the next line and check whether the
concatenated word exists in the histogram. If it does, i.e. it is a
correct Korean word, we treat the line break as an error and correct
it.
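
The histogram and the concatenation check can be sketched as follows.
This is a minimal illustration, not the authors' implementation: the
function names are hypothetical, and words are assumed to be separated
by whitespace.

```python
from collections import Counter

def build_histogram(lines):
    """Count every word that is neither first nor last on its line."""
    histogram = Counter()
    for line in lines:
        words = line.split()          # assumption: whitespace-separated words
        histogram.update(words[1:-1])  # interior words only
    return histogram

def is_known_when_joined(histogram, end_word, start_word):
    """True if joining the two suspect words yields a word in the histogram."""
    return (end_word + start_word) in histogram
```

Lines with one or two words contribute nothing, because they have no
interior words.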

The problem, however, is that almost every Korean word carries a
postscript (a suffix such as a particle or ending) at its end, so two
occurrences of the same basic form can be spelled differently: every
part except the postscript is the same. If we build the histogram only
from the words as they appear in the text, it may fail to recognize a
line-broken word as correct when the histogram holds the same basic
form with a different postscript.

To resolve this postscript issue, we decided to store the basic form
of Korean words in the histogram. First we made a list of Korean
postscripts (126 entries) and sorted it in decreasing order of length
in letters. The reason for sorting is that many longer postscripts in
this list contain shorter ones. For example, the postscript ???
contains the separate postscripts ?? and ?. By searching from longer
to shorter postscripts, we remove the exact postscript that a word
actually has. Whenever we find a postscript at the end of a word, we
save the word without the postscript in the histogram along with the
original word. We also store the original form because some words
that are already in basic form happen to end with the same letters as
a postscript, and stripping those letters would produce a wrong basic
form.
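
A minimal sketch of the longest-first postscript stripping is shown
below. The three postscripts are illustrative stand-ins chosen so that
the longest contains the other two; they are not the authors' 126-entry
list. The function names, and the guard that refuses to strip a word
down to nothing, are assumptions as well.

```python
from collections import Counter

# Illustrative stand-ins only; the real list has 126 entries.
POSTSCRIPTS = sorted(["는", "에서", "에서는"], key=len, reverse=True)

def strip_postscript(word, postscripts=POSTSCRIPTS):
    """Return the word without its longest matching postscript, else None."""
    for p in postscripts:  # already ordered from longest to shortest
        # Assumption: a word that is nothing but a postscript is left alone.
        if len(word) > len(p) and word.endswith(p):
            return word[:-len(p)]
    return None

def add_word(histogram, word):
    """Store the original word and, if a postscript matched, its basic form."""
    histogram[word] += 1
    base = strip_postscript(word)
    if base is not None:
        histogram[base] += 1
```

Because the list is scanned longest first, a word ending in the
three-letter postscript is never mistaken for one ending in the
one-letter postscript it contains.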

For each text file, we choose a line that does not start with "<" and
the next non-empty line. If that next line starts with "<", we skip
this pair of lines. Given two consecutive text lines, we take the
last word of the upper line, word1, and the first word of the lower
line, word2. If word1 consists only of digits, dots, or commas, the
line is most likely broken (in Korean sentences, most numbers are
followed by a trailing word), so we mark this case as "line broken". On
the other hand, if word1 is composed of letters but word2 consists
only of digits, dots, or commas, the line break is correct, because
when a number is attached to letters in Korean it always appears to
the left of the letters. Finally, if word1 contains only English
characters and word2 starts with a letter of the alphabet, an English
acronym has been split by the line break, so we mark this case as line
broken.
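
The three special cases can be expressed as a small classifier. This is
a sketch under stated assumptions: the function name and the
three-valued return are hypothetical, "numbers" is read as ASCII
digits, and "starts with alphabet" is read as starting with an ASCII
letter.

```python
import re

NUMERIC = re.compile(r"[0-9.,]+")   # digits, dots, and commas only
ENGLISH = re.compile(r"[A-Za-z]+")  # English letters only

def classify_special_case(word1, word2):
    """Return True (line broken), False (break is correct), or None."""
    if NUMERIC.fullmatch(word1):
        return True   # a number is usually followed by a trailing word
    if NUMERIC.fullmatch(word2):
        return False  # a number never follows letters within one word
    if ENGLISH.fullmatch(word1) and re.match(r"[A-Za-z]", word2):
        return True   # an English acronym split across the line break
    return None       # undecided; fall through to the histogram lookup
```

Returning None for the undecided case keeps these rules separate from
the histogram lookup that follows them.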

After handling the special cases above, we use the histogram to
determine whether word1 concatenated with word2 is a correct Korean
word. First we check whether the concatenated word itself exists in
the histogram. If it does, we mark the case as line broken. Otherwise
we search the end of the concatenated word for a postscript, going
from longer to shorter postscripts. If a postscript matches, we strip
it and check whether the resulting basic form exists in the histogram.
We mark the case as line broken if the basic form is found.
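
The two-stage lookup might look like the sketch below. The postscripts
are illustrative stand-ins for the real list, the function name is
hypothetical, and stopping at the first (longest) matching postscript
is an assumption; the text does not say whether shorter postscripts
are tried when the longest one yields no match.

```python
# Illustrative stand-ins only; the real list has 126 entries.
POSTSCRIPTS = sorted(["는", "에서", "에서는"], key=len, reverse=True)

def is_line_broken(histogram, word1, word2, postscripts=POSTSCRIPTS):
    """True if word1 + word2, or its basic form, is in the histogram."""
    joined = word1 + word2
    if joined in histogram:          # stage 1: exact match
        return True
    for p in postscripts:            # stage 2: longest postscript first
        if len(joined) > len(p) and joined.endswith(p):
            # Assumption: only the first matching postscript is tried.
            return joined[:-len(p)] in histogram
    return False
```

Any container that supports membership tests (a set, a dict, or a
Counter) can serve as the histogram here.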

Finally, we correct each line-broken case by replacing word1 with the
concatenated word and removing word2 from the next line. We repeat
this process until we reach the end of the file.
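
The correction pass over a file can be sketched as a single loop. This
is an illustration rather than the authors' code: the function name is
hypothetical, the broken-line decision is passed in as a callable
is_broken(word1, word2), and rejoining words with single spaces is a
simplification that does not preserve the original spacing.

```python
def correct_line_breaks(lines, is_broken):
    """Join words split across line breaks; returns a new list of lines."""
    lines = list(lines)
    i = 0
    while i < len(lines):
        j = i + 1
        while j < len(lines) and not lines[j].strip():
            j += 1                   # find the next non-empty line
        if j >= len(lines):
            break                    # reached the end of the file
        upper, lower = lines[i].split(), lines[j].split()
        is_text_pair = not lines[i].startswith("<") and not lines[j].startswith("<")
        if upper and is_text_pair and is_broken(upper[-1], lower[0]):
            upper[-1] += lower[0]    # replace word1 with the joined word
            lines[i] = " ".join(upper)
            lines[j] = " ".join(lower[1:])  # remove word2 from the next line
        i = j
    return lines
```

For example, with a predicate that accepts only the pair forming 학교:

```python
fixed = correct_line_breaks(
    ["<doc>", "우리 학", "", "교 에서", "</doc>"],
    lambda a, b: a + b == "학교",
)
```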

Authors: Angelo Mendonca, Yeo Ho Yoon