This directory contains the documentation for the Hunglish CDROM.
The NLP toolchain used to create the corpus grew out of work on extending and improving Hunspell, the Hungarian open source spellchecker used in OpenOffice (pdf ps), as part of the WordSword (SzóSzablya) project. The project is described here both in Hungarian, in a paper for the 1st Hungarian Computational Linguistics Conference held in November 2003 (pdf ps), and in English, in a paper for LREC04 (pdf ps). The LREC04 paper also provides details on the frequency count included on this CD and on the gigaword Hungarian corpus it is based on.
The tools, which include a stemmer, a morphological analyzer and a generator, as well as character-set detection, normalization, and sentence-level tokenization utilities, are described in a paper published at the SALTMIL 2004 workshop (pdf ps) and in the following draft (pdf ps), to be presented at the ACL05 Software Workshop. For now, the system of morphological codes used in the morphological analyzer (pdf ps) and some other low-level aspects of the tools (pdf ps) are described in Hungarian, except for this paper, submitted to Acta Cybernetica (pdf ps).