Publication title: Hunglish Corpus, Version 1.0

Short description: Sentence-aligned Hungarian-English parallel corpus of about 2 million sentence pairs. Additionally provided for Hungarian: monolingual corpus, language resources, morphological toolset and aligner.

Authors:

Dániel Varga, daniel@mokk.bme.hu
László Németh, nemeth@mokk.bme.hu
Péter Halácsy, hp@mokk.bme.hu
András Kornai, kornai@mokk.bme.hu
Attila Vonyó
Bálint Sass, joker@nytud.hu
Tamás Váradi, joker@nytud.hu
Gergely Bottyán, joker@nytud.hu
Enikő Héja, eheja@nytud.hu
Ágnes Gyarmati
Ágnes Mészáros, magnes@corpus.nytud.hu
Dávid Labundy
Viktor Trón, v.tron@ed.ac.uk
Viktor Nagy, nagyv@nytud.hu

Contact: Dániel Varga, daniel@mokk.bme.hu

Data type: Parallel text

Data sources: legal text, literature, newspaper, technical manual, movie subtitles

Applications: machine translation, cross-lingual information retrieval

Languages: Hungarian (hun HU L), English (eng GB L, eng US L)

Grant number: IHM-ITEM 2003/76/6/2004, Hungarian Ministry of Informatics and Communications

Copyright:

Parallel data copyright (2005) by Budapest University of Technology and Economics, Hungary.
Monolingual corpus copyright (2005) by the Hungarian Academy of Sciences Institute of Linguistics.
Papers and documentation copyright (2005) by their authors.

License: All software in this collection is licensed under the Creative Commons GNU LGPL License. All other work in this collection (be it text, documentation or other data) is licensed under the Creative Commons Attribution 2.0 License.

Description of the corpus structure and data attributes:

Data Type: Parallel text
Number of files: 41255 under data, 41854 total
Size of the data: 1353 MB
Size of the data after compression: 427 MB (Some of the files are compressed with gzip.)

Total number of words in the parallel corpus: ~54.2 m, in ~2.07 m sentence pairs. Additionally provided Hungarian monolingual corpus: ~46.4 m words in ~3.00 m sentences.

URL to the project page: http://mokk.bme.hu/resources/hunglishcorpus


File formats, character encoding:

Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible from these files. Where copyright considerations made it necessary, the lines of .bi files were shuffled (sorted alphabetically).
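As an illustration of the .bi layout described above, the following is a minimal Python sketch for splitting one line into its two sentences. The function name parse_bi_line is hypothetical, and the column order (Hungarian first, English second) is an assumption carried over from the .lad description; it is not stated here for .bi files.

```python
def parse_bi_line(line):
    """Split one already-decoded .bi line into a sentence pair.

    Assumption: the Hungarian sentence comes first and the English
    sentence second, separated by a single tab character.
    """
    text = line.rstrip("\r\n")          # drop the line terminator only
    hu, sep, en = text.partition("\t")  # split at the first tab
    if not sep:
        raise ValueError("expected a tab between the two sentences")
    return hu, en


if __name__ == "__main__":
    print(parse_bi_line("Jó napot.\tGood day.\n"))
```

Because the .bi files are filtered and, in some directories, sorted alphabetically, the pairs obtained this way should be treated as independent items rather than as running text.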

Alignment "ladder" (.lad) files preserve the whole of both input texts in their original order, including those segments that were not successfully aligned. In .lad files, every line is split by a tab into two columns. The first is a segment of the Hungarian text. The second is the (supposedly corresponding) segment of the English text. A segment will generally consist of exactly one sentence on each side, but it can also consist of zero sentences or of more than one. When a segment contains more than one sentence, the special separating token " ~~~ " is placed between the sentences. The reserved special sentence "<p>" is used as a paragraph delimiter.
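The .lad conventions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a segment with zero sentences is assumed to appear as an empty column, and the line is assumed to be already decoded to text.

```python
SENTENCE_SEPARATOR = " ~~~ "  # token placed between sentences of one segment
PARAGRAPH_MARK = "<p>"        # reserved sentence marking a paragraph boundary


def split_segment(segment):
    """Return the list of sentences in one column of a .lad line.

    Assumption: a segment holding zero sentences is an empty string.
    """
    if segment == "":
        return []
    return segment.split(SENTENCE_SEPARATOR)


def parse_lad_line(line):
    """Split one .lad line into (Hungarian sentences, English sentences)."""
    text = line.rstrip("\r\n")  # keep tabs, drop only the line terminator
    hu_segment, sep, en_segment = text.partition("\t")
    if not sep:
        raise ValueError("expected two tab-separated columns")
    return split_segment(hu_segment), split_segment(en_segment)


def is_paragraph_break(sentences):
    """True if a column consists solely of the paragraph delimiter."""
    return sentences == [PARAGRAPH_MARK]


if __name__ == "__main__":
    print(parse_lad_line("Első. ~~~ Második.\tFirst and second.\n"))
```

Keeping the sentence lists separate for the two sides makes it easy to pick out the one-to-one segments, for example by checking that both lists have length one.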

The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded.
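For users who want each side decoded exactly rather than approximately, one possible approach is to split the raw bytes at the tab (which has the same byte value in both encodings) and decode each column separately. The sketch below assumes the Hungarian column comes first, as in the .lad files; decode_pair is a hypothetical helper name.

```python
def decode_pair(raw_line):
    """Decode one raw line of a .bi or .lad file column by column.

    Assumption: first column is Hungarian (ISO Latin-2), second column
    is English (ISO Latin-1).
    """
    body = raw_line.rstrip(b"\r\n")
    hu_bytes, _, en_bytes = body.partition(b"\t")
    return hu_bytes.decode("iso-8859-2"), en_bytes.decode("iso-8859-1")


if __name__ == "__main__":
    # Byte 0xF5 is "ő" in ISO Latin-2 but "õ" in ISO Latin-1, so the
    # choice of codec matters for a few Hungarian letters.
    raw = "Erdő".encode("iso-8859-2") + b"\t" + "café".encode("iso-8859-1") + b"\n"
    print(decode_pair(raw))
```

Files should be opened in binary mode (for example with open(path, "rb")) so that the lines reach this function as undecoded bytes.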

hu and en are the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively.


The contents of each directory:

(Overview. For more details see the index.html files for each of the main directories.)

doc   Documentation
data  The data files sorted by genre
src   Source code for the ancillary software
lr    Language resources for Hungarian
bin   Precompiled binaries

The subdirectories of the data directory are as follows.

lit
Literature. For "classical" material no longer under copyright, the raw files came from Project Gutenberg and the Hungarian Electronic Library. For these, raw (en, hu), sentence pair (bi) and alignment (lad) files are available. For "modern" material still under copyright and made available to MOKK for research purposes, the sentence pair files are all shuffled together in bi/Shuffle, and the other formats are not provided.

law
Legal texts. The EU_Law_1 to EU_Law_10269 raw files are from CELEX and are reproduced in full under the hu and en subdirectories. EU_Law_0 is the text of the EU Constitution. The sentence pair files are in the bi directory. Full alignments are not provided, due to lack of space.

swdoc
Software documentation. The raw files come from OpenOffice.org, Mozilla, Gnome, KDE, and other major FOSS (Free Open Source Software) projects. Both bi and lad files are provided.

film
Movie subtitles. The raw files were provided to MOKK for research purposes only and cannot be republished on this CD. The sentence pair files are given in "shuffled" version: the lines are sorted alphabetically so as to make reconstruction and republishing of the original subtitle files impossible. This data segment has many spelling errors owing to OCR text extraction. As a special aid for selecting a higher-quality subset, sentence-by-sentence figures of alignment merit are provided in the file quality.

mag
Magazines and news. This material is still largely in preparation at the time the CD goes to press; please visit the Hunglish website for more.

mono
Monolingual Hungarian files taken from the Hungarian National Corpus: parliament and city council minutes, laws, regulations, and other "official" material as well as the archives of the chat rooms of a major Hungarian internet portal, index.hu. Please note that the tokenization of these files differs from the rest of the corpus inasmuch as both sentence-final punctuation and trailing periods after abbreviations are given as whitespace-separated tokens. This explains the discrepancy between the number of words and sentences quoted in the summary below and the number coming from wc.

Summary
directory  size (MB)  words    contents                raw text
film       18         3.27 m   movie subtitles         no
law        233        31.53 m  EU law (CELEX)          full
lit        85         17.24 m  literature              when (C) lapsed
mag        5          0.36 m   magazines, news         no
swdoc      8          1.27 m   software documentation  full
mono       314        46.4 m   monolingual Hungarian   full

A more detailed inventory is available in the form of a catalog and wc files in each directory, the former providing author and title information (in both languages) and the latter containing the output of wc for each file. A file conf which summarizes our confidence in the alignment is also provided for each directory so that users can select a high-confidence subset of the corpus if they wish.


Notes:

This is a joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (Dániel Varga, Péter Halácsy, András Kornai, László Németh, and Viktor Trón), and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics (Tamás Váradi, Bálint Sass, Gergő Bottyán, Enikő Héja, Ágnes Gyarmati, Ágnes Mészáros and Dávid Labundy).

András Aklán (BUTE) provided effective project management for the production process and Mike Maxwell (LDC) advised us on the structure of the CD and found many bugs. We thank Magyar Telekom Rt. for infrastructure support.