Publication title: Hunglish Corpus, version 1.0
Short description: Sentence-aligned Hungarian-English parallel corpus of about 2 million sentence pairs. Additionally provided for Hungarian: monolingual corpus, language resources, morphological toolset and aligner.
Authors:
Dániel Varga, daniel@mokk.bme.hu
László Németh, nemeth@mokk.bme.hu
Péter Halácsy, hp@mokk.bme.hu
András Kornai, kornai@mokk.bme.hu
Attila Vonyó, vonyoa@freemail.hu
Bálint Sass, joker@nytud.hu
Gergely Bottyán, bottyang@corpus.nytud.hu
Tamás Váradi, varadi@nytud.hu
Enikő Héja, eheja@nytud.hu
Ágnes Gyarmati, aagnes@nytud.hu
Ágnes Mészáros, magnes@corpus.nytud.hu
Dávid Labundy
Viktor Trón, v.tron@ed.ac.uk
Viktor Nagy, nagyv@nytud.hu
Contact: Dániel Varga, daniel@mokk.bme.hu
Data type: Parallel text
Data sources: legal text, literature, newspaper, technical manual, movie subtitles
Applications: machine translation, cross-lingual information retrieval
Languages: Hungarian (hun HU L), English (eng GB L, eng US L)
Grant number: IHM-ITEM 2003/76/6/2004, Hungarian Ministry of Informatics and Communications
Copyright:
Parallel data copyright (2005) by
Budapest University of Technology and
License: All software in this collection is licensed under the Creative Commons GNU LGPL License. All other work in this collection (be it text, documentation or other data) is licensed under the Creative Commons Attribution 2.0 License.
Description of the corpus structure and data attributes:
Data Type: Parallel text
Number of files: 41255 under data, 41854 total
Size of the data: 1353 MB
Size of the data after compression: 546 MB (Some of the files are compressed with gzip.)
Total number of words in the parallel corpus: ~54.2 m, in ~2.07 m sentence pairs. Additionally provided Hungarian monolingual corpus: ~46.4 m words in ~3.00 m sentences.
URL to the project page: http://mokk.bme.hu/resources/hunglishcorpus
File formats, character encoding:
Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible from these files. Where copyright considerations made it necessary, the lines of .bi files were shuffled (sorted alphabetically).
Alignment "ladder" (.lad) files preserve the whole of both input texts with ordering, even those segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. Such segments of the source or target text will generally consist of exactly one sentence on both sides, but can also consist of zero, or more than one, sentence. In the latter case, the special separating token " ~~~ " is placed between sentences. The reserved special sentence "<p>" is used as a paragraph delimiter.
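The ladder format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the distributed toolset: the function names `split_segment` and `parse_lad_line` are hypothetical, and the input is assumed to be an already decoded text line with the Hungarian segment in the first column.

```python
from typing import List, Tuple

SENTENCE_SEPARATOR = " ~~~ "  # token placed between sentences inside one segment
PARAGRAPH_MARK = "<p>"        # reserved sentence used as a paragraph delimiter


def split_segment(segment: str) -> List[str]:
    """Split one side of a ladder line into its sentences (possibly none)."""
    if not segment.strip():
        return []  # an empty segment holds zero sentences
    return [s.strip() for s in segment.split(SENTENCE_SEPARATOR)]


def parse_lad_line(line: str) -> Tuple[List[str], List[str]]:
    """Return (Hungarian sentences, English sentences) for one .lad line."""
    fields = line.rstrip("\r\n").split("\t")
    hu = fields[0]
    en = fields[1] if len(fields) > 1 else ""
    return split_segment(hu), split_segment(en)


def is_paragraph_break(sentences: List[str]) -> bool:
    """True if the segment consists only of the paragraph delimiter."""
    return sentences == [PARAGRAPH_MARK]
```

A line whose Hungarian side holds two sentences and whose English side holds one would come back as a two-element list paired with a one-element list, which makes it easy to keep only the one-to-one segments if desired.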
The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded.
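Because the two columns use different encodings, a careful reader may prefer to decode each side separately instead of treating the whole line as ISO Latin-2. The sketch below shows one way to do this; the helper names `decode_pair` and `read_pairs` are hypothetical, and the column order (Hungarian first, English second) is stated above only for .lad files and is an assumption for .bi files.

```python
from typing import Iterator, Tuple


def decode_pair(raw_line: bytes) -> Tuple[str, str]:
    """Decode one tab-separated line: Latin-2 on the left, Latin-1 on the right."""
    hu_bytes, _, en_bytes = raw_line.rstrip(b"\r\n").partition(b"\t")
    # Assumption: first column is Hungarian, second column is English.
    return hu_bytes.decode("iso-8859-2"), en_bytes.decode("iso-8859-1")


def read_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield decoded (Hungarian, English) pairs from a .bi or .lad file."""
    # Binary mode keeps the bytes intact so each column can be decoded on its own.
    with open(path, "rb") as handle:
        for raw_line in handle:
            yield decode_pair(raw_line)
```

The difference only matters for the few characters on which the two encodings disagree: for example, the byte that is "ő" in ISO Latin-2 would be read as "õ" under ISO Latin-1.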
hu and en are the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively.
The contents of each directory:
(Overview. For more details see the index.html files for each of the main directories.)
| directory | size (MB) | contents |
| | 4 | Documentation |
| | 546 | The data files sorted by genre |
| | 34 | Source code for the ancillary software |
| | 168 | Language resources for Hungarian |
| | 14 | Precompiled binaries |
The subdirectories of the data directory are as follows.
Literature. For "classical" material no longer under copyright, the raw files came from Project Gutenberg and the Hungarian Electronic Library. For these, raw (en, hu), sentence pair (bi) and alignment (lad) files are available. For "modern" material still under copyright and made available to MOKK for research purposes, the sentence pair files are all shuffled together in bi/Shuffle, and the other formats are not provided.
Legal texts. The EU_Law_1 to EU_Law_10269 raw files are from CELEX and are reproduced in full under the hu and en subdirectories. EU_Law_0 is the text of the EU Constitution. The sentence pair files are in the bi directory. Full alignments are not provided, due to lack of space.
Software documentation. The raw files come from OpenOffice.org, Mozilla, Gnome, KDE, and other major FOSS (Free Open Source Software) projects. Both bi and lad files are provided.
Movie subtitles. The raw files were provided to MOKK for research purposes only and cannot be republished on this CD. The sentence pair files are given in a "shuffled" version: the lines are sorted alphabetically so as to make republishing of the original subtitle files impossible. This data segment has many spelling errors owing to OCR text extraction. As a special aid for selecting a higher-quality subset, sentence-by-sentence figures of alignment merit are provided in the file quality.
Magazines and news. This material is still largely in preparation at the time the CD goes to press; please visit the Hunglish website for more.
Monolingual Hungarian files taken from the Hungarian National Corpus: parliament and city council minutes, laws, regulations, and other "official" material as well as the archives of the chat rooms of a major Hungarian internet portal, index.hu. Please note that the tokenization of these files differs from the rest of the corpus inasmuch as both sentence-final punctuation and trailing periods after abbreviations are given as whitespace-separated tokens. This explains the discrepancy between the number of words and sentences quoted in the summary below and the number coming from wc.
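The effect of this tokenization on word counts can be illustrated with a short sketch. It assumes that the summary figures leave out punctuation-only tokens, which the note above suggests but does not spell out; the function names and the sample line are hypothetical.

```python
def wc_style_count(line: str) -> int:
    """Count whitespace-separated tokens, as wc -w does."""
    return len(line.split())


def word_count(line: str) -> int:
    """Count only tokens that contain at least one letter or digit."""
    return sum(1 for token in line.split() if any(ch.isalnum() for ch in token))


# Hypothetical line in the monolingual tokenization: the period after the
# abbreviation "stb" and the sentence-final period are separate tokens.
sample = "A kormány stb . döntött ."
print(wc_style_count(sample), word_count(sample))
```

On the sample line, the wc-style count includes the two standalone periods, while the second count does not.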
| directory | size (MB) | words | contents | raw text |
| | 18 | | movie subtitles | no |
| | 233 | | EU law (CELEX) | full |
| | 85 | | literature | when (C) lapsed |
| | 5 | | magazines, news | no |
| | 8 | | software documentation | full |
| | 314 | | monolingual Hungarian | full |
A more detailed inventory is available in the form of a catalog and wc files in each directory, the former providing author and title information (in both languages) and the latter containing the output of wc for each file. A file conf which summarizes our confidence in the alignment is also provided for each directory so that users can select a high-confidence subset of the corpus if they wish.
Notes:
This is a joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (Dániel Varga, Péter Halácsy, András Kornai, László Németh, and Viktor Trón), and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics (Tamás Váradi, Bálint Sass, Gergő Bottyán, Enikő Héja, Ágnes Gyarmati, Ágnes Mészáros and Dávid Labundy).
András Aklán (BUTE) provided effective project management for the production process and Mike Maxwell (LDC) advised us on the structure of the CD and found many bugs. We thank Magyar Telekom Rt. for infrastructure support.