Publication title: Hunglish Corpus, version 1.0

 

Short description: Sentence-aligned Hungarian-English parallel corpus of about 2 million sentence pairs. Additionally provided for Hungarian: monolingual corpus, language resources, morphological toolset and aligner.

 

Authors:

 

Dániel Varga, daniel@mokk.bme.hu

László Németh, nemeth@mokk.bme.hu

Péter Halácsy, hp@mokk.bme.hu

András Kornai, kornai@mokk.bme.hu

Attila Vonyó, vonyoa@freemail.hu

Bálint Sass, joker@nytud.hu

Gergely Bottyán, bottyang@corpus.nytud.hu

Tamás Váradi, varadi@nytud.hu

Enikő Héja, eheja@nytud.hu

Ágnes Gyarmati, aagnes@nytud.hu

Ágnes Mészáros, magnes@corpus.nytud.hu

Dávid Labundy

Viktor Trón, v.tron@ed.ac.uk

Viktor Nagy, nagyv@nytud.hu

 

Contact: Dániel Varga, daniel@mokk.bme.hu

 

Data type: Parallel text

 

Data sources: legal text, literature, newspaper, technical manual, movie subtitles

 

Applications: machine translation, cross-lingual information retrieval

 

Languages: Hungarian (hun HU L), English (eng GB L, eng US L)

 

Grant number: IHM-ITEM 2003/76/6/2004, Hungarian Ministry of Informatics and Communications

 

Copyright:

 

Parallel data copyright (2005) by Budapest University of Technology and Economics, Hungary. Monolingual corpus copyright (2005) by the Hungarian Academy of Sciences Institute of Linguistics. Papers and documentation copyright (2005) by their authors.

 

License: All software in this collection is licensed under the Creative Commons GNU LGPL License. All other work in this collection (be it text, documentation or other data) is licensed under the Creative Commons Attribution 2.0 License.

 

Description of the corpus structure and data attributes:

 

Data Type: Parallel text

Number of files: 41255 under data, 41854 total

Size of the data: 1353 MB

Size of the data after compression: 546 MB (Some of the files are compressed with gzip.)

 

Total number of words in the parallel corpus: ~54.2 m, in ~2.07 m sentence pairs. Additionally provided Hungarian monolingual corpus: ~46.4 m words in ~3.00 m sentences.

 

URL to the project page: http://mokk.bme.hu/resources/hunglishcorpus

 

 

File formats, character encoding:

 

Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible from these files. Where copyright considerations made it necessary, the lines of .bi files were shuffled (sorted alphabetically).
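
As an illustration of the format described above, here is a minimal sketch of reading sentence pairs from already-decoded .bi lines. The function name parse_bi_lines is hypothetical. The Hungarian-first column order is an assumption carried over from the description of the .lad files, and skipping lines that lack exactly two columns is a defensive choice, not documented behavior.

```python
from typing import Iterable, Iterator, Tuple


def parse_bi_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (hungarian, english) pairs from tab-separated .bi lines.

    Assumption: column 1 is Hungarian and column 2 is English,
    as in the .lad files.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        columns = line.split("\t")
        if len(columns) != 2:
            continue  # defensive: skip lines that are not a clean pair
        yield columns[0], columns[1]
```

Because shuffled .bi files are sorted alphabetically, the order of the pairs yielded from them carries no information about the original documents.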

 

Alignment "ladder" (.lad) files preserve the whole of both input texts with ordering, even those segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. Such segments of the source or target text will generally consist of exactly one sentence on both sides, but can also consist of zero, or more than one, sentence. In the latter case, the special separating token " ~~~ " is placed between sentences. The reserved special sentence "<p>" is used as a paragraph delimiter.
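
The ladder format can be read along the following lines. This is a sketch under stated assumptions, not the aligner's own code: the names split_segment, parse_lad_line and is_paragraph_break are hypothetical, the input line is assumed to be already decoded, and any column beyond the first two is ignored as a precaution.

```python
from typing import List, Tuple

SENTENCE_SEP = " ~~~ "   # separator between sentences inside one segment
PARAGRAPH_MARK = "<p>"   # reserved sentence used as a paragraph delimiter


def split_segment(segment: str) -> List[str]:
    """Split one side of a ladder line into zero or more sentences."""
    segment = segment.strip()
    if not segment:
        return []  # zero sentences on this side
    return [sentence.strip() for sentence in segment.split(SENTENCE_SEP)]


def parse_lad_line(line: str) -> Tuple[List[str], List[str]]:
    """Return (hungarian_sentences, english_sentences) for one .lad line."""
    columns = line.rstrip("\r\n").split("\t")
    hungarian = columns[0]
    english = columns[1] if len(columns) > 1 else ""
    return split_segment(hungarian), split_segment(english)


def is_paragraph_break(sentences: List[str]) -> bool:
    """True if the segment consists only of the paragraph delimiter."""
    return sentences == [PARAGRAPH_MARK]
```

A one-to-one line yields one sentence on each side. A line with an empty Hungarian column yields an empty list for that side, which corresponds to an English segment left without a counterpart.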

 

The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded.
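
For processing rather than viewing, each column can be decoded with its own encoding. The sketch below assumes the Hungarian column comes first (as stated for the .lad files) and relies on the tab byte being identical in both encodings. The function name decode_pair is hypothetical.

```python
from typing import Tuple


def decode_pair(raw_line: bytes) -> Tuple[str, str]:
    """Decode one raw line: ISO Latin-2 on the Hungarian side,
    ISO Latin-1 on the English side.

    Assumption: the Hungarian column precedes the English column.
    """
    # Split on the first tab byte before decoding, so that each side
    # is interpreted with its own character set.
    hu_raw, _, en_raw = raw_line.rstrip(b"\r\n").partition(b"\t")
    return hu_raw.decode("iso-8859-2"), en_raw.decode("iso-8859-1")
```

The two encodings differ, for example, at byte 0xF5, which is "ő" in ISO Latin-2 but "õ" in ISO Latin-1. Files should therefore be opened in binary mode and decoded line by line instead of being read with a single text encoding.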

 

The hu and en files are the raw input texts, in ISO Latin-2 and ISO Latin-1 encoding respectively.

 

 

The contents of each directory:

 

(Overview. For more details see the index.html files for each of the main directories.)

 

directory   size (MB)   contents
doc         4           Documentation
data        546         The data files sorted by genre
src         34          Source code for the ancillary software
lr          168         Language resources for Hungarian
bin         14          Precompiled binaries

 

The subdirectories of the data directory are as follows.

lit

Literature. For "classical" material no longer under copyright, the raw files came from Project Gutenberg and the Hungarian Electronic Library. For these, raw (en, hu), sentence pair (bi) and alignment (lad) files are available. For "modern" material still under copyright and made available to MOKK for research purposes, the sentence pair files are all shuffled together in bi/Shuffle, and the other formats are not provided.

law

Legal texts. The EU_Law_1 to EU_Law_10269 raw files are from CELEX and are reproduced in full under the hu and en subdirectories. EU_Law_0 is the text of the EU Constitution. The sentence pair files are in the bi directory. Full alignments are not provided, due to lack of space.

swdoc

Software documentation. The raw files come from OpenOffice.org, Mozilla, Gnome, KDE, and other major FOSS (Free and Open Source Software) projects. Both bi and lad files are provided.

film

Movie subtitles. The raw files were provided to MOKK for research purposes only and cannot be republished on this CD. The sentence pair files are given in a "shuffled" version: the lines are sorted alphabetically so as to make republishing of the original subtitle files impossible. This data segment has many spelling errors owing to OCR text extraction. As an aid for selecting a higher-quality subset, sentence-by-sentence figures of alignment merit are provided in the file quality.

mag

Magazines and news. This material is still largely in preparation at the time the CD goes to press. Please visit the Hunglish website for more.

mono

Monolingual Hungarian files taken from the Hungarian National Corpus: parliament and city council minutes, laws, regulations, and other "official" material as well as the archives of the chat rooms of a major Hungarian internet portal, index.hu. Please note that the tokenization of these files differs from the rest of the corpus inasmuch as both sentence-final punctuation and trailing periods after abbreviations are given as whitespace-separated tokens. This explains the discrepancy between the number of words and sentences quoted in the summary below and the number coming from wc.

Summary

directory   size (MB)   words     contents                 raw text
film        18          3.27 m    movie subtitles          no
law         233         31.53 m   EU law (CELEX)           full
lit         85          17.24 m   literature               when copyright lapsed
mag         5           0.36 m    magazines, news          no
swdoc       8           1.27 m    software documentation   full
mono        314         46.4 m    monolingual Hungarian    full

A more detailed inventory is available in the form of a catalog file and a wc file in each directory. The former provides author and title information (in both languages), and the latter contains the output of wc for each file. A file named conf, which summarizes our confidence in the alignment, is also provided for each directory so that users can select a high-confidence subset of the corpus if they wish.

Notes:

This is a joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (Dániel Varga, Péter Halácsy, András Kornai, László Németh, and Viktor Trón), and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics (Tamás Váradi, Bálint Sass, Gergő Bottyán, Enikő Héja, Ágnes Gyarmati, Ágnes Mészáros and Dávid Labundy).

András Aklán (BUTE) provided effective project management for the production process and Mike Maxwell (LDC) advised us on the structure of the CD and found many bugs. We thank Magyar Telekom Rt. for infrastructure support.