Hungarian-English Parallel Text, Version 1.0

Item Name: Hungarian-English Parallel Text, Version 1.0
Author(s): Dániel Varga, László Németh, Péter Halácsy, András Kornai, et al.
LDC Catalog No.: LDC2008T01
ISBN: 1-58563-461-1
ISLRN: 694-868-944-045-4
Release Date: January 22, 2008
Member Year(s): 2008
DCMI Type(s): Text
Data Source(s): web collection, varied, newswire, news magazine
Application(s): cross-lingual information retrieval, machine translation
Language(s): Hungarian
Language ID(s): hun
License(s): Hungarian-English Parallel Text, Version 1 Agreement
Online Documentation: LDC2008T01 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Varga, Dániel, et al. Hungarian-English Parallel Text, Version 1.0 LDC2008T01. Web Download. Philadelphia: Linguistic Data Consortium, 2008.

Introduction

Hungarian-English Parallel Text, Version 1.0 (also known as the "Hunglish Corpus") is a sentence-aligned Hungarian-English parallel corpus consisting of approximately two million sentence pairs. The corpus contains additional language resources for the Hungarian text, including a monolingual corpus, morphological toolset and aligner.

Hungarian-English Parallel Text, Version 1.0 is a joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (BUTE) and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics.

Additional information about this release is available from the corpus website maintained by BUTE.

File formats, character encoding

This publication is issued on CD as a tarred zip file. Commonly available utilities such as Gnu Zip or Stuffit will readily extract this publication from its compressed form.

Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. The .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible. Some .bi files were shuffled (sorted alphabetically).

Alignment "ladder" (.lad) files preserve the whole of both input texts with ordering, even those segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. Such segments of the source or target text will generally consist of exactly one sentence on both sides, but can also consist of zero, or more than one, sentence. In the latter case, the special separating token " ~~~ " is placed between sentences.

The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded.

hu and en are the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively.

Samples

For an example of the data contained in this corpus, please examine this sample screen capture of bilingual text.

Available Media

View Fees

Member
Non-Member
Reduced-License
Extra Copy
Login for the applicable fee