UN Parallel Text (Complete)
|Item Name:||UN Parallel Text (Complete)|
|LDC Catalog No.:||LDC94T4A|
|Data Source(s):||government documents|
|Language(s):||French, English, Spanish|
|Language ID(s):||fra, eng, spa|
UN Parallel Text Agreement
|Online Documentation:||LDC94T4A Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Graff, David. UN Parallel Text (Complete) LDC94T4A. Web Download. Philadelphia: Linguistic Data Consortium, 1994.|
UN Parallel Text (Complete) contains English, French and Spanish official documents provided to the Linguistic Data Consortium (LDC) by the United Nations (UN) for use in research on machine translation technology. The documents are from archives maintained by the UN Office of Conference Services in New York and span the period 1988-1993.
The following individual releases by language are also available from LDC:
LDC94T4B-1 UN Parallel Text (English)
LDC94T4B-2 UN Parallel Text (French)
LDC94T4B-3 UN Parallel Text (Spanish)
All parallel files in this corpus are English-based: for every file in the English directory, there is a corresponding file in either the French or Spanish directory, or both. Tables are included to assist in determining which parallels are present. Similarly, the documents are arranged in a parallel directory structure for each language so that corresponding translations of a document are found directly by means of the directory paths and file names.
The total content by number of words (millions) per language is summarized below (values are approximate):
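The path-based lookup described above can be sketched as follows. This is a minimal illustration, not part of the release: the directory names `english`, `french` and `spanish` and the function `find_parallels` are assumptions made for the example, and the actual directory names should be taken from the corpus documentation.

```python
from pathlib import Path

# Hypothetical directory names; the real ones are given in the corpus documentation.
SOURCE_DIR = "english"
TARGET_DIRS = ("french", "spanish")


def find_parallels(root, english_file):
    """Return {language: path} for the translations of english_file that exist.

    Relies only on the parallel layout: the same relative path and file name
    under each language directory.
    """
    root = Path(root)
    # Path of the document relative to the English directory.
    rel = Path(english_file).relative_to(root / SOURCE_DIR)
    found = {}
    for lang in TARGET_DIRS:
        candidate = root / lang / rel
        if candidate.is_file():
            found[lang] = candidate
    return found
```

Because a given English file may have a French parallel, a Spanish parallel, or both, the function returns only the translations that are actually present on disk.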
French/Spanish parallel data: 12,70038 (per language)
An SGML (Standard Generalized Markup Language) tagging structure was applied to the text. It preserves all typographic and meta-information present in the UN archival files. For users working with SGML, a working DTD (Document Type Definition) is provided. For users who do not need the markup, a simple script for the sed (stream editor) utility is included; it filters out SGML-specific material and meta-information, leaving only the plain text.
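For readers who prefer not to use sed, the general idea of tag filtering can be sketched in Python. This is not the script shipped with the corpus and makes no claim to reproduce its behavior: it simply assumes that markup appears as `<...>` tags and removes them, and it does not handle SGML entities or any meta-information stored as element content. The function name `strip_sgml` is hypothetical.

```python
import re

# Assumption: markup consists of angle-bracket tags such as <p> or </doc>.
TAG_RE = re.compile(r"<[^>]*>")


def strip_sgml(text):
    """Remove angle-bracket tags and drop lines left empty by the removal."""
    no_tags = TAG_RE.sub("", text)
    lines = (line.strip() for line in no_tags.splitlines())
    return "\n".join(line for line in lines if line)
```

The supplied sed script remains the reference for producing plain text, since it also removes meta-information that a simple tag filter would leave in place.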
The character set is 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
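Since many current tools expect UTF-8, the files may need to be decoded explicitly as ISO 8859-1. The following sketch shows one way to do this in Python; the function names are illustrative and not part of the release.

```python
def latin1_to_utf8(raw):
    """Re-encode ISO 8859-1 bytes as UTF-8 bytes."""
    # Every byte value is a valid ISO 8859-1 character, so decoding cannot fail.
    return raw.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path, dst_path):
    """Read an ISO 8859-1 file and write a UTF-8 copy."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(latin1_to_utf8(data))
```

For example, the single byte `0xE7` (the letter ç in ISO 8859-1) becomes the two-byte UTF-8 sequence `0xC3 0xA7`.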
Parallel samples of the three languages in this publication are listed below.
Based on the combined usage of title strings and document numbers, parallel sets amounting to over 60% of the data in the archive (a total of 56,684 files in 21,986 parallel sets) were identified. Parallel sets in the remaining 40% were not identified, due in part to the fact that this data set contains only English-based parallel sets. Parallel sets that include only French and Spanish versions are not part of this release.
Parallel sets identified by this automatic method include errors. In over 700 cases in the corpus as a whole, the members of a parallel set were observed to show a serious discrepancy in quantity of text. Some of these sets (and perhaps some less obvious cases) constitute a complete mismatch. The reftable files in the tables directory provide an indication of the relative consistency among members of a parallel set in terms of overall size. From these tables, the least likely candidates for parallelism can be identified.
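The kind of size-consistency check described above can be sketched as follows. The format of the reftable files is not reproduced here, so the example assumes the sizes have already been loaded into a dictionary mapping a set identifier to per-language sizes; the function names and the threshold of 2.0 are illustrative assumptions, not values taken from the corpus.

```python
def size_ratio(sizes):
    """Ratio of the largest to the smallest member size in one parallel set."""
    smallest, largest = min(sizes), max(sizes)
    if smallest == 0:
        return float("inf")  # an empty member is always suspect
    return largest / smallest


def flag_suspect_sets(parallel_sets, threshold=2.0):
    """Return the ids of sets whose members differ in size by more than threshold.

    parallel_sets maps a set id to {language: size}; the threshold is an
    arbitrary illustrative value.
    """
    return [
        set_id
        for set_id, sizes in parallel_sets.items()
        if len(sizes) > 1 and size_ratio(list(sizes.values())) > threshold
    ]
```

Translations of the same document normally differ somewhat in length, so a ratio somewhat above 1 is expected; only large ratios suggest a mismatch.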