LDC94T4A - Complete UN Parallel Text corpus
LDC94T4B-1 - English text only
LDC94T4B-2 - French text only
LDC94T4B-3 - Spanish text only
This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York and are drawn from archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names.
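As a rough illustration of the parallel layout described above, the following Python sketch maps a file on one disc to its expected counterpart on another. The mount points in DISC_ROOTS, the function name parallel_path, and the example file name are all hypothetical; the sketch also assumes that the relative path is identical on every disc, which should be checked against the actual discs.

```python
from pathlib import PurePosixPath

# Hypothetical mount points, one per disc; the real locations will differ.
DISC_ROOTS = {
    "english": PurePosixPath("/cdrom/english"),
    "french": PurePosixPath("/cdrom/french"),
    "spanish": PurePosixPath("/cdrom/spanish"),
}


def parallel_path(path, source, target, roots=DISC_ROOTS):
    """Return the path where the `target` translation of `path` is expected.

    Assumes the path relative to the disc root is the same for every
    language. Raises ValueError if `path` is not under the source root.
    """
    relative = PurePosixPath(path).relative_to(roots[source])
    return roots[target] / relative


# Example with a made-up file name:
french = parallel_path("/cdrom/english/1990/doc001.sgm", "english", "french")
print(french)
```

Because not every English file has both a French and a Spanish counterpart, the returned path is only a candidate; the tables supplied on the discs remain the authority on which parallels exist.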
All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. The total content by language is summarized below (values are approximate):

                                 No. of       Millions
Language                         documents    of words
------------------------------------------------------
English                          22,000       59
French                           20,000       58
Spanish                          14,400       48
French/Spanish parallel data     12,700       38
  (per language)
------------------------------------------------------
In preparing the text for publication, we have applied SGML (Standard Generalized Markup Language) tagging that preserves all typographic and meta-information that was present in the UN archival files. For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included, for use with the sed (stream editor) utility, that will filter out the SGML-specific material and meta-information, leaving only the plain text. (Sed is a standard utility on Unix systems, and is also available as free software for MS-based systems.) The character set used is 8-bit ISO 8859-1 (Latin-1), in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
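For readers working without sed, the general idea can be approximated in Python. This is a minimal sketch and not the script supplied on the discs: it decodes the bytes as ISO 8859-1 and removes anything that looks like an SGML tag, but it does not attempt to drop meta-information fields as the supplied script does. The names TAG_RE and strip_sgml and the sample input are assumptions made for illustration.

```python
import re

# Matches anything between "<" and the next ">"; a simplification that
# assumes no literal ">" appears inside a tag.
TAG_RE = re.compile(r"<[^>]*>")


def strip_sgml(raw_bytes):
    """Decode ISO 8859-1 bytes and remove SGML-style tags."""
    text = raw_bytes.decode("iso-8859-1")
    return TAG_RE.sub("", text)


# Made-up sample: byte 0xE9 is "é" in ISO 8859-1.
sample = b"<p>Assembl\xe9e g\xe9n\xe9rale</p>"
print(strip_sgml(sample))
```

Decoding explicitly as ISO 8859-1 matters because the accented letters occupy the upper half of the 8-bit table and would be misread under a UTF-8 default.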
Parallel samples of the three languages in this publication are listed below.
Based on the combined usage of title strings and document numbers, it was possible to identify parallel sets amounting to over 60% of the data in the archive (a total of 56,684 files in 21,986 parallel sets). We have yet to find a reasonable method for doing a more careful search for parallels in the remaining 40%. Part of this residue is due to the fact that this corpus contains only English-based parallel sets; parallel sets that included only French and Spanish versions have not been included in this release.
Users of this corpus must be warned that the parallel sets identified by this automatic method will include errors. We have observed a number of cases (over 700 in the corpus as a whole) where the members of a parallel set show a serious discrepancy in quantity of text. Also, we must expect that at least some of these sets (and perhaps some less obvious cases) constitute a complete mismatch. The reftable files in the tables directory give an indication of the relative consistency among members of a parallel set in terms of overall size. From these tables, the least likely candidates for parallelism can be easily identified.
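One way to act on this size information is sketched below in Python. The format of the reftable files is not described here, so the sketch assumes the sizes have already been read into a dictionary keyed by a set identifier; the function names, the set identifiers, and the 0.5 threshold are all illustrative assumptions rather than values taken from the corpus.

```python
def size_ratio(sizes):
    """Return smallest/largest size among the members of one parallel set."""
    values = list(sizes.values())
    largest = max(values)
    if largest == 0:
        return 1.0  # all members empty, so no discrepancy in size
    return min(values) / largest


def flag_suspect_sets(parallel_sets, threshold=0.5):
    """List (set_id, ratio) pairs whose ratio is below `threshold`.

    The result is sorted so the least consistent sets come first.
    The default threshold is an arbitrary choice for illustration.
    """
    suspects = []
    for set_id, sizes in parallel_sets.items():
        ratio = size_ratio(sizes)
        if ratio < threshold:
            suspects.append((set_id, ratio))
    return sorted(suspects, key=lambda item: item[1])


# Made-up sizes (for example, byte or word counts) for two parallel sets.
example = {
    "set-a": {"english": 1000, "french": 1100},
    "set-b": {"english": 1000, "spanish": 200},
}
print(flag_suspect_sets(example))
```

A simple ratio ignores the fact that translations differ somewhat in length across languages, so a low ratio marks a set for inspection rather than proving a mismatch.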
Portions © 1988-1993 United Nations, © 1994 Trustees of the University of Pennsylvania