LDC94T4A - Complete UN Parallel Text corpus
LDC94T4B-1 - English text only
LDC94T4B-2 - French text only
LDC94T4B-3 - Spanish text only
This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York and are drawn from archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names.
All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. Due to the nature and organization of UN translation services and the original electronic text archives, the process of finding and sorting out parallel documents yielded numerous gaps, with many files in each language having no parallel in other languages.
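Because the discs share a parallel directory structure, locating a translation amounts to swapping the root directory and keeping the rest of the path. The following Python sketch illustrates the idea. The mount points, the file name, and the assumption that the relative path is identical on every disc are hypothetical; the tables supplied on the discs remain the authoritative record of which parallels exist.

```python
from pathlib import Path


def parallel_path(english_file, english_root, target_root):
    """Map a file on the English disc to the same relative path under another root.

    Assumes (hypothetically) that relative paths and file names are
    identical across the language discs.
    """
    relative = Path(english_file).relative_to(english_root)
    return Path(target_root) / relative


def existing_parallels(english_file, english_root, target_roots):
    """Return {language: path} for the parallels that are actually present.

    target_roots maps a language label to the root directory of its disc.
    Missing files are skipped, since the corpus is known to have gaps.
    """
    found = {}
    for language, root in target_roots.items():
        candidate = parallel_path(english_file, english_root, root)
        if candidate.is_file():
            found[language] = candidate
    return found


if __name__ == "__main__":
    # Hypothetical mount points and file name, for illustration only.
    roots = {"french": "/mnt/un_fr", "spanish": "/mnt/un_es"}
    print(existing_parallels("/mnt/un_en/1990/doc0001.sgm", "/mnt/un_en", roots))
```

Checking for the file on disk, rather than assuming it is there, reflects the gaps described above: a given English file may have a French parallel, a Spanish parallel, or both.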
In preparing the text for publication, we have applied a fully compliant SGML (Standard Generalized Markup Language) format. For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included that can be used to filter out the SGML-specific material and leave only the plain text. The character set used is the 8-bit ISO 8859-1 (Latin-1), in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
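For readers working outside the supplied script, the two steps described here (decoding ISO 8859-1 bytes and discarding markup) can be approximated in a few lines of Python. This is a rough sketch, not the script shipped on the discs: it simply deletes anything between angle brackets, does not consult the DTD, and does not expand SGML entities. The `<p>` tag in the example is a placeholder, not necessarily an element of the corpus DTD.

```python
import re

# Matches any run of text enclosed in angle brackets. This is a crude
# approximation of SGML markup and ignores entities and marked sections.
TAG_PATTERN = re.compile(r"<[^>]*>")


def sgml_bytes_to_text(raw):
    """Decode ISO 8859-1 bytes and strip angle-bracket markup."""
    text = raw.decode("latin-1")  # every byte value maps to one character
    return TAG_PATTERN.sub("", text)


if __name__ == "__main__":
    # Byte 0xE9 lies in the upper half of the table and decodes to an accented e.
    sample = b"<p>Soci\xe9t\xe9 des Nations</p>"
    print(sgml_bytes_to_text(sample))
```

Decoding before stripping keeps the accented characters intact; reading the files as ASCII or UTF-8 would fail on, or corrupt, the bytes in the upper 128 positions.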
Portions © 1988-1993 United Nations, © 1994 Trustees of the University of Pennsylvania