Overview of the UN Parallel Text Corpus


Version 1.0

by David Graff
Linguistic Data Consortium
University of Pennsylvania

March 3, 1994

This text provides background information on the collection and preparation of the UN Parallel Text data, and describes the directory structure used to organize the corpus.

The text files published in this corpus were provided to the LDC by the United Nations in New York, for use by the research community in developing machine translation technology. This material has been drawn from the UN's electronic text archives, covering the period from 1988 through part of 1993. The release of UN text data for research purposes was initially pursued by Dragon Systems, Inc., of Newton, MA, who obtained the first archival tapes from the UN in the fall of 1993. Shortly thereafter, the task of extracting the text data from the UN's archival format and organizing it for publication was taken on by the LDC.

The total pool of archival data from which the parallel corpus was drawn consisted of about 2.5 gigabytes of text in English, French and Spanish. The data were divided among some 92,000 files, which were arranged on 40 archival volumes (13 or 14 volumes per language). The file naming conventions, directory structure and partitioning into volumes did not provide any direct indication of parallel relations among the files in the three languages. The identification of parallel sets has been based entirely on inspection of the contents of the files.

Due to the quantities involved, it was necessary to find some means of identifying parallels automatically, with minimal recourse to manual inspection. A largely automatic method was developed, based on a general practice among UN typists of providing a uniform symbolic title string for each document in a special header block of each text file. This symbolic title, which was usually located at a fixed position in the file, was intended to be identical in all translations of the associated document. Another convention involved the assignment of a unique identifying number to each translation of a document, entered at another fixed position in each file; these numbers were generally assigned to translations in a fixed sequence: English first, French second, Spanish fourth. (Presumably, a Russian translation was third, but we have not yet received the Russian language archives.) Based on the combined use of title strings and document numbers, it was possible to identify parallel sets amounting to over 60% of the data in the archive (a total of 56,684 files in 21,986 parallel sets). We have yet to find a reasonable method for doing a more careful search for parallels in the remaining 40%. Part of this residue reflects the fact that this corpus contains only English-based parallel sets; sets consisting solely of French and Spanish versions have not been included in this release.
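
As a rough illustration (not the actual program used), the following Python sketch shows the kind of grouping this method involves. The numbering offsets follow the sequence described above (French = English + 1, Spanish = English + 3), but the record layout, field names, and function name are hypothetical:

	import collections

	def find_parallel_sets(records):
	    # records: iterable of (title, docnum, lang, filename) tuples,
	    # one per file, drawn from the header fields described above
	    by_title = collections.defaultdict(dict)
	    for title, docnum, lang, filename in records:
	        by_title[title][lang] = (docnum, filename)
	    parallel_sets = []
	    for title, members in by_title.items():
	        if set(members) != {"eng", "fre", "spa"}:
	            continue    # only English-based triples are of interest
	        eng_no = members["eng"][0]
	        # translations were numbered in a fixed sequence; the gap at
	        # +2 presumably belongs to the unreceived Russian translation
	        if (members["fre"][0] == eng_no + 1 and
	                members["spa"][0] == eng_no + 3):
	            parallel_sets.append(dict(
	                (lang, fn) for lang, (num, fn) in members.items()))
	    return parallel_sets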

Users of this corpus must be warned that the parallel sets identified by this automatic method will include errors. We have observed a number of cases (over 700 in the corpus as a whole) where the members of a parallel set show a serious discrepancy in quantity of text. Also, we must expect that at least some of these sets (and perhaps some less obvious cases) constitute a complete mismatch. The "reftable" files in the "tables" directory give an indication of the relative consistency among the members of each parallel set in terms of overall size. From these tables, the least likely candidates for parallelism can be easily identified.
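
As a rough guide to spotting such discrepancies, the following sketch compares file sizes within a set; the 2:1 threshold is an arbitrary value chosen for illustration, not the criterion used in building the reftables:

	import os

	def size_ratio(paths):
	    # ratio of the largest member of a set to the smallest
	    sizes = [os.path.getsize(p) for p in paths]
	    return max(sizes) / min(sizes)

	def flag_suspect_sets(parallel_sets, threshold=2.0):
	    # each parallel set here maps a language code to a file path
	    return [s for s in parallel_sets
	            if size_ratio(s.values()) >= threshold]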

The reasons for these problems in identifying parallels involve the fact that the fixed-length, fixed-position fields within the original UN files, where the title strings and document numbers were stored, were part of a non-printing header structure. These fields were probably filled out (perhaps hurriedly) when the file was first created, and were never subject to further inspection or correction, whereas the printed portion of each file received very careful scrutiny from UN editors. This also helps to explain the rather large residue of files for which parallels were not found: the conventions for filling in these "hidden" fixed fields were not strictly enforced and were not checked by UN personnel for correctness.

Prior to conducting the search for parallel sets, the data had to be transformed from the format provided by the UN into a format that would be generally accessible to researchers. The data were received on 1/4-inch tape cartridges in Wang BACKUP format, and the text files were stored in the proprietary word-processing format of Wang "WP". Specialized programs were written to extract the files from the backup format, and then to extract the text from the Wang WP files. The text was transliterated from a Wang-specific character set into ISO 8859-1 (Latin-1), an 8-bit character set in which accented letters and some other specialized characters are encoded using byte values between 160 and 255 (0xa0 and 0xff). In addition, as much of the layout formatting information as possible has been retained as SGML markup in the extracted text data. The resulting files have been carefully checked for SGML parsability, and a functioning SGML DTD (document type definition) has been included in this directory. (Other files in this directory provide full details on the SGML markup employed, as well as information on how to remove the markup from the data.)
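
For users who simply want the plain text, a crude approximation to markup removal can be had with a regular expression, as in the sketch below; this is only an approximation (it ignores entity references, for instance), and the documentation mentioned above describes the recommended procedure:

	import re

	def strip_sgml(text):
	    # delete anything between angle brackets; entity references
	    # are left untouched by this crude approach
	    return re.sub(r"<[^>]*>", "", text)

	with open("89_00031.eng", encoding="iso-8859-1") as f:
	    plain = strip_sgml(f.read())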

Another practice of UN typists that required special attention was the splitting of very large documents into two or more Wang WP files. Such file sets had to be identified and recombined in proper sequence wherever possible, in order to produce valid parallel sets for these documents. The main problem posed by multi-file documents was that the dividing points tended to differ from one translation to the next within a parallel set.

The process of identifying parallels involved both an automatic and a manual stage. The automatic stage was done first, and identified the vast majority of the sets published here. The manual stage involved scanning the residue from the automatic process to look for candidate sets on the basis of document numbers alone. A specialized X Window application was developed to allow rapid visual inspection of the candidate sets; the operator then decided to accept or reject each candidate, based either on the title string alone or on closer inspection of the actual text contents. In many cases, minor typographic differences in the title fields (e.g. dashes in one string corresponding to periods in another) were sufficient evidence for accepting a set as parallel. We sometimes found very perplexing candidate sets, however, and discovered that the UN's document numbering system was not foolproof: sometimes the same number was found on very different documents in different languages; there may be some cases of valid parallel sets in which the typical sequencing of UN numbers does not occur; and of course, many documents in the residue portion were found to have no number assigned (and no distinctive title either).
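
The following sketch suggests the sort of normalization that renders such minor typographic differences irrelevant when comparing title strings; the particular rules and the sample strings shown are hypothetical:

	import re

	def normalize_title(title):
	    t = title.upper()
	    t = re.sub(r"[-.]", " ", t)    # treat dashes and periods alike
	    t = re.sub(r"\s+", " ", t)     # collapse runs of whitespace
	    return t.strip()

	# two titles that differ only typographically now compare equal
	assert normalize_title("A/CONF.144-28") == normalize_title("A/CONF-144.28")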

The organization of the parallel corpus was set up as follows. The archives were segregated according to year (a two-digit year code was part of the UN's document number for each translation). A new unique five-digit sequence number was assigned to each file in the English archive, starting at 00001 within each year. Tables were created containing a one-line entry for each file, with all files from the three languages grouped together in one table for each year. Each entry contained the original UN file name, the title string, the UN document number, and the newly assigned sequence number (for English entries) or "00000" (for French and Spanish). The tables were sorted according to the UN numbers, and were used as input to the two-stage process of identifying parallels. Once the parallel entries were extracted from the tables, the French and Spanish lines associated with each English file were modified to replace their "00000" with the English file's sequence number. This string was then used to assign a new name to each file in the parallel set, with the two-digit year prefixed at the front and the three-letter language code attached as the filename extension. Thus a parallel file set looks like this:

	89_00031.eng
	89_00031.fre
	89_00031.spa
The texts have been partitioned onto CD-ROMs by language; within each language/disc, the texts are divided according to year. Since a given year might contain over 5000 files in one language, each year is further subdivided according to the first two digits of the sequence number; for example, the complete directory path for the first of the three files listed above is:
	eng/89/00/89_00031.eng
and similarly for the other two, substituting "fre" or "spa" for "eng" at both the beginning and the end of the string.
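
Given this scheme, the full pathname for any file can be derived from its year, sequence number, and language code, as in the following sketch (the function name is hypothetical):

	def corpus_path(year, seqno, lang):
	    # e.g. corpus_path(89, 31, "eng") -> "eng/89/00/89_00031.eng"
	    seq = "%05d" % seqno
	    return "%s/%02d/%s/%02d_%s.%s" % (lang, year, seq[:2], year, seq, lang)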