Description of the Contents of the Tables Directory
Version 1.0

by David Graff
Linguistic Data Consortium
University of Pennsylvania
March 3, 1994

Each disc in the UN Parallel Text Corpus contains a "tables" directory,
which, like the "doc" directory, has identical content on all three
discs. There are two sets of tables provided: the "dircount.*" tables
give a summary of the number of document files contained in the corpus,
sorted by language, year and subdirectory; the "reftable.*" tables list,
for each year, the languages present in each parallel set, along with a
rough measure of data proportions in each file of a parallel set.

As their names imply, the four "dircount" tables give information on the
number of files in each language individually, and on all languages
combined ("dircount.all" is simply the combined content of the other
three).

Each "reftable" file contains one line for each parallel set in a given
year. The first column of each line gives the common part of the file
name for the parallel set (the parallel series number), and following
columns indicate which languages have a file in that set. For each file
present in the set, a percentage value is given that shows the relative
proportion of data in the file. The percentages were derived by removing
all SGML markup from the files in the set, counting the number of lines
in each file, and then dividing this by the sum of lines in all files of
the set. The following example, drawn from "reftable.89", will
illustrate how to interpret this information:

    89_00028 E 38 S 62
    89_00031 E 32 F 33 S 34
    89_00032 E 48 F 52

The first line indicates a parallel set in which there is no French
version of the document, the second shows a complete three-way set, and
the last shows a case where the Spanish version is absent.
The top line shows a relatively large disparity in quantity between the
members of the set (the Spanish file has about 1.6 times as many lines
of text as the English file), while the three-way set shows the closest
match in quantity across languages.

This publication of parallel texts is English-based; the English-version
document file is present in all parallel sets. Researchers who want to
focus on just English-French (or just English-Spanish) translation can
determine which parallel sets to use by simply extracting all lines from
the "reftable" files that contain "F" (or those that contain "S"). Those
interested in French-Spanish translation can focus on the somewhat
smaller set of lines containing both "F" and "S".

In both the "dircount" and "reftable" files, there is no indication as
to the actual amount of text data in each directory or parallel set.
Individual document files range in size between 1 K and 400 K bytes;
many of the larger files include quite extensive sections of tabular
material (e.g. tables of contents, budgets, listings of nations with
various socio- or econometric data, etc.). And, as one would expect from
diplomatic discourse, there is a noticeable amount of formulaic phrasing
in the non-tabular material.
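
The selection procedure described above (picking out the "reftable"
lines that mention a given language) and the way the percentages are
derived can be sketched in a few lines of Python. This is only an
illustration, not part of the corpus distribution. It assumes that the
fields on each line are separated by whitespace and that percentages are
rounded to the nearest integer; neither detail is stated in this
description. The function names are hypothetical, and the sample data
are the three lines from "reftable.89" quoted earlier.

```python
from typing import Dict, Iterable, List, Set, Tuple


def parse_reftable_line(line: str) -> Tuple[str, Dict[str, int]]:
    """Split one reftable line into (series number, {language: percent})."""
    fields = line.split()
    series_id, rest = fields[0], fields[1:]
    # Assumption: after the series number, fields alternate between a
    # one-letter language code (E, F, S) and that file's percentage.
    shares = {lang: int(pct) for lang, pct in zip(rest[0::2], rest[1::2])}
    return series_id, shares


def select_sets(lines: Iterable[str], required: Set[str]) -> List[str]:
    """Return series numbers of sets that contain every required language."""
    selected = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        series_id, shares = parse_reftable_line(line)
        if required <= shares.keys():
            selected.append(series_id)
    return selected


def line_percentages(line_counts: Dict[str, int]) -> Dict[str, int]:
    """Turn per-file line counts into percentages of the set total.

    Assumption: values are rounded to the nearest integer; the rounding
    actually used for the tables is not documented here.
    """
    total = sum(line_counts.values())
    return {lang: round(100 * n / total) for lang, n in line_counts.items()}


# The three example lines quoted from "reftable.89".
SAMPLE = [
    "89_00028 E 38 S 62",
    "89_00031 E 32 F 33 S 34",
    "89_00032 E 48 F 52",
]

if __name__ == "__main__":
    print("English-French: ", select_sets(SAMPLE, {"F"}))
    print("English-Spanish:", select_sets(SAMPLE, {"S"}))
    print("French-Spanish: ", select_sets(SAMPLE, {"F", "S"}))
    # Size ratio of the Spanish to the English file in the first set.
    _, shares = parse_reftable_line(SAMPLE[0])
    print("S/E ratio:", round(shares["S"] / shares["E"], 2))
```

Because English is present in every parallel set, requiring "F" alone is
enough to select the English-French sets; requiring both "F" and "S"
gives the smaller French-Spanish selection.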