Multilingual Corpora in ECI/MCI

I. Introduction

The thirteen multilingual corpora contained within ECI/MCI fall into a small number of structural types, which in turn determine how they can be accessed. This document outlines the different types and how they are structured.

II. Parallel Sub-corpora

The corpora mul04 and mul05 consist of a number of parallel sub-corpora, each contained in a separate subdirectory (e.g. msp04), with its own .sgm, .ent and .eci files. The mulnn.sgm files themselves include these sub-corpora in the same way that mci.sgm includes the component corpora of ECI/MCI, that is, as SUBDOCs.

Mul12 is not really parallel at all, being comparable material assembled by one site about the Danish, English and Spanish legal systems, but it is structured in a similar way to mul04 and mul05.

III. More or less parallel files

The corpora mul03, mul06, mul08 and mul09 consist of a number of parallel files, one per language, all located in the same directory. This directory and its files are structured similarly to those for a monolingual corpus, so e.g. mul03a.eci, mul03b.eci and mul03c.eci are the German, French and Italian versions respectively of the mul03 material. To provide for aliasing (see below), subdirectories (e.g. mfr03) are provided, with links to the parent directory for both .eci and .ent files; a separate .sgm file is provided in each subdirectory, giving access from there to a single language.

The corpora mul01 and mul02 are similar but less parallel. They each consist of a collection of varied material in various languages. Mul01 is organised into subcomponents by material, with individual files in different languages, while mul02 is organised into subcomponents by language. Again, subdirectories with appropriate links are provided for aliasing.

IV. Interlinear files

The corpora mul07 and mul13 consist of files with alternating elements in two languages. Their structure is identical to that of ordinary monolingual corpora.
They have no aliases, as they have no monolingual components.

V. Other cases

The corpora mul10 and mul11 don't fit this picture well:

mul10: Consists of multiple English translations of two French originals, so there is not an equal number of mfr10 and men10 .eci files. No aliases.

mul11: Acquired so late that no SGML markup was done. No aliases.

VI. Aliases

When a multilingual corpus has separable monolingual components, these are independently linked into the appropriate eci[1-4] data directory, with corpus names composed of the appropriate language code and a code number which is formed by adding 10 to the corpus code number of the parent multilingual corpus. For example, fre13 is the name of the French component of mul03, and is linked to the appropriate subdirectory, i.e. mul03/mfr03.

The point of all this is that e.g. eci1/fre*/*.eci is a pattern which will match all and only the type 1 French .eci files, whether monolingual in origin or components of multilingual corpora.
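
The alias naming rule described above can be sketched in a few lines of Python. This is an illustration only, not part of the ECI/MCI distribution: the function name alias_name and the example paths are hypothetical, and the placement of mul03 under eci1 is assumed purely for the sake of the example. Only the rule itself (language code plus parent number plus 10) and the fre13 example come from the text.

```python
import re
from fnmatch import fnmatch


def alias_name(language_code: str, parent_corpus: str) -> str:
    """Build the alias for one language component of a multilingual corpus.

    Hypothetical helper: the alias is the language code followed by the
    parent's two-digit corpus number plus 10, e.g. ("fre", "mul03") -> "fre13".
    """
    match = re.fullmatch(r"mul(\d{2})", parent_corpus)
    if match is None:
        raise ValueError(f"not a multilingual corpus name: {parent_corpus!r}")
    return f"{language_code}{int(match.group(1)) + 10:02d}"


# The example given in the text: the French component of mul03.
french_alias = alias_name("fre", "mul03")

# Hypothetical paths, assuming for illustration that mul03 lives under eci1.
pattern = "eci1/fre*/*.eci"
via_alias = f"eci1/{french_alias}/mul03b.eci"   # reached through the alias link
via_parent = "eci1/mul03/mul03b.eci"            # reached through the parent corpus

print(french_alias)
print(fnmatch(via_alias, pattern))
print(fnmatch(via_parent, pattern))
```

Under these assumptions the French file is matched by the pattern only when reached through its alias directory, which is what makes a single per-language pattern sufficient.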