Overview of ECI/MCI Data Directory Structure

I. Introduction

MCI is composed of 49 corpora in 25 languages, contained on a single CD-ROM. Some of these are multilingual corpora, with the result that there are 58 separate components of MCI.

There are four broad categories of corpus, depending on how far it has been possible to standardise their structure. Each category corresponds to a subdirectory of the `data' directory at the top level of this CD. Each component is in turn contained within one of those four subdirectories.

This document describes the structure of the data directory and the directories and files contained therein.

II. Uniform Naming Scheme

II.1 File names

Individual corpora and components are associated with five-character corpus names: a three-letter source code plus two digits for uniqueness. The three-letter source code indicates the primary language for the monolingual corpora, and is "mul" for the multilingual corpora. The source code for the individual language components of these corpora is composed of an "m" plus a two-letter language code. For example, "dut02" is a monolingual Dutch corpus, "mul05" is a multilingual corpus and "mfr05" is the French component thereof.

Sub-corpora are distinguished by a single-letter suffix. The logical structure of the material is such that no individual corpus has more than a single level of substructure, but for ease of manipulation large sub-corpora are sometimes split across a number of files, in which case a further two digits are added. For example, "dut02g02" is the second file of the 7th sub-corpus of the "dut02" corpus.

Even in those few cases where a corpus is composed of a single file, the structure is as above, e.g. we have ger01a.eci, not ger01.eci.

More details on the organisation of multilingual corpora can be found in the companion document mulstrct.txt.
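The naming scheme in II.1 can be checked mechanically. The following is a minimal sketch in Python, not part of the MCI distribution: the function name `parse_corpus_name', the `CorpusName' record and the decision to lower-case names and strip any extension are all assumptions made for illustration. It flags only the "mul" source code, because an "m" plus two-letter component code cannot be told apart from a three-letter language code beginning with "m" by shape alone.

```python
import os
import re
from typing import NamedTuple, Optional

# Shape assumed from II.1: three letters, two digits, then an optional
# sub-corpus letter which may itself be followed by a two-digit file number.
_NAME_RE = re.compile(r"([a-z]{3})([0-9]{2})(?:([a-z])([0-9]{2})?)?")


class CorpusName(NamedTuple):
    """Hypothetical record for the parts of an MCI file or corpus name."""
    source: str                     # e.g. "dut", "mul", "mfr"
    number: int                     # the two uniqueness digits
    is_multilingual: bool           # True only for the "mul" source code
    subcorpus: Optional[str]        # single-letter suffix, if any
    subcorpus_index: Optional[int]  # 1 for "a", 2 for "b", ...
    file_number: Optional[int]      # set when a sub-corpus is split


def parse_corpus_name(name: str) -> CorpusName:
    """Split a name such as "dut02g02" or "ger01a.eci" into its parts."""
    # Assumption: case is not significant and any extension can be dropped.
    stem, _ext = os.path.splitext(name.lower())
    match = _NAME_RE.fullmatch(stem)
    if match is None:
        raise ValueError(f"not a recognisable corpus name: {name!r}")
    source, number, sub, file_no = match.groups()
    return CorpusName(
        source=source,
        number=int(number),
        is_multilingual=(source == "mul"),
        subcorpus=sub,
        subcorpus_index=(ord(sub) - ord("a") + 1) if sub else None,
        file_number=int(file_no) if file_no else None,
    )
```

Under these assumptions, parsing "dut02g02" yields source "dut", sub-corpus "g" (the 7th) and file number 2, matching the example given in II.1.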

II.2 Extensions

The majority of files in MCI have one of four extensions:

    .eci    data files
    .sgm    SGML wrapper files
    .ent    SGML entity declaration files
    .edt    documentation files

Exceptions are the CPYRIGHT files which appear throughout, the top-level README file and the files underneath `original' directories (see below).

III. The `data' Directory and its Subdirectories

The `data' directory contains four subdirectories: `eci1', `eci2', `eci3' and `eci4'. Corpora which are SGML-compliant, which use the ECI dtd and whose internal document structure has been reasonably comprehensively captured in TEI-conformant markup are under `eci1'. Corpora which use the ECI dtd but whose structure is only captured to a limited extent (i.e. which, although syntactically TEI-conformant, are not semantically TEI-conformant) are under `eci2'. Corpora which are SGML-compliant but which do not use the ECI dtd are under `eci3'. Non-SGML corpora are under `eci4'.

For all the corpora in eci1-3 except the very largest, the data as originally supplied to the ECI (sometimes after conversion to ISO Latin 1) is kept unchanged, except where file and directory names had to be altered, under a directory called `original', compressed using the GNU `gzip' utility. In order to respect the lowest-common-denominator filename restrictions, considerable surgery on names has been required. The vast majority of files under `original' originally had extensions `asc', `doc', `eci' or `txt'. Their compressed versions have extensions `agz', `dgz', `egz' and `tgz' respectively. Other long names and/or multiple extensions have been abbreviated as perspicuously as possible.

III.1 eci1

This directory contains corpora which have been quite thoroughly processed, whose structure has been marked up with TEI-conformant SGML using the ECI dtd, and which observe the Text/Markup invariant (see the Editorial Declaration file for discussion).

For each corpus in this category, there is a subdirectory of `eci1' with the same name as the corpus (see II.1 above). Within that directory one or more `.sgm' files provide SGML-mediated access to the corpus material. (See the bin/README for an introduction to using `sgmls' on ECI materials.) A `.edt' file provides details of the markup for the particular corpus. A `.ent' file parameterises the ECI dtd for the particular corpus, providing entity definitions which introduce corpus-specific details into the TEI header component. One or more `.eci' files contain the corpus material itself, with a prolog of SGML comments which give a brief description of the corpus.

Where appropriate for reasons either of heterogeneity of the material or of size, more than one `.sgm' file is present. In these cases, in addition to the overall header material for the corpus as a whole, the `.ent' file will contain specialised header information for each of the sub-corpora associated with the subsidiary `.sgm' files and the `.eci' files they subsume. The structuring of the sub-corpora is transparently manifested in the file names, as discussed above (II.1).

III.2 eci2

Corpora in this directory have not been processed to as high a standard as those in eci1. Although they are marked up according to the ECI dtd, there are significant aspects of document structure which have not been captured in SGML and are instead indicated by, e.g., tabs, blank lines, idiosyncratic tags or simply text semantics. The file structure of these corpora is as for those in `eci1'.

III.3 eci3

Corpora in this directory do not use the ECI dtd at all; rather, the corpus provider contributed them already marked up using some other dtd, which we have adopted or adapted. Note that these corpora do not observe the Text/Markup invariant. File names have been converted to ECI usage, as described in II.1, with typically only `.sgm' and `.eci' files present.

III.4 eci4

Corpora in this directory are not marked up using SGML at all.
However, `.sgm', `.edt' and `.ent' files are provided to give documentation and header information and, in some cases, to provide at least some structuring of the data.
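
The directory categories and the renaming of compressed files under `original', both described in section III, can be summarised as lookup tables. The following Python sketch is illustrative only: the names `CATEGORIES', `ORIGINAL_EXTENSIONS', `category_of' and `original_name' are hypothetical, the category descriptions are paraphrases of section III, and the example paths are invented rather than taken from the CD.

```python
from pathlib import PurePosixPath

# One entry per subdirectory of `data'; wording paraphrased from section III.
CATEGORIES = {
    "eci1": "SGML, ECI dtd, structure comprehensively captured in TEI markup",
    "eci2": "SGML, ECI dtd, structure captured only to a limited extent",
    "eci3": "SGML, but using a dtd other than the ECI dtd",
    "eci4": "not marked up using SGML",
}

# Extension of a gzip-compressed file under `original' -> extension the
# file had as originally supplied (section III).
ORIGINAL_EXTENSIONS = {"agz": "asc", "dgz": "doc", "egz": "eci", "tgz": "txt"}


def category_of(path: str) -> str:
    """Return the first eci1..eci4 component found in a '/'-separated path."""
    for part in PurePosixPath(path).parts:
        if part in CATEGORIES:
            return part
    raise ValueError(f"no eci1-eci4 component in path: {path!r}")


def original_name(compressed_name: str) -> str:
    """Map e.g. 'sample.agz' back to 'sample.asc'.

    Only the four extensions listed in section III are handled; names that
    were abbreviated in other ways cannot be recovered by this table.
    """
    stem, dot, ext = compressed_name.rpartition(".")
    if not dot or ext.lower() not in ORIGINAL_EXTENSIONS:
        raise ValueError(f"unrecognised compressed name: {compressed_name!r}")
    return f"{stem}.{ORIGINAL_EXTENSIONS[ext.lower()]}"
```

Note that in this scheme a `tgz' file is a gzip-compressed `txt' file, not a tar archive, so it should be decompressed with `gzip' alone.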