Overview of ECI/MCI Data Directory Structure

I.  Introduction

MCI is composed of 49 corpora in 25 languages, contained on a single
CD-ROM.  Some of these are multilingual corpora, with the result that
there are 58 separate components of MCI.  There are four broad
categories of corpus, depending on how far it has been possible to
standardise their structure.  Each category corresponds to
a subdirectory of the `data' directory at the top level of this CD.
Each component is in turn contained within one of those four
subdirectories.  This document describes the structure of the data
directory and the directories and files contained therein.

II.  Uniform Naming Scheme

II.1 File names

Individual corpora and components are associated with five-character
corpus names: three letters for source code plus two digits for
uniqueness.  The three letter source code indicates the primary
language for the monolingual corpora, and is "mul" for the
multilingual corpora.  The source code for the individual language
components of these corpora is composed of an "m" plus a two-letter
language code.  For example "dut02" is a monolingual Dutch corpus,
"mul05" is a multilingual corpus and "mfr05" is the French component
thereof.

Sub-corpora are distinguished by a single letter suffix, and although
the logical structure of the material is such that no individual
corpus has more than a single level of substructure, for ease of
manipulation large sub-corpora are sometimes split across a number of
files, in which case a further two digits are added.  For example
"dut02g02" is the second file of the 7th sub-corpus of the "dut02"
corpus.

Even in those few cases where a corpus is composed of a single file,
the structure is as above, e.g. we have ger01a.eci, not ger01.eci.
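
The naming scheme above can be sketched in Python as follows.  This is
only an illustration: the regular expression is inferred from the
description in this section, the names `CorpusName', `parse_corpus_name'
and `subcorpus_ordinal' are hypothetical, and lower-casing the input is
an assumption about how names may appear on a given file system.

```python
import re
from typing import NamedTuple, Optional

# Assumed from the description above: three-letter source code, two digits,
# an optional sub-corpus letter, and an optional two-digit file number.
_NAME = re.compile(r"([a-z]{3})(\d{2})(?:([a-z])(\d{2})?)?")


class CorpusName(NamedTuple):
    source: str                 # e.g. "dut", "mul", "mfr"
    number: int                 # the two digits that make the name unique
    subcorpus: Optional[str]    # single-letter suffix, if any
    file_index: Optional[int]   # two-digit file number, if the sub-corpus is split


def parse_corpus_name(name: str) -> CorpusName:
    """Split a corpus or file name such as "dut02g02" or "ger01a.eci"."""
    stem = name.split(".", 1)[0].lower()  # drop any extension (assumption: case-insensitive)
    match = _NAME.fullmatch(stem)
    if match is None:
        raise ValueError(f"not a recognised corpus name: {name!r}")
    source, number, sub, index = match.groups()
    return CorpusName(source, int(number), sub, int(index) if index else None)


def subcorpus_ordinal(letter: str) -> int:
    """Map a sub-corpus letter to its position: "a" is 1, "g" is 7."""
    return ord(letter.lower()) - ord("a") + 1
```

For instance, parse_corpus_name("dut02g02") yields source "dut", number
2, sub-corpus "g" and file index 2, and subcorpus_ordinal("g") gives 7,
matching the example in the text.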

More details on the organisation of multilingual corpora can be found
in the companion document mulstrct.txt.

II.2 Extensions

The majority of files in MCI have one of four extensions:

.eci data files

.sgm SGML wrapper files

.ent SGML entity declaration files

.edt Documentation files

Exceptions are the CPYRIGHT files which appear throughout, the
top-level README file and the files underneath `original' directories
(see below).

III.  The `data' Directory and its Subdirectories

The `data' directory contains four subdirectories: `eci1', `eci2',
`eci3' and `eci4'.  Corpora which are SGML-compliant, which use the
ECI dtd and whose internal document structure has been reasonably
comprehensively captured in TEI-conformant markup are under `eci1'.
Corpora which use the ECI dtd but whose structure is only captured to
a limited extent (i.e. which although syntactically TEI-conformant are
not semantically TEI-conformant) are under `eci2'.  Corpora which are
SGML-compliant but which do not use the ECI dtd are under `eci3'.
Non-SGML corpora are under `eci4'.
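
The four-way classification above amounts to a short decision
procedure.  A minimal sketch in Python, where the function name and
its boolean parameters are hypothetical and simply restate the criteria
given in this section:

```python
def category_directory(is_sgml: bool,
                       uses_eci_dtd: bool = False,
                       structure_captured: bool = False) -> str:
    """Return the `data' subdirectory implied by the criteria in section III."""
    if not is_sgml:
        return "eci4"   # non-SGML corpora
    if not uses_eci_dtd:
        return "eci3"   # SGML-compliant, but using some other dtd
    if not structure_captured:
        return "eci2"   # ECI dtd, structure captured only to a limited extent
    return "eci1"       # ECI dtd, structure reasonably comprehensively captured
```

For example, category_directory(True, True, False) returns "eci2".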

For all the corpora in eci1-3 except the very largest, the data as
originally supplied to the ECI (sometimes post-conversion to ISO Latin 1)
is contained unchanged (except where necessary as regards file and
directory names) under a directory called `original', compressed using
the GNU `gzip' utility.  In order to respect the
lowest-common-denominator filename restrictions, considerable surgery
on names has been required.  The vast majority of files under original
had extensions `asc', `doc', `eci' or `txt'.  Their compressed
versions have extensions `agz', `dgz', `egz' and `tgz' respectively.
Other long names and/or multiple extensions have been abbreviated in as
perspicuous a way as possible.
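
The extension mapping just described can be reversed mechanically.  The
following Python sketch uses the standard `gzip' module; the function
names are hypothetical, and the example file name is invented for
illustration.  Note that under this scheme `tgz' denotes a gzip-compressed
`txt' file, not a tar archive.

```python
import gzip
from pathlib import Path

# Compressed extension -> original extension, as listed in the text above.
COMPRESSED_TO_ORIGINAL = {"agz": "asc", "dgz": "doc", "egz": "eci", "tgz": "txt"}


def original_name(compressed: str) -> str:
    """Return the file name with its pre-compression extension restored."""
    path = Path(compressed)
    ext = path.suffix.lstrip(".").lower()
    if ext not in COMPRESSED_TO_ORIGINAL:
        # Other names were abbreviated by hand and cannot be reversed here.
        raise ValueError(f"no known original extension for {compressed!r}")
    return str(path.with_suffix("." + COMPRESSED_TO_ORIGINAL[ext]))


def read_original(path: str) -> bytes:
    """Decompress one file from an `original' directory and return its bytes."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

For example, original_name("sample.tgz") returns "sample.txt".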

III.1 eci1

This directory contains corpora which have been quite thoroughly
processed, whose structure has been marked up with TEI-conformant SGML
using the ECI dtd, and which observe the Text/Markup invariant (see
the Editorial Declaration file for discussion).  For each corpus in
this category, there is a subdirectory of `eci1' with the same name
as the corpus (see II.1 above).  Within that directory one or more
`.sgm' files provide SGML-mediated access to the corpus material.
(See the bin/README for an introduction to using `sgmls' on
ECI materials.)  A `.edt' file provides details of the markup for the
particular corpus.  A `.ent' file parameterises the ECI dtd for the
particular corpus, providing entity definitions which introduce
corpus-specific details into the TEI header component.  One or more
`.eci' files contain the corpus material itself, with a prolog of SGML
comments which give a brief description of the corpus.
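
As a rough aid, the expected contents of such a corpus directory can be
checked by extension.  The Python sketch below is an assumption-laden
illustration: the function name is hypothetical, it checks only that at
least one file of each kind is present, and it does not inspect file
contents.

```python
import os

# The four kinds of file described above for a corpus directory under `eci1'.
REQUIRED_EXTENSIONS = (".sgm", ".edt", ".ent", ".eci")


def missing_extensions(filenames):
    """Return the required extensions that no file in the listing carries."""
    present = {os.path.splitext(name)[1].lower() for name in filenames}
    return [ext for ext in REQUIRED_EXTENSIONS if ext not in present]
```

Passing the result of os.listdir() for a corpus directory should yield
an empty list when all four kinds of file are present.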

Where appropriate for reasons either of heterogeneity of the material
or of size, more than one `.sgm' file is present.  In these cases in
addition to the overall header material for the corpus as a whole, the
`.ent' file will contain specialised header information for each of
the sub-corpora associated with the subsidiary `.sgm' files and the
`.eci' files they subsume.  The structuring of the sub-corpora is
transparently manifested in the file names, as discussed above (II.1).

III.2 eci2

Corpora in this directory have not been processed to as high a
standard as those in eci1.  Although they are marked up according to
the ECI dtd, there are significant aspects of document structure which
have not been captured in SGML and are instead indicated by e.g. tabs,
blank lines, idiosyncratic tags or simply text semantics.  The file
structure of
these corpora is as for those in `eci1'.

III.3 eci3

Corpora in this directory do not use the ECI dtd at all.  Rather, the
corpus provider contributed each of them already marked up using some
other dtd, which we have adopted/adapted.  Note that these corpora do not
observe the Text/Markup invariant.  File names have been converted to
ECI usage, as described in II.1, with typically only `.sgm' and `.eci'
files present.

III.4 eci4

Corpora in this directory are not marked up using SGML at all.
Nonetheless `.sgm', `.edt' and `.ent' files are provided to give
documentation and header information, and in some cases to provide at
least some structuring to the data.