ECI README file.

This is the European Corpus Initiative Multilingual Corpus I (ECI/MCI).

All use of this corpus is subject to a licence agreement: see the file LICENCE.

The ECI is a volunteer effort, sponsored by the Association for Computational Linguistics (European Chapter), carried out at the Human Communication Research Centre, University of Edinburgh (HCRC) and the Institut Dalle Molle pour les etudes semantiques et cognitives, University of Geneva (ISSCO), with modest additional financial support from the European Network in Language and Speech (ELSNET) and the Network for European Reference Corpora (NERC).

We are very grateful to all those who made material available for this effort, none of whom received any compensation for their contributions: Without their generosity and prior effort, there would be no ECI/MCI. Please respect the restrictions, if any, they have specified on the use of their contributions -- these are recorded both in the top-level LICENCE file and in the CPYRIGHT files and headers of the individual corpora themselves.

I. Directory summary

bin:

msdos: Contains MSDOS .exe files of some basic tools (gzip, perl and sgmls)
unix: Contains UN*X scripts for corpus manipulation

data:

Contains the actual corpus data

doc:

Documentation files

lib:

fonts: .bdf versions of fonts for ISO-LATIN-5 and -7 (Cyrillic and Greek)
tei: ECI and TEI files needed for SGML applications

src:

mac: BinHexed version of gzip
msdos: PKZipped sources of gzip, perl and sgmls
perl: Perl scripts for corpus manipulation
unix: sources for gzip and sgmls

II. The Corpus Data

The actual data is in files with a .eci suffix, two levels below the data directory. For information about the structure of that directory, see the file doc/dirstrct.txt; for a listing of the titles of the corpora themselves, see mci.bib; and for a brief summary of their contents, see doc/corpdesc.txt.
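For readers who prefer a script to shell globbing, the "two levels down" layout can be sketched in Python. This is only an illustration, not part of the distribution: the function name find_eci_files and its lang_prefix parameter are made up for this example, and the sketch assumes nothing about the layout beyond what is stated above.

```python
from pathlib import Path


def find_eci_files(eci_root, lang_prefix=""):
    """Return the sorted .eci files two directory levels below <eci_root>/data.

    lang_prefix (a hypothetical convenience) narrows the second-level
    directory name, so "ger" mirrors the shell pattern data/*/ger*/*.eci.
    """
    data_dir = Path(eci_root) / "data"
    pattern = f"*/{lang_prefix}*/*.eci"
    # Keep regular files only, in case a directory happens to end in .eci.
    return sorted(p for p in data_dir.glob(pattern) if p.is_file())
```

With the CD mounted at /cdrom, find_eci_files("/cdrom") would list every data file and find_eci_files("/cdrom", "ger") only those in directories whose names start with "ger".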

Most of the data is marked up in TEI-compliant SGML -- see mci.edt for discussion, and the bin and src directories for tools to assist in processing and accessing the data. The top-level file mci.sgm provides an SGML way in to the corpus as a whole, or to selected parts of it -- again see mci.edt for further instructions.

Do not despair if you either have no interest in SGML markup, or no facilities for exploiting it, but just want 'plain text': The bulk of the data provided here, including all that under directories data/eci1 and data/eci2, observes what we call the Text/Markup Invariant:

Every line in a data file (.eci file) is either all text or all markup, and a line is a markup line if and only if it begins with a left angle bracket (<). This makes restricting your processing to 'plain text' very easy -- just look only at lines which begin with some character other than <.

The UN*X shell script bin/unix/textonly, introduced below, both implements this for UN*X users and documents it for others.
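For users without a UN*X shell, the Text/Markup Invariant can also be sketched in a few lines of Python. This is not the textonly script itself, only an illustration of the rule; the names text_lines and read_text_only are made up for this example. Two assumptions are worth flagging: empty lines are treated as text (they do not begin with <), and files are read as ISO-8859-1, which decodes any byte and so cannot fail on the other 8-bit character sets.

```python
def text_lines(lines):
    """Yield only the text lines, i.e. those that do not begin with '<'."""
    for line in lines:
        if not line.startswith("<"):
            yield line


def read_text_only(path, encoding="iso-8859-1"):
    """Return the text lines of one .eci file, line endings preserved."""
    # newline="" stops Python from translating line endings.
    with open(path, encoding=encoding, newline="") as handle:
        return list(text_lines(handle))
```

Because the test looks only at the first character of each line, it applies only to data that observes the invariant.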

Note that the file lib/tei/tei.dcl is an SGML declaration which is required for any SGML application processing ECI/MCI files.

III. Character Sets

The majority of the data in ECI/MCI is encoded using the ISO-8859-1 (ISO Latin 1) character set. Some use is also made of ISO-8859-2 (ISO Latin 2, for Czech), ISO-8859-5 (Cyrillic, for Bulgarian and Russian) and ISO-8859-7 (Greek); the last two are referred to elsewhere in this distribution as ISO-LATIN-5 and -7. All of these are 8-bit character sets with 256 character codes each. They are also all virtually identical to ASCII for the first 128 character codes. Some support is provided, mostly for UN*X environments, for displaying and printing the full character inventory of these character sets -- see src/unix/isoscrpt and lib/fonts for more information.
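On a modern system it may be easier to convert the data to Unicode than to install the fonts. The following Python sketch shows one way to do that, on the assumption that the four character sets correspond to Python's iso-8859-1, -2, -5 and -7 codecs. The CODECS table, its keys and the function names are made up for this example; which codec applies to which file has to be taken from the corpus documentation.

```python
# Assumed mapping from the character sets named above to Python codec names.
CODECS = {
    "latin1": "iso-8859-1",
    "latin2": "iso-8859-2",    # Czech
    "cyrillic": "iso-8859-5",  # Bulgarian and Russian
    "greek": "iso-8859-7",
}


def decode_line(raw, charset):
    """Decode one line of raw bytes using the codec chosen for its corpus."""
    return raw.decode(CODECS[charset])


def shares_ascii_range(codec):
    """Check that a codec agrees with ASCII on the first 128 character codes."""
    low = bytes(range(128))
    return low.decode(codec) == low.decode("ascii")
```

Since all four sets agree with ASCII in the lower half, a line can be tested for a leading < before its character set is known.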

IV. Examples of use on a UN*X system

To see just the header information on the corpus as a whole, type

eci mci.sgm

To process the ENTIRE corpus through the sgmls program, type

eci -iall mci.sgm

at the top level.

To process all the Spanish corpus components similarly, type

eci -ispa mci.sgm

at the top level.

To retrieve just the text of all the German components, type

textonly data/*/ger*/*.eci

at the top level.

Note that all of the above require that the environment variable ECI_ROOT be set to the full pathname of the CD (e.g. "/cdrom"), that your working directory be $ECI_ROOT, and that $ECI_ROOT/bin/unix be in your path.
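If one of the commands fails, it may help to check these three requirements one by one. The following Python sketch does so; it is not a tool shipped with the corpus, and the name check_eci_setup and its messages are made up for this example. The environment and working directory are passed in as arguments so that the check can be tried without changing your shell.

```python
import os
from pathlib import Path


def check_eci_setup(environ=None, cwd=None):
    """Return a list of setup problems; an empty list means all three hold."""
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)

    root = environ.get("ECI_ROOT")
    if not root:
        # The other two checks depend on ECI_ROOT, so stop here.
        return ["ECI_ROOT is not set"]

    problems = []
    if cwd != Path(root):
        problems.append(f"working directory is {cwd}, not {root}")

    tools = Path(root) / "bin" / "unix"
    entries = [Path(p) for p in environ.get("PATH", "").split(os.pathsep) if p]
    if tools not in entries:
        problems.append(f"{tools} is not in PATH")
    return problems
```

Calling check_eci_setup() with no arguments inspects the current process. The comparison is textual, so symbolic links and relative paths are not resolved.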

V. Finally, "Caveat Lector"

These corpora came to us in every conceivable format, character set, state of existing preparation and markup. Previous annotation sometimes introduced errors that we have missed, and we know that we have introduced some errors of our own, despite our best efforts. This is almost inevitable, since the size of these corpora required us to use semi-automatic markup and correction schemes. Nonetheless, we believe that our efforts have added more value than they have taken away, and hope the results will be of use.

We would be glad to hear of errors that you discover when using these corpora, or to receive tools for aiding in their exploitation -- send e-mail to eucorp@cogsci.ed.ac.uk. We will endeavour to keep all recipients of the ECI/MCI informed of any such submissions.