Celex readme

Celex

The greater part of this file is drawn from the main README file on the CELEX CD-ROM.

The CD-ROM Version of the CELEX Lexical Database

Table of Contents:

CELEX
The Databases
1995 Extensions and Revisions
The User Guide
Copyright
Directory Structure
On Using AWK to Process the Lexical Files
File Sizes
Organization
Original Sources of the CELEX Database
A Brief Overview of the English Data on the CD-ROM
Known Bugs and Errors on the CD (Second Release, 1995)

CELEX

CELEX is the Dutch Centre for Lexical Information. It was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. Over the years it has been funded mainly by the Netherlands Organisation for Scientific Research (NWO) and the Dutch Ministry of Science and Education. CELEX is now part of the Max Planck Institute for Psycholinguistics.

The Databases

This CD-ROM contains plain ASCII versions of the CELEX lexical databases of English (version 2.5), Dutch (version 3.1) and German (version 2.5). The original CELEX databases can be consulted interactively either by using the SQL*PLUS query language within an ORACLE RDBMS environment, or by means of the specially designed user interface FLEX. As the FLEX interface has been written to communicate with the underlying UNIX operating system and the ORACLE software, it is completely bound to this particular configuration and hence cannot be distributed separately and does not feature on the CD-ROM.

To make for greater compatibility with other operating systems, the databases on the CD-ROM have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files that can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files.

As in the original databases, some kinds of information have to be computed on-line. Wherever necessary, AWK scripts have been provided to recover this information. Also, some C-programs have been included, along with their MS-DOS executables and HP-UX (Hewlett-Packard UNIX) binaries. README files specify the details of their use.

The CD-ROM is mastered using the ISO 9660 data format, with the Rock Ridge extensions, allowing it to be used in VMS, MS-DOS (*), Macintosh (**) and UNIX (***) environments.

Anyone who would like to purchase the CD-ROM should send a check or purchase order made payable to the "Trustees of the University of Pennsylvania" to

   Linguistic Data Consortium
   441 Williams Hall
   University of Pennsylvania
   Philadelphia, PA 19104-6305
   ldc@unagi.cis.upenn.edu
   Tel: +1/215/898-0464 Fax: +1/215/573-2175

(*) PC users may experience some difficulty reading the README files on the CD, as these were compiled on a UNIX system and as a consequence lack carriage return characters at the end of lines. This may result in a near-continuous string of characters, broken only by linefeeds, running off the right-hand side of the screen. Reading these files into word processors like WordPerfect or MS-Word will usually correct this, as these automatically detect and convert the linefeeds found in the text. The same problem will appear when scrolling through the lexical data files, but the lack of carriage returns will not affect the functioning of the AWK programs. Because of the large size of these files, we do not recommend reading lexical data files into word processors. Instead, retrieve the items you want with the predefined AWK scripts, or use more-or-less standard DOS tools such as 'find' and 'grep' to retrieve single records.
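
For PC users who would rather convert a README once than load it into a word processor, the line-ending difference can also be handled with a few lines of code. What follows is a minimal Python sketch and not part of the CD-ROM; the function name lf_to_crlf is hypothetical.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Return data with every bare linefeed turned into CR+LF."""
    # Normalise first so that existing CR+LF pairs are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Made-up sample text standing in for the contents of a README file.
sample = b"CELEX\nREADME\n"
print(lf_to_crlf(sample))
```

To convert a copy on the hard disk, read the file in binary mode, pass its contents through the function, and write the result under a new name. This is sensible for the README files only; the large lexical data files are better left as they are.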

(**) A Mac with a CD-ROM drive obtained before December 1992, on which no system upgrades have been installed since that date, will not be able to read the CELEX CD-ROM. In such a case, all that is needed is to obtain the upgraded driver software (a very small amount of code), and copy it onto the system in place of the existing driver. The upgrade can be obtained as follows:

Connect to ftp server: ftp.apple.com
Go to directory: dts/mac/sys.soft/cdrom
Get file: cd-rom-setup

(***) Some UNIX systems will have trouble displaying the contents of the lexical data and documentation files due to the presence of a semicolon plus a version number at the end of the filetype. These have been included to conform to the ISO 9660 standard required for CD-ROM production. If your system interprets the semicolon as a command line delimiter, either escape the semicolon with a backslash or access the file by replacing the semicolon and number with wildcards, e.g. two question marks.
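
When files are copied to a hard disk for local processing, the version suffix can be dropped once and for all. The following Python sketch shows one way of doing so; the function name strip_version and the sample name 'eol.cd;1' are illustrative assumptions, and the actual suffix on your system may differ.

```python
def strip_version(name: str) -> str:
    """Drop a trailing ';<digits>' version suffix, if present."""
    base, sep, version = name.rpartition(";")
    if sep and version.isdigit():
        return base
    # No semicolon, or something other than digits after it: leave as is.
    return name


print(strip_version("eol.cd;1"))
```

Python passes file names to the operating system without shell interpretation, so the semicolon needs no escaping inside a script. Copying a file under the stripped name avoids the problem in later shell commands as well.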

1995 Extensions and Revisions

About 1000 new German lemma entries.
Revised German morphological parses, verb argument structures, and inflectional paradigm codes.
Inclusion of a German Corpus Type Lexicon.
Frequency counts included in all lemma and wordform files for ease of use.
Preferred spelling variants always given as main entries.
Syllable frequency counts for Dutch and English.
Full AWK scripts supplied instead of isolated functions.
Provision of an efficient, index-based C program for joining wordform and lemma files.
Complete version of the German Linguistic Guide.
User Guide files in both European A4 and American Letter PostScript format.

The User Guide

The CELEX User Guide describes in detail the kinds of information in the databases using unique labels for each column in the RDB. For instance, the syllabified phonological headword information for English lemmas, with stress markers, in the CPA character set, is referred to as "PhonStrsCPA".

The README files describing the columns of the lexical files on this CD-ROM use the same labels. In order to facilitate the use of this CD-ROM, the relevant sections in the User Guide on English, German and Dutch have been made available as PostScript files in both European A4 and American Letter format. Users preferring a bound hardcopy of the CELEX User Guide (Dfl 115,--) should send a request by electronic mail to

   celex@mpi.nl (INTERNET)

or by surface mail to

   
   Richard Piepenbrock (CELEX project manager)
   Max Planck Institute for Psycholinguistics
   P.O. Box 310
   NL-6500 AH Nijmegen
   THE NETHERLANDS

P.S. For additional information (also on the on-line database version) and the latest news on updates and the like, you are invited to consult the CELEX homepage on the World Wide Web at

   http://www.kun.nl/celex/

Copyright

LICENCE: The copyright holder grants to the purchaser of this CD-ROM unrestricted license to use all the lexical information included herein for research purposes only, subject to the following restrictions:

No onward distribution of the lexical data is allowed -- copies may be made only for use by the purchaser and her/his research group, for ease of use by that group, etc.;
The contribution of CELEX is acknowledged in any public presentation or publication of any work based on the lexicons. (*)

CELEX carries no warranty of any kind.

(*) This CD-ROM should be referred to as: R. H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995.

Directory Structure

root:
- README (the README file you are reading now)
- english (directory containing the English database)
- german (directory containing the German database)
- dutch (directory containing the Dutch database)
- awk (directory containing awk scripts for joining columns of different files)
- intro_a4.ps (PostScript file with the introduction to CELEX of the CELEX User Guide, in European A4-format)
- intro_let.ps (PostScript file with the introduction to CELEX of the CELEX User Guide, in American Letter format)
english:
- README
- eol (directory, English Orthography, Lemmas)
- epl (directory, English Phonology, Lemmas)
- eml (directory, English Morphology, Lemmas)
- esl (directory, English Syntax, Lemmas)
- efl (directory, English Frequency, Lemmas)
- eow (directory, English Orthography, Wordforms)
- epw (directory, English Phonology, Wordforms)
- emw (directory, English Morphology, Wordforms)
- efw (directory, English Frequency, Wordforms)
- ect (directory, English Corpus Types)
- efs (directory, English Frequency, Syllables)
- eug_a4.ps (PostScript file of the CELEX User Guide on English, in European A4-format)
- eug_let.ps (PostScript file of the CELEX User Guide on English, in American Letter format)
german:
- README
- gol (directory, German Orthography, Lemmas)
- gpl (directory, German Phonology, Lemmas)
- gml (directory, German Morphology, Lemmas)
- gsl (directory, German Syntax, Lemmas)
- gfl (directory, German Frequency, Lemmas)
- gow (directory, German Orthography, Wordforms)
- gpw (directory, German Phonology, Wordforms)
- gmw (directory, German Morphology, Wordforms)
- gfw (directory, German Frequency, Wordforms)
- gct (directory, German Corpus Types)
- gug_a4.ps (PostScript file of the CELEX User Guide on German, in European A4-format)
- gug_let.ps (PostScript file of the CELEX User Guide on German, in American Letter format)
dutch:
- README
- dol (directory, Dutch Orthography, Lemmas)
- dpl (directory, Dutch Phonology, Lemmas)
- dml (directory, Dutch Morphology, Lemmas)
- dsl (directory, Dutch Syntax, Lemmas)
- dfl (directory, Dutch Frequency, Lemmas)
- dow (directory, Dutch Orthography, Wordforms)
- dpw (directory, Dutch Phonology, Wordforms)
- dmw (directory, Dutch Morphology, Wordforms)
- dfw (directory, Dutch Frequency, Wordforms)
- dct (directory, Dutch Corpus Types)
- dab (directory, Dutch Abbreviations)
- dfs (directory, Dutch Frequency, Syllables)
- dug_a4.ps (PostScript file of the CELEX User Guide on Dutch, in European A4-format)
- dug_let.ps (PostScript file of the CELEX User Guide on Dutch, in American Letter format)
Each of these directories has the following structure:
- README (a help file)
- [egd][opmsf..][lw..].cd (the database file)
- awk (directory containing AWK functions)
- c (directory with a C program, if present)

On Using AWK to Process the Lexical Files

File Structure

All lexicon files are characterized by the following properties:

The field separator is the backslash ("\").
The record separator is the newline character. Due to the different implementation of this character on various platforms (linefeed and/or carriage return) records may run off the screen when displayed using commands like 'type' or 'cat'. Reading (parts of) files into a standard editor or retrieving them with the AWK 'print' statement will solve this problem.
The first field contains the unique identity number IdNum, which can be used to join columns of different lexicons.
Each file contains the standard spelling of all its entries in order to maintain immediate interpretability.
Note that fields may be empty (for instance, monomorphemic words have empty fields for the morphological constituency information).
Variants (spellings, pronunciation, alternative parsings) of the same lexical entry are always listed on a single line, fixed column positions specifying the alternatives. Where this convention could not be followed, separate files (described in the accompanying README files) with the additional variants have been added.
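
The properties listed above can be made concrete with a short sketch. It is written in Python rather than AWK and uses a made-up record, not one copied from the CD-ROM; the name parse_record is hypothetical.

```python
FIELD_SEP = "\\"  # a single backslash, the field separator described above


def parse_record(line):
    """Split one lexicon line into its fields; empty fields are kept."""
    # Strip either newline convention before splitting.
    return line.rstrip("\r\n").split(FIELD_SEP)


# Made-up record for illustration only: the third field is empty.
record = parse_record("42\\example\\\\7\n")
id_num = record[0]  # IdNum is always the first field
print(id_num, record)
```

Note that AWK counts fields from 1 whereas Python lists count from 0, so the field AWK calls $2 is record[1] here.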

Tools for Processing the Files

Columns can be selected from the lexical files using tools such as AWK (A.V. Aho, B.W. Kernighan & P.J. Weinberger, The AWK Programming Language, New York: Addison-Wesley, 1988) or ICON (R.E. Griswold & M.T. Griswold, The Icon Programming Language, Englewood Cliffs, New Jersey: Prentice-Hall, 1990). As all AWK scripts on the CD have been interpreted and tested using GAWK (GNU AWK) version 2.15, patchlevel 5 on UNIX, and GAWK (GNU AWK) version 2.15, patchlevel 6 (16 bit version) on MS-DOS, we recommend using this freely available version of AWK to recover the lexical data (although it should be stated that we have no particular interest in promoting GAWK or singing the praises of its authors!).

GAWK is available in a number of versions from various ftp-sites:

OS/2 32-bit version: ftp-os2.cdrom.com:pub/os2/32bit/unix/gnuawk.zip
OS/2 16-bit version: ftp-os2.cdrom.com:pub/os2/16bit/unix/gawk2156.zip
DOS 16-bit, and OS/2 and DOS 32-bit version (gawk.exe, gawk-emx.exe): oak.oakland.edu:SimTel/msdos/awk/ with executables in gawk2156.zip and sources in gawk215t.zip
The original GNU sources: prep.ai.mit.edu:pub/gnu/gawk-2.15.6.tar.gz

Alternative sites of interest to European users are ftp.tu-ilmenau.de and ftp.uni-tuebingen.de (both in Germany), both under directory 'pub/gnu', and src.doc.ic.ac.uk (SunSITE London), under directory 'gnu'. Asian users may refer to ftp.uec.ac.jp/pub/wwfs/GNU in Tokyo, Japan.

The complete manual (200 pages) for GNU awk is available with the sources. A printed manual is available from the Free Software Foundation. You can find information on GNU's manuals, disks, and the GNU project in the GNU's Bulletin, available on the newsgroup gnu.announce, or by sending a self-addressed stamped envelope ($0.52) to

   Free Software Foundation
   675 Massachusetts Avenue
   Cambridge, MA 02139

Computing the Representation of Missing Fields

Many lexicon files on this CD-ROM do not contain all the columns specified in the CELEX User Guide. However, all missing fields can be recovered from the CD-ROM files using the AWK scripts in the awk directories, as specified in the corresponding README files. (We have opted for AWK rather than ICON in view of the greater availability of AWK as a standard tool of the UNIX operating system.) The use of AWK scripts rather than providing a full listing of all lexical information is motivated by the following considerations:

The original CELEX databases do not store all columns either, and similarly require on-line computation of lexical fields.
The lexical files on CD-ROM never exceed 28 megabytes in size. If so required, any file can be downloaded from the CD-ROM for efficient local processing. If all columns had been fully spelled out, the files would have been prohibitively large.
The lexical files on CD-ROM have a transparent column structure that can be inspected by UNIX utilities such as head, more, or less, and even with a text editor such as vi or emacs, mostly without loss of transparency due to excessive line overflow.
Accessing a CD-ROM is relatively time-consuming, suggesting that local computation of derived information is to be preferred.

Whenever on-line derivation of missing fields is required, users can run the ready-made AWK scripts (*) in the awk subdirectory of the corresponding lemma or wordform files. This is done as follows:

   awk -f scriptname.awk LexiconFile LexField

The LexField indicates the field in the lexicon file from which the additional representation should be computed, e.g.

   awk -f sortstr.awk eol.cd 2

to compute the alphabetically sorted representation of the headword spelling (HeadLowSort) from the HeadDia field (field 2).

In the case of the phonological representation and the syntactic codes, extra arguments are required, which are specified in the READMEs of their respective subdirectories.

Generally speaking, the scripts only yield the missing representation of one particular field or column, e.g. only the HeadLowSort and no other fields. This can be easily modified, however, by copying the script to your hard disk and adapting the 'printf' statement to include more fields, for example:

   printf("%s\n",LexInfo_1); => printf("%s\\%s\\%s\\%s\n",$1,$2,$3,LexInfo_1);

to retrieve fields 1, 2 and 3 as well, each separated by backslashes (LexInfo_1 always refers to the LexField supplied on the command line, that is, the field that should be converted rather than just retrieved).

(*) The first release of this CD-ROM contained only isolated AWK-functions without 'BEGIN'-segments or print-statements. The original functions still feature in the body of the current AWK-scripts, however, so they can be easily extracted and modified or combined to perform user-defined operations not 'pre-cooked' by us.

Joining Files of Similar Lexical Status

It is often necessary to combine information in files located in different directories. To join columns of different files, the unique identity numbers (IdNum) listed in the first field of every lexical file can serve as linking keys. PLEASE NOTE THAT NO OTHER FIELDS SHOULD BE USED! Homographs, homophones, and other spuriously identical fields may wreak havoc with your intended personal lexicon when other join keys are used. Also note that the wordform lexicons have TWO unique identity numbers, one providing a unique wordform description, to be used for linking wordform lexicons, the other providing a link to the lemma lexicons, allowing one to link wordforms with the lexical information on their corresponding lemmas.

An AWK script specifically designed to join columns of two (large) input files OF SIMILAR LEXICAL STATUS (two lemma files or two wordform files) can be found in the awk subdirectory of the present directory. The join.awk script is called as follows:

  awk -f join.awk file1 file2 LexField_file1 LexField_file2 > your_result

File1 and file2 denote the two input files. LexField_file1 and LexField_file2 specify the field numbers of the lexical information in file1 and file2 that is to be combined in the output file "your_result". This file has the format

  IdNum \\ LexInfo_1 TAB LexInfo_2

This format allows new information from a third file to be joined with the file "your_result":

  awk -f join.awk your_result file3 2 LexField_file3 > your_result2

will result in a file with the fields

  IdNum \\ LexInfo_1 TAB LexInfo_2 TAB LexInfo_3

In this way information from several lexicons can be combined by successive application of join.awk.

For example, if the Dutch PhonStrsDISC column (field 4 in dpl.cd) is to be combined with the Immediate morphological segmentation Imm (field 9 of dml.cd) and the word category ClassNum (field 4 of dsl.cd), the join.awk script should then be called as follows:

  awk -f join.awk dpl.cd dml.cd 4 9 > tmp
  awk -f join.awk tmp dsl.cd 2 4 > "your_result"
  rm tmp

The join.awk script presupposes that both input files are sorted by their IdNum, with the IdNum being the first field, as is the case for all lexical files.
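
The requirement that both inputs be sorted by IdNum is what allows them to be combined in a single pass. The following Python sketch shows the general technique (a merge join); it is not a transcription of join.awk. It assumes that the separator written as \\ in the format above stands for a single backslash, and it emits only those IdNums found in both inputs, which may or may not match the behaviour of the script on the CD-ROM. The sample records are made up.

```python
def merge_join(lines1, lines2, field1, field2, sep="\\"):
    """Yield 'IdNum<sep>info1<TAB>info2' for IdNums present in both inputs.

    field1 and field2 are 1-based field numbers, as on the AWK command line.
    Both inputs must be sorted numerically by IdNum (their first field).
    """
    it1, it2 = iter(lines1), iter(lines2)
    a, b = next(it1, None), next(it2, None)
    while a is not None and b is not None:
        fa = a.rstrip("\r\n").split(sep)
        fb = b.rstrip("\r\n").split(sep)
        ida, idb = int(fa[0]), int(fb[0])
        if ida == idb:
            yield f"{fa[0]}{sep}{fa[field1 - 1]}\t{fb[field2 - 1]}"
            a, b = next(it1, None), next(it2, None)
        elif ida < idb:
            a = next(it1, None)  # no partner in the second input
        else:
            b = next(it2, None)  # no partner in the first input


# Made-up miniature inputs, not real CELEX records.
phon = ["1\\aa\\x\\P1", "2\\bb\\x\\P2", "4\\dd\\x\\P4"]
morph = ["1\\aa\\M1", "3\\cc\\M3", "4\\dd\\M4"]
for row in merge_join(phon, morph, 4, 3):
    print(row)
```

Because the output keeps IdNum as its first field and packs the joined information into field 2, the result can be fed back into the same function with field number 2, just as described for successive applications of join.awk. Open file objects can be passed in place of the lists.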

Joining Files of Dissimilar Lexical Status

Special care is required for joining a wordform lexicon with a lemma lexicon. The lemma unique identity numbers in the wordform lexicon (IdNumLemma) are NOT sorted numerically (these files are sorted by the wordform IdNum). Hence join.awk cannot be used. In order to join a wordform lexicon with a lemma lexicon, a C-program is included in the c subdirectory of the present directory. Using an AWK script for this purpose was found to be unfeasible, as this caused the system to run out of memory even when run on a UNIX machine.

The program, joinwl.c, makes use of an index file recording the initial byte position of every single line in the lemma file, thus speeding up the retrieval process. It comes in TWO VERSIONS, joinwlle.c for so-called Little-Endian machines, and joinwlbe.c for Big-Endian machines. The Endian status of the computer depends on the way the processor interprets multi-byte data types, which (unfortunately!) is important as the byte offsets in the index files are given in four-byte values. You have to determine the Endian status of your machine BEFORE applying either the joinwlle.c or the joinwlbe.c program by running the endian.c program provided in the same directory.
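
Why the byte order matters for four-byte offsets can be illustrated in a few lines of Python. This sketch is no substitute for running endian.c; the offset value is arbitrary, and treating the offsets as unsigned integers is an assumption made for the illustration only.

```python
import struct
import sys

# Reports "little" or "big" for the machine running the interpreter.
print(sys.byteorder)

# The same four-byte value laid out in each byte order.
offset = 1995
little = struct.pack("<I", offset)  # least significant byte first
big = struct.pack(">I", offset)     # most significant byte first
print(little.hex(), big.hex())

# Reading the bytes with the wrong order yields a different number,
# which is why the matching version of the program has to be used.
misread = struct.unpack(">I", little)[0]
print(misread)
```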

The index files, with the same name as the lemma lexicon and the extension 'idx', such as 'dol.idx' for Dutch orthography, are provided alongside the lemma lexicons.
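
The idea of an index of initial byte positions can be sketched as follows. The layout of the 'idx' files on the CD-ROM is not described here, so this Python example keeps the index in memory as a dictionary instead of reading or writing that format; the function names and the sample data are hypothetical.

```python
import io


def build_offsets(binary_file):
    """Map each IdNum (first field) to the byte offset where its line starts."""
    offsets = {}
    position = binary_file.tell()
    for raw in iter(binary_file.readline, b""):
        id_num = int(raw.split(b"\\", 1)[0])
        offsets[id_num] = position
        position += len(raw)
    return offsets


def fetch_record(binary_file, offsets, id_num):
    """Jump straight to the line for id_num and return its fields."""
    binary_file.seek(offsets[id_num])
    line = binary_file.readline().rstrip(b"\r\n")
    return line.decode("ascii").split("\\")


# Made-up lemma data held in memory; a real run would open the file in "rb" mode.
lemmas = io.BytesIO(b"1\\cell\\10\n2\\cellar\\5\n3\\cello\\2\n")
index = build_offsets(lemmas)
print(index)
print(fetch_record(lemmas, index, 3))
```

The file is read in binary mode because the offsets are byte positions: a copy of the file with a different newline convention has different offsets and needs its own index.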

The program is called as follows:

  joinwl(le|be) wordformfile lemmafile "wf_field[|wf_field]" "lm_field[|lm_field]"

The wordformfile and the lemmafile here denote the two input files. The third argument should contain the numbers of one or more fields to be retrieved from the wordform file, while the fourth argument specifies the field numbers from the lemma file. For each argument, field numbers should be enclosed in double quotes, and separated by vertical bars. The result is written to an output file in your current directory with the name of the wordform file and the extension 'out'. Note that specification of the double quotes is obligatory, and that the wordform lexicon should be specified BEFORE the lemma lexicon.

Thus, if the WordSylDia information of the Dutch wordform lexicon (field 9) and its frequency (field 3) should be combined with the HeadSylDia information of the lemma lexicon (8) and its respective frequency count (3), the program should be called as follows:

  joinwl(le|be) dow.cd dol.cd "9|3" "8|3"

The output of this program is automatically written to the file 'dow.out' in your working directory.
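
For small, self-made files the same kind of join can be tried out in Python. The sketch below is not the algorithm of joinwl.c: it holds the lemma records in a dictionary rather than using the byte-offset index, so it needs far more memory on full-size files. The position of IdNumLemma in the wordform records is passed in as a parameter because it is not stated here, the backslash-separated output is an assumption, and the sample records are made up.

```python
def parse_spec(spec):
    """Turn a field list such as "9|3" into zero-based indexes [8, 2]."""
    return [int(number) - 1 for number in spec.split("|")]


def join_wordforms(wordform_lines, lemma_lines, wf_spec, lm_spec, lemma_id_field):
    """Pair selected wordform fields with fields of the corresponding lemma.

    lemma_id_field is the 1-based position of IdNumLemma in the wordform
    records; look it up in the README of the wordform file being used.
    """
    wf_cols, lm_cols = parse_spec(wf_spec), parse_spec(lm_spec)
    # Lemma records keyed by IdNum, so the wordform order does not matter.
    lemmas = {}
    for line in lemma_lines:
        fields = line.rstrip("\r\n").split("\\")
        lemmas[fields[0]] = fields
    for line in wordform_lines:
        wf = line.rstrip("\r\n").split("\\")
        lm = lemmas.get(wf[lemma_id_field - 1])
        if lm is None:
            continue  # no matching lemma; skipped in this sketch
        yield "\\".join([wf[i] for i in wf_cols] + [lm[i] for i in lm_cols])


# Made-up miniature data: wordform fields are IdNum, Word, Freq, IdNumLemma.
wordforms = ["1\\calls\\30\\2", "2\\cat\\50\\1", "3\\called\\40\\2"]
lemma_file = ["1\\cat\\70", "2\\call\\90"]
for row in join_wordforms(wordforms, lemma_file, "2|3", "2|3", 4):
    print(row)
```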

If combining the two files should fail due to platform-specific interpretation of the initial byte positions (e.g. on account of different newline characters), or if you want to combine self-created files derived from the original files, you can generate an index yourself with the 'make_idx' program in the same directory. This should be run as follows:

  make_idx lemmafile 1

The '1' here denotes the field number of the unique IdNum key, which is specified as the first field by default.

File Sizes


   dutch/dab/dab.cd         69399 bytes
   dutch/dct/dct.cd       4107513 bytes
   dutch/dfl/dfl.cd       3574188 bytes
   dutch/dfs/dfs.cd        366605 bytes
   dutch/dfw/dfw.cd      12527494 bytes
   dutch/dml/dml.cd      10433870 bytes
   dutch/dmw/dmw.cd      11159555 bytes
   dutch/dol/dol.cd      10229951 bytes
   dutch/dow/dow.cd      19185443 bytes
   dutch/dpl/dpl.cd      16300823 bytes
   dutch/dpw/dpw.cd      28837473 bytes
   dutch/dsl/dsl.cd       4092534 bytes
   
   english/ect/ect.cd     5861603 bytes
   english/efl/efl.cd     2143469 bytes
   english/efs/efs.cd      263859 bytes
   english/efw/efw.cd     7158616 bytes
   english/eml/eml.cd     4918610 bytes
   english/emw/emw.cd     4936327 bytes
   english/eol/eol.cd     2068262 bytes
   english/eow/eow.cd     7351883 bytes
   english/epl/epl.cd     5480381 bytes
   english/epw/epw.cd    15567897 bytes
   english/esl/esl.cd     5560364 bytes
   
   german/gct/gct.cd      7481411 bytes
   german/gfl/gfl.cd      2155299 bytes
   german/gfw/gfw.cd     17283964 bytes
   german/gml/gml.cd      4941770 bytes
   german/gmw/gmw.cd     12342460 bytes
   german/gol/gol.cd      3095599 bytes
   german/gow/gow.cd     15932928 bytes
   german/gpl/gpl.cd      6669899 bytes
   german/gpw/gpw.cd     28982785 bytes
   german/gsl/gsl.cd      2537557 bytes
                         -------------- +
                        283619791 bytes  (283.6 megabytes)

Organization

Project Manager CELEX 1985-1992: H. Kerkman
Project Manager CELEX 1993- : R. Piepenbrock
Original Programming Concept : Dr. R. H. Baayen
1995 Extensions & Revisions : R. Piepenbrock & L. Gulikers, aided and abetted by H. Drexler
Cover Design : I. Doehring

CELEX Board:

Dr. W. J. M. Levelt (Max Planck Institute for Psycholinguistics, chair)
Dr. R. Collier (Institute for Perception Research)
Dr. R. Schreuder (Nijmegen University)
Dr. P. van Sterkenburg (Institute for Dutch Lexicology)

Many people have been involved in the CELEX project through the years. The list given below is an attempt to give an exhaustive overview of their respective contributions, although it must be added that dividing lines between these disciplines have by no means been strict.

Database and user interface design:

Eric Willems
Eddy Bronkhorst
Marcel van der Peijl
Maurice van Hinsberg
Vincent Karthaus
Hans Kerkman
Domien Kusters
Hans Drexler

System and database management:

Christa Hausmann-Jamin
Marcel Bingley
Ger Cox
Cees van der Veer
Richard Piepenbrock

The Dutch lexicon:

Hans Kerkman
Domien Kusters
Veronique Remmelts-Oudkerk
Ton van der Wouden
Gilbert Rattink
Wim Peters

The English lexicon:

Francoise Keulen
Domien Kusters
Gavin Burnage
Richard Piepenbrock
Caroline den Os

The German lexicon:

Gilbert Rattink
Leon Gulikers
Richard Piepenbrock

Original Sources of the CELEX Database

An overview of the sources used to compile the CELEX database, specified for each language.

DUTCH

Van Dale's Comprehensive Dictionary of Contemporary Dutch (1984):
- 80,000 lemmata
Word List of the Dutch Language ('Groene Boekje') (1954), revised version:
- 65,000 lemmata
The most frequent lemmata from the text corpus of the Institute for Dutch Lexicology (INL) (42,380,000 words in all):
- 15,000 lemmata

Note that there is a considerable overlap between the two dictionary sources (of approx. 45,000 lemmata). Other lemmata were added to enable morphological decomposition of the basic set of lemmata.

When compared with the INL text corpus, the coverage of CELEX-lemmata is 95% of the total corpus. This figure is fairly skewed, because in order to reduce the bulk of the INL corpus type list, all hapax legomena (which included many OCR (scanning) errors) were omitted and therefore cannot be retrieved from the corpus type list. The current corpus type list comprises 211,389 entries, as opposed to the original 321,000 inclusive of the hapax legomena.

The INL corpus, in the version derived for calculating the CELEX frequencies, consists of 930 entire fiction and non-fiction books (approx. 30% fiction, 70% non-fiction) published between 1970 and 1988. Newspapers, magazines, children's books, textbooks and specialist literature do not feature in the collection. The CELEX version is static, although the INL corpus continues to be expanded.

(See S. Hazenberg (1994). Een Keur van Woorden. VU Amsterdam. (p. 36 ff.) and J.G. Kruyt (1995). 'Nationale Tekstcorpora in Internationaal Perspectief'. Forum der Letteren, 36-1, 47-58).

ENGLISH

Oxford Advanced Learner's Dictionary (1974):
- 41,000 lemmata
Longman Dictionary of Contemporary English (1978):
- 53,000 lemmata

Note that there is a considerable overlap between the two dictionary sources (of approx. 30,000 lemmata). Other lemmata were added to enable morphological decomposition of the basic set of lemmata.

No extra lemmata have as yet been added from a text corpus. Nevertheless, when compared with the 17.9 million word corpus of Birmingham University/COBUILD, the coverage of CELEX-lemmata is 92% of the total corpus.

The 17.9 million token COBUILD/Birmingham corpus, on which the CELEX frequencies have been based, contains some American English texts. These are exclusively written texts (from a total of 16.6m tokens), and not spoken ones (from a total of 1.3m). There are 284 written texts in all, 44 of which are of indisputably American origin. These American texts make up 15.4% of the total written corpus. Other texts are by authors originating from countries with other English dialects, or are difficult to classify, such as Alistair Cooke.

Paradoxically, even when a text sample is of American origin, the spelling has nearly always been adapted to a British English standard, as the corpus was compiled in England where they had to make do with British English editions of these texts. Thus Irwin Shaw's 'Rich Man Poor Man' appears in a British English edition. In this respect, we were constrained by the limitations of the sources as offered to us.

As for the wordform and lemma-based frequencies, we have not taken the sources into account, as the disambiguation of highly frequent words would necessitate lengthy, labour-intensive rounds of coding. Therefore, we have taken random samples of these types, disambiguated these, and extrapolated the results for the total type frequency. This means we have lost the connection to distinct textual categories like 'newspaper' or 'American English'.

When a type is not ambiguous, however, you can go to the CD-ROM directory /english/ect, where the fifth field (FreqWA) of the file ect.cd gives you the exact written American frequency (the ect.cd file contains raw string frequencies only).

For the ambiguous types, it might be the case that the division BrE-AmE gives you some idea of the relative frequencies for each wordform. In most cases, clearly, this cannot be done with anything approaching certainty.

For a detailed account of the text samples making up the COBUILD/Birmingham corpus, see

J.M. Sinclair (Ed.) (1987). Looking Up. London/Glasgow: Collins/COBUILD. (The corpus used by CELEX is an extended version of the Reserve Corpus described on p. 10 ff.)

GERMAN

Bonnlex, tapes supplied by the Institute for Communication Research and Phonetics in Bonn
Molex, tapes supplied by the Institute for German Language in Mannheim
Noetic Circle Services (MIT) German spelling lexicon

As all sources were genuine computer data rather than electronic versions (or typesetting tapes) of paper dictionaries as was the case with the sources for the other languages, and all contained a variety of flections, stems and lemmata, no figures can be given as to what lemmata derive from exactly which source. In most cases, inflectional data from one tape were merged with stem information from another.

Other lemmata were added to enable morphological decomposition of the basic set of lemmata. No extra lemmata have as yet been added from a text corpus. When compared with the 6 million word corpus of the Institute for German Language at Mannheim, the coverage of CELEX lemmata is 83% of the total corpus.

The corpus used by CELEX for deriving the German (as yet undisambiguated) lemma and wordform frequencies consists of 5.4 million German tokens from written texts like newspapers, fiction and non-fiction, and 600,000 tokens of transcribed speech. The former is a combination of the 'Mannheimer Korpus I', 'Mannheimer Korpus II' and the 'Bonner Zeitungskorpus 1', while the latter is known as the 'Freiburger Korpus'. All of these can also be consulted on request by remote login to the Institut fuer Deutsche Sprache in Mannheim through the COSMAS interface. The corpus is relatively ill-balanced, especially in view of its small size, since novels like Grass' 'Die Blechtrommel' and Boell's 'Ansichten eines Clowns' are included in their entirety. All texts in the corpus were published or recorded between 1949 and 1975.

A Brief Overview of the English Data on the CD-ROM

When starting to use the English database, the user first has to choose between three so-called `lexicon types':

a lemma lexicon
a wordform lexicon
a corpus lexicon

Each lexicon type uses a specific kind of entry. The CELEX lemma lexicon is the one most similar to an ordinary dictionary since every entry in this lexicon represents a set of related inflected words. In a lexicon, a lemma can be represented by using a headword (cf. traditional dictionary entries) such as, for example, `call' or `cat'. The wordform lexicon yields all possible inflected words: every entry in the lexicon is an inflectional variant of the related headword or stem. So, a wordform lexicon contains words like `call', `calls', `calling', `called', `cat', `cats' and so on. A corpus type lexicon, on the other hand, simply gives you an ordered list of all alphanumeric strings found in the corpus with raw string counts, undisambiguated for relations to either lemmas or wordforms.

For all types of lexicons, the user may subsequently select any number of columns -- from a total of approximately 950(!) database columns -- combining information on the orthography, phonology, morphology, syntax and frequency of the entries. The information sheet `Lexical Data, English' summarizes the types of information available. An exhaustive overview of the columns available is given in the CELEX User Guide.

LEXICAL DATA, ENGLISH

The lexical data that can be selected for each entry in the different English lexicon types can be divided into five categories: orthography, phonology, morphology, syntax and frequency. In a separate section, example data are given for each of these categories.

Orthography (spelling)
- with or without diacritics
- with or without word division positions
- alternative spellings
- number of letters/syllables
Phonology (pronunciation)
- phonetic transcriptions (using SAMPA notation or Computer Phonetic Alphabet (CPA) notation) with:
  - syllable boundaries
  - primary and secondary stress markers
- consonant-vowel patterns
- number of phonemes/syllables
- alternative pronunciations
Morphology (word structure)
- Derivational/compositional:
  - division into stems and affixes
  - flat or hierarchical representations
- Inflectional:
  - stems and their inflections
Syntax (grammar)
- word class
- subcategorisations per word class
Frequency
- COBUILD frequency(*)

(*) These frequency data are based on the COBUILD corpus (18 million words), compiled by the University of Birmingham, Great Britain.
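
The syllable boundaries and stress markers listed under Phonology can be read off a transcription mechanically. The Python sketch below assumes the conventions visible in the example data: `-' separates syllables, `"' precedes a syllable with primary stress, and `%' precedes one with secondary stress. The function name `describe_transcription' is invented for this example.

```python
def describe_transcription(transcription):
    """Count the syllables in a transcription and report which ones
    (numbered from 1) carry primary or secondary stress.

    Assumes '-' marks syllable boundaries, '"' primary stress and
    '%' secondary stress, as in the example data."""
    syllables = transcription.split("-")
    primary = [i for i, s in enumerate(syllables, start=1) if s.startswith('"')]
    secondary = [i for i, s in enumerate(syllables, start=1) if s.startswith("%")]
    return {"syllables": len(syllables), "primary": primary, "secondary": secondary}


# Transcription of `celebration' as given in the example data.
print(describe_transcription('%sE-lI-"breI-Sn,'))
```

For `celebration' this reports four syllables, with primary stress on the third and secondary stress on the first.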

EXAMPLE DATA, ENGLISH

An arbitrary query using a small English lemma lexicon (that is, one with very few columns) might yield the following result:


   -----------------------------------------------------------
   Headword    Pronunciation    Morphology:         M: Cl Freq
                                Structure           Cl
   ----------- ---------------- ------------------- -- -- ----
   celebrant   "sE-lI-br@nt     ((celebrate),(ant)) Vx N     6
   celebration %sE-lI-"breI-Sn, ((celebrate),(ion)) Vx N   201
   cell        "sEl             (cell)              N  N  1210
   cellar      "sE-l@r*         (cellar)            N  N   228
   cellarage   "sE-l@-rIdZ      ((cellar),(age))    Nx N     0
   cellist     "tSE-lIst        ((cello),(ist))     Nx N     5
   cello       "tSE-l@U         (cello)             N  N    25
   cellular    "sEl-jU-l@r*     ((cell),(ular))     Nx A    21
   celluloid   "sEl-jU-lOId     ((cellulose),(oid)) Nx N    29
   -----------------------------------------------------------
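
The bracketed strings in the `Morphology: Structure' column can be turned into nested lists with a small recursive parser. The Python sketch below is based only on the notation as it appears in the table above (parentheses around each element, commas between sister elements); the actual column formats on the CD-ROM may carry more information, and the function name `parse_structure' is invented for this example.

```python
def parse_structure(text):
    """Parse a bracketed structure such as '((cell),(ular))' into a
    nested list (['cell', 'ular']); a simple stem such as '(cell)'
    is returned as the plain string 'cell'."""
    pos = 0

    def peek():
        # Current character, or '' when the end of the text is reached.
        return text[pos] if pos < len(text) else ""

    def expect(char):
        nonlocal pos
        if peek() != char:
            raise ValueError(f"expected {char!r} at position {pos} in {text!r}")
        pos += 1

    def node():
        nonlocal pos
        expect("(")
        if peek() == "(":
            # Complex element: one or more comma-separated sub-elements.
            result = [node()]
            while peek() == ",":
                pos += 1
                result.append(node())
        else:
            # Simple element: everything up to the closing parenthesis.
            end = text.find(")", pos)
            if end == -1:
                raise ValueError(f"missing ')' in {text!r}")
            result = text[pos:end]
            pos = end
        expect(")")
        return result

    tree = node()
    if pos != len(text):
        raise ValueError(f"unexpected trailing text in {text!r}")
    return tree


print(parse_structure("((celebrate),(ant))"))
print(parse_structure("(cell)"))
```

Because the parser is recursive, it also accepts deeper nesting of the same shape, which is how a hierarchical representation would differ from a flat one under these assumptions.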

An example selection from a small English wordform lexicon, showing the inflectional variants of the headwords given in the previous example, is presented in the next table:


   -----------------------------------------------------------
   Word         Word division   Pronunciation     Cl Type Freq
   ------------ --------------- ----------------- -- ---- ----
   celebrant    cel-e-brant     "sE-lI-br@nt      N  sing    2
   celebrants   cel-e-brants    "sE-lI-br@nts     N  plu     4
   celebration  cel-e-bra-tion  %sE-lI-"breI-Sn,  N  sing  144
   celebrations cel-e-bra-tions %sE-lI-"breI-Sn,z N  plu    57
   cell         cell            "sEl              N  sing  655
   cells        cells           "sElz             N  plu   555
   cellar       cel-lar         "sE-l@r*          N  sing  187
   cellars      cel-lars        "sE-l@z           N  plu    41
   cellarage    cel-lar-age     "sE-l@-rIdZ       N  sing    0
   cellarages   cel-lar-ag-es   "sE-l@-rI-dZIz    N  plu     0
   cellist      cel-list        "tSE-lIst         N  sing    5
   cellists     cel-lists       "tSE-lIsts        N  plu     0
   cello        cel-lo          "tSE-l@U          N  sing   24
   cellos       cel-los         "tSE-l@Uz         N  plu     1
   cellular     cel-lu-lar      "sEl-jU-l@r*      A  pos    21
   celluloid    cel-lu-loid     "sEl-jU-lOId      N  sing   29
   -----------------------------------------------------------
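
In these two example tables, the frequency of each lemma equals the sum of the frequencies of its wordforms (for instance, 144 + 57 = 201 for `celebration'). The Python sketch below checks this for the example rows only. The figures are copied from the tables above; pairing each wordform with its headword by name is an assumption made for this example, since the files on the CD-ROM are linked through identity numbers.

```python
from collections import defaultdict

# Headword -> frequency, copied from the lemma example table.
LEMMA_FREQ = {
    "celebrant": 6, "celebration": 201, "cell": 1210, "cellar": 228,
    "cellarage": 0, "cellist": 5, "cello": 25, "cellular": 21,
    "celluloid": 29,
}

# (wordform, headword, frequency), copied from the wordform example table.
# The headword field is added here by hand (an assumption of this sketch).
WORDFORMS = [
    ("celebrant", "celebrant", 2), ("celebrants", "celebrant", 4),
    ("celebration", "celebration", 144), ("celebrations", "celebration", 57),
    ("cell", "cell", 655), ("cells", "cell", 555),
    ("cellar", "cellar", 187), ("cellars", "cellar", 41),
    ("cellarage", "cellarage", 0), ("cellarages", "cellarage", 0),
    ("cellist", "cellist", 5), ("cellists", "cellist", 0),
    ("cello", "cello", 24), ("cellos", "cello", 1),
    ("cellular", "cellular", 21), ("celluloid", "celluloid", 29),
]


def summed_wordform_frequencies(wordforms):
    """Add up the wordform frequencies belonging to each headword."""
    totals = defaultdict(int)
    for _word, headword, freq in wordforms:
        totals[headword] += freq
    return dict(totals)


totals = summed_wordform_frequencies(WORDFORMS)
for headword, lemma_freq in LEMMA_FREQ.items():
    status = "ok" if totals[headword] == lemma_freq else "MISMATCH"
    print(f"{headword:12} lemma={lemma_freq:5} sum={totals[headword]:5} {status}")
```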

Known Bugs and Errors on the CD (Second Release, 1995)

Top README, line 615:
Dutch Lexicology (INL) (42,380,000 million words in all): --> Dutch Lexicology (INL) (42,380,000 words in all):
Top README, line 725:
The corpus used by CELEX for deriving the German (as yetundisambiguated) --> The corpus used by CELEX for deriving the German (as yet undisambiguated)
C-programfile c/joinwlbe.c, line 87 and
C-programfile c/joinwlle.c, line 87:
fpin3 = fopen(filename,"r+b"); --> fpin3 = fopen(filename,"rb");
This is a non-trivial error: it causes the joinwl program to exit without accomplishing its task.
The function mistakenly assumes that the corresponding '*.idx' index files are writable, which they are not on a read-only CD. The result is a misleading error message stating that the file does not exist. Since the '*.idx' file only needs to be readable, the patch above opens it as a read-only binary file.
Please change the program as shown above and recompile it.
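
The difference between the two open modes can be reproduced outside the C programs. Python's built-in open() accepts the same "rb" and "r+b" mode strings as fopen(), so the sketch below uses it to show that a file without write permission can still be opened with "rb", whereas "r+b" is normally refused. The file name `example.idx' and the helper `can_open' are invented for this example, and the outcome of the "r+b" attempt depends on the platform and on the privileges of the user running it.

```python
import os
import tempfile


def can_open(path, mode):
    """Return True if the file at path can be opened with the given mode."""
    try:
        with open(path, mode):
            return True
    except OSError:
        return False


# Demonstration on a temporary file whose write permission is removed,
# standing in for an index file on a read-only CD-ROM.
with tempfile.TemporaryDirectory() as tmp:
    idx_path = os.path.join(tmp, "example.idx")  # hypothetical file name
    with open(idx_path, "wb") as handle:
        handle.write(b"\x00\x01")
    os.chmod(idx_path, 0o444)  # readable only

    print("rb :", can_open(idx_path, "rb"))
    # Normally refused; a privileged user (e.g. root) may still succeed.
    print("r+b:", can_open(idx_path, "r+b"))

    os.chmod(idx_path, 0o644)  # restore so the directory can be removed
```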

UNIX is a trademark of AT&T Bell Laboratories
PostScript is a trademark of Adobe Systems Incorporated
Oracle and SQL*PLUS are trademarks of Oracle Corporation