CELEX is the Dutch Centre for Lexical Information. It was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. Over the years it has been funded mainly by the Netherlands Organisation for Scientific Research (NWO) and the Dutch Ministry of Science and Education. CELEX is now part of the Max Planck Institute for Psycholinguistics.
This CD-ROM contains plain ASCII versions of the CELEX lexical databases of English (version 2.5), Dutch (version 3.1) and German (version 2.5). The original CELEX databases can be consulted interactively either by using the SQL*PLUS query language within an ORACLE RDBMS environment, or by means of the specially designed user interface FLEX. As the FLEX interface has been written to communicate with the underlying UNIX operating system and the ORACLE software, it is completely bound to this particular configuration and hence cannot be distributed separately and does not feature on the CD-ROM.
To make for greater compatibility with other operating systems, the databases on the CD-ROM have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files that can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files.
As in the original databases, some kinds of information have to be computed on-line. Wherever necessary, AWK scripts have been provided to recover this information. Also, some C-programs have been included, along with their MS-DOS executables and HP-UX (Hewlett-Packard UNIX) binaries. README files specify the details of their use.
The CD-ROM is mastered using the ISO 9660 data format, with the Rock Ridge extensions, allowing it to be used in VMS, MS-DOS(*), Macintosh (**) and UNIX(***) environments.
Anyone who would like to purchase the CD-ROM should send a check or purchase order made payable to the "Trustees of the University of Pennsylvania" to
Linguistic Data Consortium
441 Williams Hall
University of Pennsylvania
Philadelphia, PA 19104-6305
ldc@unagi.cis.upenn.edu
Tel: +1/215/898-0464
Fax: +1/215/573-2175

(*) PC-users may experience some difficulty reading the README files on the CD, as these were compiled on a UNIX system and as a consequence lack carriage return characters at the end of lines. This may result in a near-continuous string of characters, broken only by linefeeds, running off the right-hand side of the screen. Reading these files into word processors like WordPerfect or MS-Word will usually correct this, as these automatically detect and convert the linefeeds found in the text. The same problem will appear when scrolling the lexical data files, but this lack of carriage returns will not affect the functioning of the AWK programs. Because of the large size of these files, we do not recommend reading lexical data files into word processors. Try to retrieve the items you want by using the predefined AWK-scripts or more-or-less standard DOS tools such as 'find' and 'grep' for retrieving single records.
(**) If someone has a Mac with a CD-ROM drive that was obtained before 12/92, and has not installed any system upgrades since that date, then that system will not be able to read the CELEX CD-ROM. In such a case, all that is needed is to obtain the upgraded driver software (a very small amount of code), and copy it onto the system in place of the existing driver. The upgrade can be obtained as follows:
Connect to ftp server: ftp.apple.com
Go to directory: dts/mac/sys.soft/cdrom
Get file: cd-rom-setup
(***) Some UNIX systems will have trouble displaying the contents of the lexical data and documentation files due to the presence of a semicolon plus a version number at the end of the filetype. These have been included to conform to the ISO 9660 standard required for CD-ROM production. If your system interprets the semicolon as a command line delimiter, either escape the semicolon with a backslash or access the file by replacing the semicolon and number with wildcards, e.g. two question marks.
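The two file-handling problems described in notes (*) and (***) can also be handled by a small program once files have been copied from the CD. What follows is only a minimal sketch in Python, which is not supplied on the CD-ROM; the function names and the file name 'eol.cd;1' used to try it out are hypothetical and serve for illustration only.

```python
import re


def strip_iso_version(filename):
    """Remove a trailing ISO 9660 version suffix such as ';1' from a file name."""
    return re.sub(r";\d+$", "", filename)


def to_dos_newlines(data):
    """Return the bytes with every line ending written as CR + LF."""
    # First reduce any existing CR + LF pairs to LF so they are not doubled.
    unix = data.replace(b"\r\n", b"\n")
    return unix.replace(b"\n", b"\r\n")


# Hypothetical file name carrying a version suffix.
print(strip_iso_version("eol.cd;1"))
```

The newline conversion is only worthwhile for the README files; as noted above, the AWK programs do not need carriage returns in the lexical data files.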
The CELEX User Guide describes in detail the kinds of information in the databases using unique labels for each column in the RDB. For instance, the syllabified phonological headword information for English lemmas, with stress markers, in the CPA character set, is referred to as "PhonStrsCPA".
The README files describing the columns of the lexical files on this CD-ROM use the same labels. In order to facilitate the use of this CD-ROM, the relevant sections in the User Guide on English, German and Dutch have been made available as PostScript files in both European A4 and American Letter format. Users preferring a bound hardcopy of the CELEX User Guide (Dfl 115,--) should send a request by electronic mail to
celex@mpi.nl (INTERNET)
or by surface mail to
Richard Piepenbrock (CELEX project manager)
Max Planck Institute for Psycholinguistics
P.O. Box 310
NL-6500 AH Nijmegen
THE NETHERLANDS
P.S. For additional information (also on the on-line database version) and the latest news on updates and the like, you are invited to consult the CELEX homepage on the World Wide Web at
http://www.kun.nl/celex/
Copyright Centre for Lexical Information
LICENCE: The copyright holder grants to the purchaser of this CD-ROM unrestricted license to use all the lexical information included herein for research purposes only, subject to the following restrictions:
(*) This CD-ROM should be referred to as: R. H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995.
All lexicon files are characterized by the following properties:
Columns can be selected from the lexical files using tools such as AWK (A.V.Aho, B.W.Kernighan & P.J.Weinberger, The AWK Programming Language, New York: Addison-Wesley, 1988) or ICON (R.E.Griswold & M.T.Griswold, The Icon Programming Language, Englewood Cliffs, New Jersey: Prentice-Hall, 1990). As all AWK scripts on the CD have been interpreted and tested using GAWK (GNU AWK) version 2.15, patchlevel 5 on UNIX, and GAWK (GNU AWK) version 2.15, patchlevel 6 (16 bit version) on MS-DOS, we recommend using this freely available version of AWK to recover the lexical data (although it should be stated that we have no particular interest in promoting GAWK or singing the praises of its authors!).
GAWK is available in a number of versions from various ftp-sites:
Alternative sites of interest to European users are ftp.tu-ilmenau.de and ftp.uni-tuebingen.de (both in Germany), both under directory 'pub/gnu', and src.doc.ic.ac.uk (SunSITE London), under directory 'gnu'. Asian users may refer to ftp.uec.ac.jp/pub/wwfs/GNU in Tokyo, Japan.
The complete manual (200 pages) for GNU awk is available with the sources. A printed manual is available from the Free Software Foundation. You can find information on GNU's manuals, disks, and the GNU project in the GNU's Bulletin, available on the newsgroup gnu.announce, or by sending a self-addressed stamped envelope ($0.52) to
Free Software Foundation
675 Massachusetts Avenue
Cambridge, MA 02139
Many lexicon files on this CD-ROM do not contain all the columns specified in the CELEX User Guide. However, all missing fields can be recovered from the CD-ROM files using the AWK scripts in the awk directories, as specified in the corresponding README files. (We have opted for AWK rather than ICON in view of the greater availability of AWK as a standard tool of the UNIX operating system.) The use of AWK scripts rather than providing a full listing of all lexical information is motivated by the following considerations:
Whenever on-line derivation of missing fields is required, users can run the ready-made AWK scripts (*) in the awk subdirectory of the corresponding lemma or wordform files. This is done as follows:
awk -f scriptname.awk LexiconFile LexField
The LexField indicates the field in the lexicon file from which the additional representation should be computed, e.g.
awk -f sortstr.awk eol.cd 2
to compute the alphabetically sorted representation of the Headword spelling (HeadLowSort) from the HeadDia field (field 2).
In the case of the phonological representation and the syntactic codes, extra arguments are required, which are specified in the READMEs of their respective subdirectories.
Generally speaking, the scripts only yield the missing representation of one particular field or column, e.g. only the HeadLowSort and no other fields. This can be easily modified, however, by copying the script to your hard disk and adapting the 'printf' statement to include more fields, for example:
printf("%s\n",LexInfo_1); => printf("%s\\%s\\%s\\%s\n",$1,$2,$3,LexInfo_1);
to retrieve fields 1, 2 and 3 as well, each separated by backslashes (LexInfo_1 always refers to the LexField supplied on the command line, as the field that should be converted rather than just retrieved).
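For readers who prefer to post-process the files outside AWK, the effect of the modified 'printf' statement can be sketched as follows. This is an illustration in Python under stated assumptions, not a translation of any script on the CD: the function name, the sample record and the use of lower-casing as a stand-in conversion are all hypothetical, and fields are assumed to be separated by backslashes as in the 'printf' format above.

```python
def with_extra_fields(line, lex_field, convert, keep=(1, 2, 3), sep="\\"):
    """Return the fields in `keep`, followed by the converted LexField.

    Field numbers are 1-based, like $1, $2, ... in AWK.
    """
    fields = line.rstrip("\r\n").split(sep)
    converted = convert(fields[lex_field - 1])
    return sep.join([fields[n - 1] for n in keep] + [converted])


# Invented record; str.lower merely stands in for a real conversion.
sample = "7\\Cello\\25\\x\n"
print(with_extra_fields(sample, 2, str.lower))
```

The 'keep' argument plays the role of the extra '$1,$2,$3' arguments in the adapted 'printf' statement.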
(*) The first release of this CD-ROM contained only isolated AWK-functions without 'BEGIN'-segments or print-statements. The original functions still feature in the body of the current AWK-scripts, however, so they can be easily extracted and modified or combined to perform user-defined operations not 'pre-cooked' by us.
It is often necessary to combine information in files located in different directories. To join columns of different files, the unique identity numbers (IdNum) listed in the first field of every lexical file can serve as linking keys. PLEASE NOTE THAT NO OTHER FIELDS SHOULD BE USED! Homographs, homophones, and other spuriously identical fields may wreak havoc with your intended personal lexicon when other join keys are used. Also note that the wordform lexicons have TWO unique identity numbers, one providing a unique wordform description, to be used for linking wordform lexicons, the other providing a link to the lemma lexicons, allowing one to link wordforms with the lexical information on their corresponding lemmas.
An AWK script specifically designed to join columns of two (large) input files OF SIMILAR LEXICAL STATUS (two lemma files or two wordform files) can be found in the awk subdirectory of the present directory. The join.awk script is called as follows:
awk -f join.awk file1 file2 LexField_file1 LexField_file2 > your_result
File1 and file2 denote the two input files. LexField_file1 and LexField_file2 specify the field numbers of the lexical information in file1 and file2 that is to be combined in the output file "your_result". This file has the format
IdNum \\ LexInfo_1 TAB LexInfo_2
This format allows new information from a third file to be joined with the file "your_result":
awk -f join.awk your_result file3 2 LexField_file3 > your_result2
will result in a file with the fields
IdNum \\ LexInfo_1 TAB LexInfo_2 TAB LexInfo_3
In this way information from several lexicons can be combined by successive application of join.awk.
For example, if the Dutch PhonStrsDISC column (field 4 in dpl.cd) is to be combined with the Immediate morphological segmentation Imm (field 9 of dml.cd) and the word category ClassNum (field 4 of dsl.cd), the join.awk script should then be called as follows:
awk -f join.awk dpl.cd dml.cd 4 9 > tmp
awk -f join.awk tmp dsl.cd 2 4 > "your_result"
rm tmp
The join.awk script presupposes that both input files are sorted by their IdNum, with the IdNum being the first field, as is the case for all lexical files.
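The reason both files must be sorted by IdNum is that two sorted files can be joined in a single pass, holding only one record from each file in memory at a time. The sketch below illustrates this merge technique in Python. It is not the join.awk script itself: the function name and the sample records are invented, fields are assumed to be backslash-separated with a numeric IdNum in the first field, and the sketch keeps only IdNums found in both inputs, which may differ from what join.awk does with unmatched records.

```python
def join_sorted(lines1, lines2, field1, field2, sep="\\"):
    """Yield 'IdNum<sep>info1<TAB>info2' for each IdNum present in both inputs.

    Both inputs must be sorted numerically by IdNum (field 1).
    """
    def records(lines, field):
        for line in lines:
            parts = line.rstrip("\r\n").split(sep)
            yield int(parts[0]), parts[field - 1]

    it1, it2 = records(lines1, field1), records(lines2, field2)
    rec1, rec2 = next(it1, None), next(it2, None)
    while rec1 is not None and rec2 is not None:
        if rec1[0] == rec2[0]:
            yield f"{rec1[0]}{sep}{rec1[1]}\t{rec2[1]}"
            rec1, rec2 = next(it1, None), next(it2, None)
        elif rec1[0] < rec2[0]:
            rec1 = next(it1, None)   # IdNum missing from the second input
        else:
            rec2 = next(it2, None)   # IdNum missing from the first input


# Invented sample records.
file1 = ["1\\x\\A", "2\\y\\B", "4\\z\\C"]
file2 = ["1\\p", "3\\q", "4\\r"]
first = list(join_sorted(file1, file2, 3, 2))

# A further file can be joined to the result by selecting field 2 of it,
# because everything after the first backslash counts as one field.
file3 = ["1\\u", "4\\v"]
second = list(join_sorted(first, file3, 2, 2))
```

Passing open file objects instead of lists works the same way, since the function only iterates over lines.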
Special care is required for joining a wordform lexicon with a lemma lexicon. The lemma unique identity numbers in the wordform lexicon (IdNumLemma) are NOT sorted numerically (these files are sorted by the wordform IdNum). Hence join.awk cannot be used. In order to join a wordform lexicon with a lemma lexicon, a C-program is included in the c subdirectory of the present directory. Using an AWK script for this purpose was found to be unfeasible, as this caused the system to run out of memory even when run on a UNIX machine.
The program, joinwl.c, makes use of an index file recording the initial byte position of every single line in the lemma file, thus speeding up the retrieval process. It comes in TWO VERSIONS, joinwlle.c for so-called Little-Endian machines, and joinwlbe.c for Big-Endian machines. The Endian status of the computer depends on the way the processor interprets multi-byte data types, which (unfortunately!) is important as the byte offsets in the index files are given in four-byte values. You have to determine the Endian status of your machine BEFORE applying either the joinwlle.c or the joinwlbe.c program by running the endian.c program provided in the same directory.
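Why the Endian status matters for four-byte values can be seen from a small illustration. The Python sketch below is not a replacement for endian.c; the function name is hypothetical, the offsets are assumed to be unsigned, and the byte sequence is an invented example. It shows that the same four bytes yield two different offsets depending on the byte order assumed.

```python
import sys


def read_offset(raw, byte_order):
    """Decode one four-byte unsigned offset; byte_order is 'little' or 'big'."""
    if len(raw) != 4:
        raise ValueError("expected exactly four bytes")
    return int.from_bytes(raw, byte_order)


# The offset 4096 as a little-endian machine would store it.
raw = bytes([0x00, 0x10, 0x00, 0x00])

print(sys.byteorder)                # byte order of the machine running this
print(read_offset(raw, "little"))   # 4096
print(read_offset(raw, "big"))      # 1048576
```

Reading an index with the wrong byte order therefore produces offsets that point to the wrong place in the lemma file, or beyond its end.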
The index files, with the same name as the lemma lexicon and the extension 'idx', such as 'dol.idx' for Dutch orthography, are provided alongside the lemma lexicons.
The program is called as follows:
joinwl(le|be) wordformfile lemmafile "wf_field[|wf_field]" "lm_field[|lm_field]"
The wordformfile and the lemmafile here denote the two input files. The third argument should contain the numbers of one or more fields to be retrieved from the wordform file, while the fourth argument specifies the field numbers from the lemma file. For each argument, field numbers should be enclosed in double quotes, and separated by vertical bars. The result is written to an output file in your current directory with the name of the wordform file and the extension 'out'. Note that specification of the double quotes is obligatory, and that the wordform lexicon should be specified BEFORE the lemma lexicon.
Thus, if the WordSylDia information of the Dutch wordform lexicon (field 9) and its frequency (field 3) should be combined with the HeadSylDia information of the lemma lexicon (8) and its respective frequency count (3), the program should be called as follows:
joinwl(le|be) dow.cd dol.cd "9|3" "8|3"
The output of this program is automatically written to the file 'dow.out' in your working directory.
If combining the two files should fail due to platform-specific interpretation of the initial byte positions (e.g. on account of different newline characters), or if you want to combine self-created files derived from the original files, you can generate an index yourself with the 'make_idx' program in the same directory. This should be run as follows:
make_idx lemmafile 1
The '1' here denotes the field number of the unique IdNum key, which is specified as the first field by default.
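The idea behind the index, namely recording the byte position at which each lemma line starts so that a lemma can be fetched directly for each wordform, can be sketched as follows. This Python illustration is not the joinwl or make_idx program: it keeps the index in an in-memory dictionary rather than in the four-byte '.idx' format, the function names are hypothetical, the sample records and their IdNums are invented, and the position of the lemma IdNum in the wordform file is passed in as a parameter rather than assumed.

```python
import io


def build_index(lemma_file, key_field=1, sep=b"\\"):
    """Map each IdNum to the byte offset at which its line starts."""
    index, offset = {}, 0
    lemma_file.seek(0)
    for line in lemma_file:
        key = line.rstrip(b"\r\n").split(sep)[key_field - 1]
        index[key] = offset
        offset += len(line)   # counts the newline bytes actually present
    return index


def join_wordforms(wordform_file, lemma_file, index, lemma_id_field,
                   wf_fields, lm_fields, sep=b"\\"):
    """Yield the selected wordform fields followed by the lemma fields."""
    for line in wordform_file:
        wf = line.rstrip(b"\r\n").split(sep)
        offset = index.get(wf[lemma_id_field - 1])
        if offset is None:
            continue          # no lemma with this IdNum in the index
        lemma_file.seek(offset)
        lm = lemma_file.readline().rstrip(b"\r\n").split(sep)
        picked = [wf[n - 1] for n in wf_fields] + [lm[n - 1] for n in lm_fields]
        yield sep.join(picked)


# Invented records: IdNum, spelling, frequency (and lemma IdNum for wordforms).
lemmas = io.BytesIO(b"1\\cell\\1210\n2\\cello\\25\n")
wordforms = io.BytesIO(b"10\\cells\\555\\1\n11\\cellos\\1\\2\n12\\cell\\655\\1\n")

index = build_index(lemmas)
joined = list(join_wordforms(wordforms, lemmas, index, 4, [2, 3], [2, 3]))
```

Because the offsets are computed from the bytes actually read, an index built this way stays valid for the copy of the file it was built from, whatever newline convention that copy uses. The files must be opened in binary mode for the offsets to be meaningful.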
dutch/dab/dab.cd        69399 bytes
dutch/dct/dct.cd      4107513 bytes
dutch/dfl/dfl.cd      3574188 bytes
dutch/dfs/dfs.cd       366605 bytes
dutch/dfw/dfw.cd     12527494 bytes
dutch/dml/dml.cd     10433870 bytes
dutch/dmw/dmw.cd     11159555 bytes
dutch/dol/dol.cd     10229951 bytes
dutch/dow/dow.cd     19185443 bytes
dutch/dpl/dpl.cd     16300823 bytes
dutch/dpw/dpw.cd     28837473 bytes
dutch/dsl/dsl.cd      4092534 bytes
english/ect/ect.cd    5861603 bytes
english/efl/efl.cd    2143469 bytes
english/efs/efs.cd     263859 bytes
english/efw/efw.cd    7158616 bytes
english/eml/eml.cd    4918610 bytes
english/emw/emw.cd    4936327 bytes
english/eol/eol.cd    2068262 bytes
english/eow/eow.cd    7351883 bytes
english/epl/epl.cd    5480381 bytes
english/epw/epw.cd   15567897 bytes
english/esl/esl.cd    5560364 bytes
german/gct/gct.cd     7481411 bytes
german/gfl/gfl.cd     2155299 bytes
german/gfw/gfw.cd    17283964 bytes
german/gml/gml.cd     4941770 bytes
german/gmw/gmw.cd    12342460 bytes
german/gol/gol.cd     3095599 bytes
german/gow/gow.cd    15932928 bytes
german/gpl/gpl.cd     6669899 bytes
german/gpw/gpw.cd    28982785 bytes
german/gsl/gsl.cd     2537557 bytes
                    --------- +
                    283619791 bytes (283,6 meg)
Project Manager CELEX 1985-1992: H. Kerkman
Project Manager CELEX 1993- : R. Piepenbrock
Original Programming Concept : Dr. R. H. Baayen
1995 Extensions & Revisions : R. Piepenbrock & L. Gulikers,
aided and abetted by H. Drexler
Cover Design : I. Doehring
CELEX Board:
Dr. W. J. M. Levelt (Max Planck Institute for Psycholinguistics, chair)
Dr. R. Collier (Institute for Perception Research)
Dr. R. Schreuder (Nijmegen University)
Dr. P. van Sterkenburg (Institute for Dutch Lexicology)
Many people have been involved in the CELEX project through the years. The list given below is an attempt to give an exhaustive overview of their respective contributions, although it must be added that dividing lines between these disciplines have by no means been strict.
Database and user interface design:
Eric Willems
Eddy Bronkhorst
Marcel van der Peijl
Maurice van Hinsberg
Vincent Karthaus
Hans Kerkman
Domien Kusters
Hans Drexler
System and database management:
Christa Hausmann-Jamin
Marcel Bingley
Ger Cox
Cees van der Veer
Richard Piepenbrock
The Dutch lexicon:
Hans Kerkman
Domien Kusters
Veronique Remmelts-Oudkerk
Ton van der Wouden
Gilbert Rattink
Wim Peters
The English lexicon:
Francoise Keulen
Domien Kusters
Gavin Burnage
Richard Piepenbrock
Caroline den Os
The German lexicon:
Gilbert Rattink
Leon Gulikers
Richard Piepenbrock
An overview of the sources used to compile the CELEX database, specified for each language.
Note that there is a considerable overlap between the two dictionary sources (of approx. 45,000 lemmata). Other lemmata were added to enable morphological decomposition of the basic set of lemmata.
When compared with the INL text corpus, the coverage of CELEX-lemmata is 95% of the total corpus. This figure is fairly skewed, because in order to reduce the bulk of the INL corpus type list, all hapax legomena (which included many OCR (scanning) errors) were omitted and therefore cannot be retrieved from the corpus type list. The current corpus type list comprises 211,389 entries, as opposed to the original 321,000 inclusive of the hapax legomena.
The INL corpus, in the version derived for calculating the CELEX frequencies, consists of 930 entire fiction and non-fiction books (approx. 30% fiction, 70% non-fiction) published between 1970 and 1988. Newspapers, magazines, children's books, textbooks and specialist literature do not feature in the collection. The CELEX version is static, although the INL corpus continues to be expanded.
(See S. Hazenberg (1994). Een Keur van Woorden. VU Amsterdam. (p. 36 ff.) and J.G. Kruyt (1995). 'Nationale Tekstcorpora in Internationaal Perspectief'. Forum der Letteren, 36-1, 47-58).
Note that there is a considerable overlap between the two dictionary sources (of approx. 30,000 lemmata). Other lemmata were added to enable morphological decomposition of the basic set of lemmata.
No extra lemmata have as yet been added from a text corpus. Nevertheless, when compared with the 17.9 million word corpus of Birmingham University/COBUILD, the coverage of CELEX-lemmata is 92% of the total corpus.
The 17.9 million token COBUILD/Birmingham corpus, on which the CELEX frequencies have been based, contains some American English texts. These are exclusively written texts (from a total of 16.6m tokens), and not spoken ones (from a total of 1.3m). There are 284 written texts in all, 44 of which are of indisputably American origin. These American texts make up 15.4% of the total written corpus. Other texts are by authors originating from countries with other English dialects, or are difficult to classify, such as Alistair Cooke.
Paradoxically, even when a text sample is of American origin, the spelling has nearly always been adapted to a British English standard, as the corpus was compiled in England where they had to make do with British English editions of these texts. Thus Irwin Shaw's 'Rich Man Poor Man' appears in a British English edition. In this respect, we were constrained by the limitations of the sources as offered to us.
As for the wordform and lemma-based frequencies, we have not taken the sources into account, as the disambiguation of highly frequent words would necessitate lengthy, labour-intensive rounds of coding. Therefore, we have taken random samples of these types, disambiguated these, and extrapolated the results for the total type frequency. This means we have lost the connection to distinct textual categories like 'newspaper' or 'American English'.
When a type is not ambiguous, however, you can go to the CD-ROM directory /english/ect, where the fifth field (FreqWA) of the file ect.cd gives you the exact written American frequency (the ect.cd file contains raw string frequencies only).
For the ambiguous types, it might be the case that the division BrE-AmE gives you some idea of the relative frequencies for each wordform. In most cases, clearly, this cannot be done with anything approaching certainty.
For a detailed account of the text samples making up the COBUILD/Birmingham corpus, see
J.M. Sinclair (Ed.) (1987). Looking Up. London/Glasgow: Collins/COBUILD. (The corpus used by CELEX is an extended version of the Reserve Corpus described on p. 10 ff.)
As all sources were genuine computer data rather than electronic versions (or typesetting tapes) of paper dictionaries as was the case with the sources for the other languages, and all contained a variety of flections, stems and lemmata, no figures can be given as to what lemmata derive from exactly which source. In most cases, inflectional data from one tape were merged with stem information from another.
Other lemmata were added to enable morphological decomposition of the basic set of lemmata. No extra lemmata have as yet been added from a text corpus. When compared with the 6 million word corpus of the Institute for German Language at Mannheim, the coverage of CELEX lemmata is 83% of the total corpus.
The corpus used by CELEX for deriving the German (as yet undisambiguated) lemma and wordform frequencies consists of 5.4 million German tokens from written texts like newspapers, fiction and non-fiction, and 600,000 tokens of transcribed speech. The former is a combination of the 'Mannheimer Korpus I', 'Mannheimer Korpus II' and the 'Bonner Zeitungskorpus 1', while the latter is known as the 'Freiburger Korpus'. All of these can also be consulted on request by remote login to the Institut fuer Deutsche Sprache in Mannheim through the COSMAS interface. The corpus is relatively ill-balanced, especially in view of its small size, since novels like Grass' 'Die Blechtrommel' and Boell's 'Ansichten eines Clowns' are included in their entirety. All texts in the corpus were published or recorded between 1949 and 1975.
When starting to use the English database, the user first has to choose between three so-called `lexicon types':
For all types of lexicons, the user may subsequently select any number of columns -- from a total of approximately 950(!) database columns -- combining information on the orthography, phonology, morphology, syntax and frequency of the entries. The information sheet `Lexical Data, English' summarizes the types of information available. An exhaustive overview of the columns available is given in the CELEX User Guide.
LEXICAL DATA, ENGLISH
The lexical data that can be selected for each entry in the different English lexicon types can be divided into five categories: orthography, phonology, morphology, syntax and frequency. In a separate section, example data are given for each of these categories.
EXAMPLE DATA, ENGLISH
An arbitrary query using a small English lemma lexicon (that is, one with very few columns) might yield the following result:
-----------------------------------------------------------
Headword    Pronunciation    Morphology:         M: Cl Freq
                             Structure           Cl
----------- ---------------- ------------------- -- -- ----
celebrant   "sE-lI-br@nt     ((celebrate),(ant)) Vx N     6
celebration %sE-lI-"breI-Sn, ((celebrate),(ion)) Vx N   201
cell        "sEl             (cell)              N  N  1210
cellar      "sE-l@r*         (cellar)            N  N   228
cellarage   "sE-l@-rIdZ      ((cellar),(age))    Nx N     0
cellist     "tSE-lIst        ((cello),(ist))     Nx N     5
cello       "tSE-l@U         (cello)             N  N    25
cellular    "sEl-jU-l@r*     ((cell),(ular))     Nx A    21
celluloid   "sEl-jU-lOId     ((cellulose),(oid)) Nx N    29
-----------------------------------------------------------

An example selection from a small English wordform lexicon, showing the inflectional variants of the headwords given in the previous example, is presented in the next table:
-----------------------------------------------------------
Word         Word division   Pronunciation     Cl Type Freq
------------ --------------- ----------------- -- ---- ----
celebrant    cel-e-brant     "sE-lI-br@nt      N  sing    2
celebrants   cel-e-brants    "sE-lI-br@nts     N  plu     4
celebration  cel-e-bra-tion  %sE-lI-"breI-Sn,  N  sing  144
celebrations cel-e-bra-tions %sE-lI-"breI-Sn,z N  plu    57
cell         cell            "sEl              N  sing  655
cells        cells           "sElz             N  plu   555
cellar       cel-lar         "sE-l@r*          N  sing  187
cellars      cel-lars        "sE-l@z           N  plu    41
cellarage    cel-lar-age     "sE-l@-rIdZ       N  sing    0
cellarages   cel-lar-ag-es   "sE-l@-rI-dZIz    N  plu     0
cellist      cel-list        "tSE-lIst         N  sing    5
cellists     cel-lists       "tSE-lIsts        N  plu     0
cello        cel-lo          "tSE-l@U          N  sing   24
cellos       cel-los         "tSE-l@Uz         N  plu     1
cellular     cel-lu-lar      "sEl-jU-l@r*      A  pos    21
celluloid    cel-lu-loid     "sEl-jU-lOId      N  sing   29
-----------------------------------------------------------
Dutch Lexicology (INL) (42,380,000 million words in all): --> Dutch Lexicology (INL) (42,380,000 words in all):
The corpus used by CELEX for deriving the German (as yetundisambiguated) --> The corpus used by CELEX for deriving the German (as yet undisambiguated)
fpin3 = fopen(filename,"r+b"); --> fpin3 = fopen(filename,"rb");
This is a non-trivial error, which causes the joinwl-program to exit without accomplishing its task!
Mistakenly, the function assumes the corresponding '*.idx' index files to be writable, which of course they are not on a non-writable CD. This erroneously results in an error message that the file does not exist. As the '*.idx'-file needs to be only readable, the above patch-up tests just whether it is a readable binary.
Please rewrite the program as shown above and re-compile it.