The CD-ROM Version of the CELEX Lexical Database Second Release, August 1995 ============================================================================ ============================================================================ Table of Contents 1. CELEX 2. The Databases 3. 1995 Extensions & Revisions 4. The User Guide 5. Copyright 6. Directory Structure 7. On Using AWK to Process the Lexical Files 8. File Sizes 9. Organization 10. Original Sources of the CELEX Database ============================================================================ ============================================================================ 1. CELEX CELEX is the Dutch Centre for Lexical Information. It was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. Over the years it has been funded mainly by the Netherlands Organization for Scientific Research (NWO) and the Dutch Ministry of Science and Education. CELEX is now part of the Max Planck Institute for Psycholinguistics. ============================================================================ ============================================================================ 2. The Databases This CD-ROM contains plain ASCII versions of the CELEX lexical databases of English (version 2.5), Dutch (version 3.1) and German (version 2.5). The original CELEX databases can be consulted interactively either by using the SQL*PLUS query language within an ORACLE RDBMS environment, or by means of the specially designed user interface FLEX. As the FLEX interface has been written to communicate with the underlying UNIX operating system and the ORACLE software, it is completely bound to this particular configuration and hence cannot be distributed separately and does not feature on the CD-ROM. To make for greater compatibility with other operating systems, the databases on the CD-ROM have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files that can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files. As in the original databases, some kinds of information have to be computed on-line. Wherever necessary, AWK scripts have been provided to recover this information. Also, some C-programs have been included, along with their MS-DOS executables and HP-UX (Hewlett-Packard UNIX) binaries. README files specify the details of their use. ============================================================================= ============================================================================= 3. 1995 Extensions & Revisions - About 1000 new German lemma entries. - Revised German morphological parses, verb argument structures, and inflectional paradigm codes. - Inclusion of a German Corpus Type Lexicon. - Frequency counts included in all lemma and wordform files for ease of use. - Preferred spelling variants always given as main entries. - Syllable frequency counts for Dutch and English. - Full awk scripts supplied instead of isolated functions. - Provision of efficient, index-based C-program for joining wordform and lemma files. - Complete version of the German Linguistic Guide. - User Guide files in both European A4 and American Letter PostScript format. ============================================================================= ============================================================================= 4. The User Guide The CELEX User Guide describes in detail the kinds of information in the databases using unique labels for each column in the RDB. For instance, the syllabified phonological headword information for English lemmas, with stress markers, in the CPA character set, is referred to as "PhonStrsCPA". The README files describing the columns of the lexical files on this CD-ROM use the same labels. In order to facilitate the use of this CD-ROM, the relevant sections in the User Guide on English, German and Dutch have been made available as PostScript files in both European A4 and American Letter format. Users preferring a bound hardcopy of the CELEX User Guide (Dfl 115,--) should send a request by electronic mail to celex@mpi.nl (INTERNET) or by surface mail to Richard Piepenbrock (CELEX project manager) Max Planck Institute for Psycholinguistics P.O. Box 310 NL-6500 AH Nijmegen THE NETHERLANDS P.S. For additional information (also on the on-line database version) and the latest news on updates and the like, you are invited to consult the CELEX homepage on the World Wide Web at http://www.kun.nl/celex ============================================================================ ============================================================================ 5. Copyright Copyright Centre for Lexical Information LICENCE: The copyright holder grants to the purchaser of this CD-ROM unrestricted license to use all the lexical information included herein for research purposes only, subject to the following restrictions: 1. No onward distribution of the lexical data is allowed -- copies may be made only for use by the purchaser and her/his research group, for ease of use by that group, etc.; 2. The contribution of CELEX is acknowledged in any public presentation or publication of any work based on the lexicons. (*) CELEX carries no warranty of any kind. (*) This CD-ROM should be referred to as: R. H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995. ============================================================================ ============================================================================ 6. Directory Structure root: README (the README file you are reading now) english (directory containing the English database) german (directory containing the German database) dutch (directory containing the Dutch database) awk (directory containing awk scripts for joining columns of different files) intro_a4.ps (PostScript file with the introduction to CELEX of the CELEX User Guide, in European A4-format) intro_let.ps (PostScript file with the introduction to CELEX of the CELEX User Guide, in American Letter format) english: README eol (directory, English Orthography, Lemmas) epl (directory, English Phonology, Lemmas) eml (directory, English Morphology, Lemmas) esl (directory, English Syntax, Lemmas) efl (directory, English Frequency, Lemmas) eow (directory, English Orthography, Wordforms) epw (directory, English Phonology, Wordforms) emw (directory, English Morphology, Wordforms) efw (directory, English Frequency, Wordforms) ect (directory, English Corpus Types) efs (directory, English Frequency, Syllables) eug_a4.ps (PostScript file of the CELEX User Guide on English, in European A4-format) eug_let.ps (PostScript file of the CELEX User Guide on English, in American Letter format) german: README gol (directory, German Orthography, Lemmas) gpl (directory, German Phonology, Lemmas) gml (directory, German Morphology, Lemmas) gsl (directory, German Syntax, Lemmas) gfl (directory, German Frequency, Lemmas) gow (directory, German Orthography, Wordforms) gpw (directory, German Phonology, Wordforms) gmw (directory, German Morphology, Wordforms) gfw (directory, German Frequency, Wordforms) gct (directory, German Corpus Types) gug_a4.ps (PostScript file of the CELEX User Guide on German, in European A4-format) gug_let.ps (PostScript file of the CELEX User Guide on German, in American Letter format) dutch: README dol (directory, Dutch Orthography, Lemmas) dpl (directory, Dutch Phonology, Lemmas) dml (directory, Dutch Morphology, Lemmas) dsl (directory, Dutch Syntax, Lemmas) dfl (directory, Dutch Frequency, Lemmas) dow (directory, Dutch Orthography, Wordforms) dpw (directory, Dutch Phonology, Wordforms) dmw (directory, Dutch Morphology, Wordforms) dfw (directory, Dutch Frequency, Wordforms) dct (directory, Dutch Corpus Types) dab (directory, Dutch Abbreviations) dfs (directory, Dutch Frequency, Syllables) dug_a4.ps (PostScript file of the CELEX User Guide on Dutch, in European A4-format) dug_let.ps (PostScript file of the CELEX User Guide on Dutch, in American Letter format) Each of these directories has the following structure: README (a help file) [egd][opmsf..][lw..].cd (the database file) awk (directory containing AWK scripts) c (directory with a C program, if present) ============================================================================ ============================================================================ 7. On Using AWK to Process the Lexical Files a. File Structure All lexicon files are characterized by the following properties: (1) The field separator is the backslash ("\"). (2) The record separator is the newline character. Due to the different implementation of this character on various platforms (linefeed and/or carriage return) records may run off the screen when displayed using commands like 'type' or 'cat'. Reading (parts of) files into a standard editor or retrieving them with the AWK 'print' statement will solve this problem. (2) The first field contains the unique identity number IdNum, which can be used to join columns of different lexicons. (3) Each file contains the standard spelling of all its entries in order to maintain immediate interpretability. (4) Note that fields may be empty (for instance, monomorphemic words have empty fields for the morphological constituency information). (5) Variants (spellings, pronunciation, alternative parsings) of the same lexical entry are always listed on a single line, fixed column positions specifying the alternatives. Where this convention could not be followed, separate files (described in the accompanying README files) with the additional variants have been added. b. Tools for Processing the Files Columns can be selected from the lexical files using tools such as AWK (A.V.Aho, B.W.Kernigan & P.J.Weinberger, The AWK Programming Language, New York: Addison-Wesley, 1988) or ICON (R.E.Griswold & M.T.Griswold, The Icon Programming Language, Englewoods Cliffs, New Jersey: Prentice-Hall, 1990). As all AWK scripts on the CD have been interpreted and tested using GAWK (GNU AWK) version 2.15, patchlevel 5 on UNIX, and GAWK (GNU AWK) version 2.15, patchlevel 6 (16 bit version) on MS-DOS, we recommend using this freely available version of AWK to recover the lexical data (although it should be stated that we have no particular interest in promoting GAWK or singing the praises of its authors!). GAWK is available in a number of versions from various ftp-sites: - OS/2 32-bit version ftp-os2.cdrom.com:pub/os2/32bit/unix/gnuawk.zip - OS/2 16-bit version ftp-os2.cdrom.com:pub/os2/16bit/unix/gawk2156.zip - DOS 16-bit, and OS/2 and DOS 32-bit version (gawk.exe, gawk-emx.exe) oak.oakland.edu:SimTel/msdos/awk/ with executables in gawk2156.zip and sources in gawk215t.zip - The original GNU sources prep.ai.mit.edu:pub/gnu/gawk-2.15.6.tar.gz Alternative sites of interest to European users are ftp.tu-ilmenau.de and ftp.uni-tuebingen.de (both in Germany), both under directory 'pub/gnu', and src.doc.ic.ac.uk (SunSITE London), under directory 'gnu'. Asian users may refer to ftp.uec.ac.jp/pub/wwfs/GNU in Tokyo, Japan. The complete manual (200 pages) for GNU awk is available with the sources. A printed manual is available from the Free Software Foundation. You can find information on GNU's manuals, disks, and the GNU project in the GNU's Bulletin, available on the newsgroup gnu.announce, or by sending a self-addressed stamped envelope ($0.52) to Free Software Foundation 675 Massachusetts Avenue Cambridge, MA 02139 c. Computing the Representation of Missing Fields Many lexicon files on this CD-ROM do not contain all the columns specified in the CELEX User Guide. However, all missing fields can be recovered from the CD-ROM files using the AWK scripts in the awk directories, as specified in the corresponding README files. (We have opted for AWK rather than ICON in view of the greater availability of AWK as a standard tool of the UNIX operating system.) The use of AWK scripts rather than providing a full listing of all lexical information is motivated by the following considerations: a. The original CELEX databases do not store all columns either, and similarly require on-line computation of lexical fields. b. The lexical files on CD-ROM never exceed 28 megabytes in size. If so required, any file can be downloaded from the CD-ROM for efficient local processing. If all columns had been fully spelled out, the files would have been prohibitively large. c. The lexical files on CD-ROM have a transparent column structure that can be inspected by UNIX utilities such as head, more, or less, and even with a text editor such as vi or emacs, mostly without loss of transparency due to excessive line overflow. d. Accessing a CD-ROM is relatively time-consuming, suggesting that local computation of derived information is to be preferred. Whenever on-line derivation of missing fields is required, users can run the ready-made AWK scripts (*) in the awk subdirectory of the corresponding lemma or wordform files. This is done as follows: awk -f scriptname.awk LexiconFile LexField The LexField indicates the field in the lexicon file from which the additional representation should be computed, e.g. awk -f sortstr.awk eol.cd 2 to process the alphabetically sorted representation of the Headword spelling (HeadLowSort) from the HeadDia field (2). In the case of the phonological representation and the syntactic codes, extra arguments are required, which is specified in the READMEs of their respective subdirectories. Generally speaking, the scripts only yield the missing representation of one particular field or column, e.g. only the HeadLowSort and no other fields. This can be easily modified, however, by copying the script to your hard disk and adapting the 'printf' statement to include more fields, for example: printf("%s\n",LexInfo_1); => printf("%s\\%s\\%s\\%s\n",$1,$2,$3,LexInfo_1); to retrieve fields 1, 2 and 3 as well, each separated by backslashes (LexInfo_1 always refers to the LexField supplied on the command line, as the field that should be converted rather than just retrieved. (*) The first release of this CD-ROM contained only isolated AWK-functions without 'BEGIN'-segments or print-statements. The original functions still feature in the body of the current AWK-scripts, however, so they can be easily extracted and modified or combined to perform user-defined operations not 'pre-cooked' by us. d. Joining Files of Similar Lexical Status It is often necessary to combine information in files located in different directories. To join columns of different files, the unique identity numbers (IdNum) listed in the first field of every lexical file can serve as linking keys. PLEASE NOTE THAT NO OTHER FIELDS SHOULD BE USED! Homographs, homophones, and other spuriously identical fields may wreak havoc with your intended personal lexicon when other join keys are used. Also note that the wordform lexicons have TWO unique identity numbers, one providing a unique wordform description, to be used for linking wordform lexicons, the other providing a link to the lemma lexicons, allowing one to link wordforms with the lexical information on their corresponding lemmas. An AWK script specifically designed to join columns of two (large) input files OF SIMILAR LEXICAL STATUS (two lemma files or two wordform files) can be found in the awk subdirectory of the present directory. The join.awk script is called as follows: awk -f join.awk file1 file2 LexField_file1 LexField_file2 > your_result File1 and file2 denote the two input files. LexField_file1 and LexField_file2 specify the field numbers of the lexical information in file1 and file2 that is to be combined in the output file "your_result". This file has the format IdNum \\ LexInfo_1 TAB LexInfo_2 This format allows new information from a third file to be joined with the file "your_result": awk -f join.awk your_result file3 2 LexField_file3 > your_result2 will result in a file with the fields IdNum \\ LexInfo_1 TAB LexInfo_2 TAB LexInfo_3 In this way information from several lexicons can be combined by successive application of join.awk. For example, if the Dutch PhonStrsDISC column (field 4 in dpl.cd) is to be combined with the Immediate morphological segmentation Imm (field 9 of dml.cd) and the word category ClassNum (field 4 of dsl.cd), the join.awk script should then be called as follows: awk -f join.awk dpl.cd dml.cd 4 9 > tmp awk -f join.awk tmp dsl.cd 2 4 > "your_result" rm tmp The join.awk script presupposes that both input files are sorted by their IdNum, with the IdNum being the first field, as is the case for all lexical files. d. Joining Files of Dissimilar Lexical Status Special care is required for joining a wordform lexicon with a lemma lexicon. The lemma unique identity numbers in the wordform lexicon (IdNumLemma) are NOT sorted numerically (these files are sorted by the wordform IdNum). Hence join.awk cannot be used. In order to join a wordform lexicon with a lemma lexicon, a C-program is included in the c subdirectory of the present directory. Using an AWK script for this purpose was found to be unfeasible, as this caused the system to run out of memory even when run on a UNIX machine. The program, joinwl.c, makes use of an index file recording the initial byte position of every single line in the lemma file, thus speeding up the retrieval process. It comes in TWO VERSIONS, joinwlle.c for so-called Little-Endian machines, and joinwlbe.c for Big-Endian machines. The Endian status of the computer depends on the way the processor interprets multi-byte data types, which (unfortunately!) is important as the byte offsets in the index files are given in four-byte values. You have to determine the Endian status of your machine BEFORE applying either the joinwlle.c or the joinwlbe.c program by running the endian.c program provided in the same directory. The index files, with the same name as the lemma lexicon and the extension 'idx', such as 'dol.idx' for Dutch orthography, are provided alongside the lemma lexicons. The program is called as follows: joinwl(le|be) wordformfile lemmafile "wf_field[|wf_field]" "lm_field[|lm_field]" The wordformfile and the lemmafile here denote the two input files. The third argument should contain the numbers of one or more fields to be retrieved from the wordform file, while the fourth argument specifies the field numbers from the lemma file. For each argument, field numbers should be enclosed in double quotes, and separated by vertical bars. The result is written to an output file in your current directory with the name of the wordform file and the extension 'out'. Note that specification of the double quotes is obligatory, and that the wordform lexicon should be specified BEFORE the lemma lexicon. Thus, if the WordSylDia information of the Dutch wordform lexicon (field 9) and its frequency (field 3) should be combined with the HeadSylDia information of the lemma lexicon (8) and its respective frequency count (3), the program should be called as follows: joinwl(le|be) dow.cd dol.cd "9|3" "8|3" The output of this program is automatically written to the file 'dow.out' in your working directory. If combining the two files should fail due to platform-specific interpretation of the initial byte positions (on account of different newline characters e.g), or if you want to combine self-created files derived from the original files, you can generate an index yourself with the 'make_idx' program in the same directory. This should be run as follows: make_idx lemmafile 1 The '1' here denotes the field number of the unique IdNum key, which is specified as the first field by default. ============================================================================ ============================================================================ 8. File Sizes dutch/dab/dab.cd 69399 bytes dutch/dct/dct.cd 4107513 bytes dutch/dfl/dfl.cd 3574188 bytes dutch/dfs/dfs.cd 366605 bytes dutch/dfw/dfw.cd 12527494 bytes dutch/dml/dml.cd 10433870 bytes dutch/dmw/dmw.cd 11159555 bytes dutch/dol/dol.cd 10229951 bytes dutch/dow/dow.cd 19185443 bytes dutch/dpl/dpl.cd 16300823 bytes dutch/dpw/dpw.cd 28837473 bytes dutch/dsl/dsl.cd 4092534 bytes english/ect/ect.cd 5861603 bytes english/efl/efl.cd 2143469 bytes english/efs/efs.cd 263859 bytes english/efw/efw.cd 7158616 bytes english/eml/eml.cd 4918610 bytes english/emw/emw.cd 4936327 bytes english/eol/eol.cd 2068262 bytes english/eow/eow.cd 7351883 bytes english/epl/epl.cd 5480381 bytes english/epw/epw.cd 15567897 bytes english/esl/esl.cd 5560364 bytes german/gct/gct.cd 7481411 bytes german/gfl/gfl.cd 2155299 bytes german/gfw/gfw.cd 17283964 bytes german/gml/gml.cd 4941770 bytes german/gmw/gmw.cd 12342460 bytes german/gol/gol.cd 3095599 bytes german/gow/gow.cd 15932928 bytes german/gpl/gpl.cd 6669899 bytes german/gpw/gpw.cd 28982785 bytes german/gsl/gsl.cd 2537557 bytes -------------- + 283619791 bytes (283,6 meg) ============================================================================ ============================================================================ 9. Organization Project Manager CELEX 1985-1992: H. Kerkman Project Manager CELEX 1993- : R. Piepenbrock Original Programming Concept : Dr. R. H. Baayen 1995 Extensions & Revisions : R. Piepenbrock & L. Gulikers, aided and abetted by H. Drexler Cover Design : I. Doehring CELEX Board: Dr. W. J. M. Levelt (Max Planck Institute for Psycholinguistics, chair) Dr. R. Collier (Institute for Perception Research) Dr. R. Schreuder (Nijmegen University) Dr. P. van Sterkenburg (Institute for Dutch Lexicology) ============================================================================ ============================================================================ 10. Original Sources of the CELEX Database An overview of the sources used to compile the CELEX database, specified for each language. DUTCH ----- - Van Dale's Comprehensive Dictionary of Contemporary Dutch (1984): 80,000 lemmata - Word List of the Dutch Language ('Groene Boekje') (1954), revised version: 65,000 lemmata - The most frequent lemmata from the text corpus of the Institute for Dutch Lexicology (INL) (42,380,000 million words in all): 15,000 lemmata Note that there is a considerable overlap between the two dictionary sources (of approx. 45,000 lemmata). Other lemmata were added to enable morphological decomposition of the basic set of lemmata. When compared with the INL text corpus, the coverage of CELEX-lemmata is 95% of the total corpus. This figure is fairly skewed, because in order to reduce the bulk of the the INL corpus type list, all hapax legomena (which included many OCR (scanning) errors) were omitted and therefore cannot be retrieved from the corpus type list. The current corpus type list comprises 211,389 entries, as opposed to the original 321,000 inclusive of the hapax legomena. The INL corpus, in the version derived for calculating the CELEX frequencies, consists of 930 entire fiction and non-fiction books (approx. 30% fiction, 70% non-fiction) published between 1970 and 1988. Newspapers, magazines, children's books, textbooks and specialist literature do not feature in the collection. The CELEX version is static, although the INL corpus continues to be expanded. (See S. Hazenberg (1994). Een Keur van Woorden. VU Amsterdam. (p. 36 ff.) and J.G. Kruyt (1995). 'Nationale Tekstcorpora in Internationaal Perspectief'. Forum der Letteren, 36-1, 47-58). ENGLISH ------- - Oxford Advanced Learner's Dictionary (1974): 41,000 lemmata - Longman Dictionary of Contemporary English (1978): 53,000 lemmata Note that there is a considerable overlap between the two dictionary sources (of approx. 30,000 lemmata). Other lemmata were added to enable morphological decomposition of the basic set of lemmata. No extra lemmata have as yet been added from a text corpus. Nevertheless, when compared with the 17.9 million word corpus of Birmingham University/COBUILD, the coverage of CELEX-lemmata is 92% of the total corpus. The 17.9 million token COBUILD/Birmingham corpus, on which the CELEX frequencies have been based, contains some American English texts. These are exclusively written texts (from a total of 16.6m tokens), and not spoken ones (from a total of 1.3m). There are 284 written texts in all, 44 of which are of indisputably American origin. These American texts make up 15.4% of the total written corpus. Other texts are by authors originating from countries with other English dialects, or are difficult to classify, such as Alistair Cooke. Paradoxically, even when a text sample is of American origin, the spelling has nearly always been adapted to a British English standard, as the corpus was compiled in England where they had to make do with British English editions of these texts. Thus Irwin Shaw's 'Rich Man Poor Man' appears in a British English edition. In this respect, we were constrained by the limitations of the sources as offered to us. As for the wordform and lemma-based frequencies, we have not taken the sources into account, as the disambiguation of highly frequent words would necessitate lengthy, labour-intensive rounds of coding. Therefore, we have taken random samples of these types, disambiguated these, and extrapolated the results for the total type frequency. This means we have lost the connection to distinct textual categories like 'newspaper' or 'American English'. When a type is not ambiguous, however, you can go to the CD-ROM directory /english/ect, where the fifth field (FreqWA) of the file ect.cd gives you the exact written American frequency (the ect.cd file contains raw string frequencies only). For the ambiguous types, it might be the case that the division BrE-AmE gives you some idea of the relative frequencies for each wordform. In most cases, clearly, this cannot be done with anything approaching certainty. For a detailed account of the text samples making up the COBUILD/Birmingham corpus, see J.M. Sinclair (Ed.) (1987). Looking Up. London/Glasgow: Collins/COBUILD. The corpus used by CELEX is an extended version of the Reserve Corpus described on p. 10 ff. GERMAN ------ - Bonnlex, tapes supplied by the Institute for Communication Research and Phonetics in Bonn - Molex, tapes supplied by the Institute for German Language in Mannheim - Noetic Circle Services (MIT) German spelling lexicon As all sources were genuine computer data rather than electronic versions (or typesetting tapes) of paper dictionaries as was the case with the sources for the other languages, and all contained a variety of flections, stems and lemmata, no figures can be given as to what lemmata derive from exactly which source. In most cases, inflectional data from one tape were merged with stem information from another. Other lemmata were added to enable morphological decomposition of the basic set of lemmata. No extra lemmata have as yet been added from a text corpus. When compared with the 6 million word corpus of the Institute for German Language at Mannheim, the coverage of CELEX lemmata is 83% of the total corpus. The corpus used by CELEX for deriving the German (as yetundisambiguated) lemma and wordform frequencies consists of 5.4 million German tokens from written texts like newspapers, fiction and non-fiction, and 600,000 tokens of transcribed speech. The former is a combination of the 'Mannheimer Korpus I', 'Mannheimer Korpus II' and the 'Bonner Zeitungskorpus 1', while the latter is known as the 'Freiburger Korpus'. All of these can also be consulted on request by remote login to the Institut fuer Deutsche Sprache in Mannheim through the COSMAS interface. The corpus is relatively ill-balanced, especially in view of its small size, since novels like Grass' 'Die Blechtrommel' and Boell's 'Ansichten eines Clowns' are included in their entirety. All texts in the corpus were published or recorded between 1949 - 1975. ============================================================================ ============================================================================ UNIX is a trademark of AT&T Bell Laboratories HP-UX is a trademark of Hewlett-Packard Company PostScript is a trademark of Adobe Systems Incorporated Oracle and SQL*PLUS are trademarks of Oracle Corporation