CELEX2

Item Name:	CELEX2
Author(s):	R H. Baayen, R Piepenbrock, L Gulikers
LDC Catalog No.:	LDC96L14
ISBN:	1-58563-085-3
ISLRN:	204-698-863-053-1
DOI:	https://doi.org/10.35111/gs6s-gm48
Member Year(s):	1995, 1996
DCMI Type(s):	Text
Data Source(s):	dictionaries
Project(s):	GALE, TIDES
Application(s):	parsing, pronunciation modeling, speech synthesis
Language(s):	English, German, Dutch
Language ID(s):	eng, deu, nld
License(s):	CELEX Agreement
Online Documentation:	LDC96L14 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Baayen, R H., R Piepenbrock, and L Gulikers. CELEX2 LDC96L14. Web Download. Philadelphia: Linguistic Data Consortium, 1995.
Related Works:	hasOutcome: LDC97L18 CALLHOME German Lexicon; LDC2026L04 CALLHOME German Lexicon Second Edition

Introduction

CELEX2 contains updated versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.0) developed by the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven.

For each language, this data set contains detailed information on:

orthography (variations in spelling, hyphenation)
phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress)
morphology (derivational and compositional structure, inflectional paradigms)
syntax (word class, word class-specific subcategorizations, argument structures)
word frequency (summed word and lemma counts, based on recent and representative text corpora)

The databases were not tailored to fit any particular database management program. They are presented in ASCII files in a UNIX directory tree that can be queried with tools such as AWK or ICON. Unique identity numbers allow the linking of information from different files. Some information must be computed online; where necessary, AWK functions are provided to recover this information. README files specify the details of their use.
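As a rough illustration of how the identity numbers can link information across files, the following Python sketch joins word forms to their lemmas. The backslash field separator, the column order, and the sample rows are assumptions made for this example only; the README files and the User Guide give the actual layout of each file.

```python
from collections import defaultdict

# Assumed field separator; check the README files for the real layout.
SEP = "\\"

# Invented sample rows for illustration only, not real CELEX data.
# Hypothetical lemma file: lemma ID, headword.
LEMMA_ROWS = [
    "1\\walk",
    "2\\house",
]
# Hypothetical word form file: word form ID, word form, lemma ID.
WORDFORM_ROWS = [
    "10\\walks\\1",
    "11\\walked\\1",
    "12\\houses\\2",
]


def parse_rows(rows):
    """Split each row into its fields on the assumed separator."""
    return [row.rstrip("\n").split(SEP) for row in rows]


def link_wordforms_to_lemmas(lemma_rows, wordform_rows):
    """Group word forms under their headword by matching lemma IDs."""
    lemmas = {fields[0]: fields[1] for fields in parse_rows(lemma_rows)}
    linked = defaultdict(list)
    for _form_id, form, lemma_id in parse_rows(wordform_rows):
        linked[lemmas[lemma_id]].append(form)
    return dict(linked)


if __name__ == "__main__":
    print(link_wordforms_to_lemmas(LEMMA_ROWS, WORDFORM_ROWS))
```

The same join can be written in AWK by loading one file into an associative array keyed on the identity number and looking up that key while reading the second file.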

A detailed User Guide describing the lexical information available is included in the documentation accompanying this release. All sections of this guide are PostScript files except for some additional ASCII notes on the German lexicon.

Data

This release contains an enhanced, expanded version of the German lexical database (2.5). Approximately 1,000 new lemmas were added for a total of 51,728; their inflected forms number 365,530. Also included are revised morphological parses, verb argument structures, inflectional paradigm codes, and a corpus type lexicon. A complete PostScript version of the German Linguistic Guide is included in the documentation accompanying this release.

Phonetic syllable frequencies were added to the English and Dutch databases, along with frequency information for every lexical feature. No other changes were made to these lexicons.

Complete AWK scripts are provided to compute representations that are described in the CELEX User Guide but are not stored directly in the ASCII lexical data files.

Updates

Petra Steiner has developed a number of scripts to modify and update CELEX2 to a modern format. They are available on her GitHub page. Papers related to these updates are accessible at the following URLs: http://aclweb.org/anthology/W17-7619 and http://www.lrec-conf.org/proceedings/lrec2016/summaries/761.html.
