Global Yoruba Lexical Database v. 1.0

Authors: Yiwola Awoyale
Release Date: Dec 19, 2008
Data Type: lexicon
Data Source(s): dictionaries
Application(s): instruction, language teaching, machine translation, sociolinguistics
Language(s): English, Gullah, Lucumi, Trinidadian, Yoruba
Language ID(s): eng, gul, luq, trf, yor
The Global Yoruba Lexical Database v. 1.0 is a set of related dictionaries providing definitions and translations for over 450,000 words from the Yoruba language and its variants: Standard Yoruba (over 368,000 words), Gullah (over 3,600 words), Lucumí (over 8,000 words) and Trinidadian (over 1,000 words).

Yoruba is a Niger-Congo language (subclassification: Kwa > Yoruboid) spoken natively by nearly 20 million people, the vast majority of them in southwestern Nigeria. There are also approximately half a million Yoruba speakers in Benin, as well as speakers in Togo and Ghana and among the emigrant populations in the United States and the United Kingdom. In addition, roughly two million people in Nigeria speak Yoruba as a second language.

The Yoruba language diaspora is wide, stretching from southwestern Nigeria and Benin westward to the Caribbean and islands along the southeastern United States coast. Yoruba and other African dialects arrived in the Americas and the Caribbean as a consequence of the Atlantic slave trade. Throughout the region, Yoruba dialects blended with each other and with languages like Spanish and French to form a variety of creoles such as Gullah in the United States and Nagô in Brazil. Many of those creoles have become the language of liturgy and music in Cuba, Brazil, Argentina, Trinidad, Jamaica and parts of the United States and Canada. The ultimate goal of this dictionary is to provide coverage for all Yoruba dialects across the globe. For that reason, it will continue to be a work in progress.

The current standard orthography is tone-driven. Yoruba has three tones: a high tone, a middle tone and a low tone. Each syllable in a Yoruba word must have at least one tone, and long vowels may have two tones. While there are no explicit rising or falling tones, combinations of the language's three basic tones may produce the same effect. Grammatically, Yoruba is a Subject-Verb-Object (SVO) language. Verbs typically have only a single syllable and have no infinitive forms and no past or present tense forms; discrete auxiliary words provide information on the verb tense. Nor do Yoruba nouns have plural or singular forms; their number derives from the context in which the word occurs.

The Yoruba dialect continuum consists of over fifteen varieties, with considerable phonological and lexical differences among them and some grammatical ones as well. Peripheral areas of dialectal regions often have some similarities to adjoining dialects. Standard Yoruba is a koine used for education, writing, broadcasting, and contact between speakers of different dialects. It is also called Literary Yoruba, common Yoruba, or simply Yoruba without qualification. Though in large part based on the Ọ̀yọ́ and Ibadan dialects, it incorporates several features from other dialects and has a simplified vowel harmony system and some other features not found in other Yoruba dialects.


This release encompasses the following languages and dialects:

Languages: Yoruba->English
Description: This dictionary of Standard Yoruba contains detailed lexicographic entries, which include the part of speech, the English definition of the Yoruba headword, cross-references, examples in English and the morphemic decomposition of the Yoruba headword.
Number of words: 142,389

Languages: English->Yoruba
Description: This dictionary maps the English headword back to Standard Yoruba and includes the part of speech, Yoruba definition, and morphemic decomposition of the Yoruba word.
Number of words: 226,585

Languages: Gullah->English and Yoruba
Description: Gullah is a creole spoken in the coastal Low Country of South Carolina and Georgia in the United States. Although the language is no longer spoken to a great extent, its words are still commonly used for personal names and nicknames. The dictionary translates from Gullah headwords to English and to Standard Yoruba.
Number of words: 3,636

Languages: Lucumí->Spanish, English and Yoruba
Description: Lucumí is the ritual language of the Santería religion practiced in Cuba. The Lucumí dictionary translates from a Lucumí headword to Cuban Spanish to English to Standard Yoruba. At the time of this publication in 2008, some entries do not have complete translations and only map from Lucumí to Cuban Spanish.
Number of words: 8,075

Languages: Trinidadian->English and Yoruba
Description: Trinidadian is a creole which blends English, French, Spanish and African languages. The Trinidadian dictionary presents those words that have Yoruba roots and maps from the Trinidadian headword to English and Standard Yoruba.
Number of words: 1,187

The dictionaries in this publication are presented in two formats, Toolbox databases and XML. Short for The Field Linguist's Toolbox, Toolbox is a lexicographical database system published by SIL. SIL makes Toolbox freely available for download. In order to use the Toolbox version of the Global Yoruba Lexical Database v. 1.0, Toolbox must first be installed on the user's local computer.

The orthography of the text in the databases conforms to that presented to students in the Nigerian school system. The basic Yoruba alphabet is:

a b d e ẹ f g gb h i j k l m n o ọ p r s ṣ t u w y

The letter gb is a digraph, two letters that combine to form a single phoneme. In written Yoruba, gb functions as a single letter. The Toolbox presentation takes this into account, and the software sorts the words accordingly in all functions. The XML presentation has been sorted according to the above alphabet but is a static, flat file. For that reason, developers creating applications from the XML files will need to take the digraph into account when writing search and reporting functions.

As Yoruba is a tonal language, the written language uses additional diacritic marks to denote tones. The orthography uses three tones:

  • Low: denoted with a grave (`) symbol as in à
  • Mid: plain letter without diacritics
  • High: denoted with an acute (´) symbol as in á
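
For developers working from the XML files, a digraph-aware sort key might look like the following Python sketch. It is not taken from the database or from Toolbox. The function names (strip_tones, yoruba_sort_key) are hypothetical, and the sketch assumes that tone marks are ignored for ordering, that the underdotted letters are encoded with the combining dot below, and that characters outside the alphabet sort after it.

```python
import unicodedata

# Alphabet in the order given in the text; "gb" counts as one letter.
# \u1eb9 = e with dot below, \u1ecd = o with dot below, \u1e63 = s with dot below.
ALPHABET = ["a", "b", "d", "e", "\u1eb9", "f", "g", "gb", "h", "i", "j", "k",
            "l", "m", "n", "o", "\u1ecd", "p", "r", "s", "\u1e63", "t", "u",
            "w", "y"]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}

# Combining grave (low tone) and combining acute (high tone).
TONE_MARKS = {"\u0300", "\u0301"}


def strip_tones(word):
    """Remove tone marks but keep the dot below (assumed ordering policy)."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def yoruba_sort_key(word):
    """Build a sort key that treats the digraph gb as a single letter."""
    text = strip_tones(word.lower())
    key = []
    i = 0
    while i < len(text):
        if text.startswith("gb", i):
            key.append((RANK["gb"], 0))
            i += 2
        else:
            ch = text[i]
            # Characters outside the alphabet rank after every letter.
            key.append((RANK.get(ch, len(ALPHABET)), ord(ch)))
            i += 1
    return tuple(key)


words = ["ha", "gba", "gu"]
print(sorted(words))                       # plain code point order
print(sorted(words, key=yoruba_sort_key))  # g before gb before h
```

With this key, a word beginning with gb sorts after every word beginning with plain g and before words beginning with h, which is the behavior the alphabet above implies. Whether tone should act as a secondary sort criterion is a design choice this sketch leaves out.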

Both the Toolbox and XML presentations encode the text in Unicode UTF-8 using Normalization Form C (NFC). Unicode normalization forms define how base letters and combining marks are composed or decomposed, so that equivalent text has a single consistent representation for software systems. NFC is the form used by most web systems and is the form recommended by the W3C for the web. The Toolbox presentation uses the Arial Unicode MS font for display. The Tahoma and Lucida Grande fonts will also display the Yoruba alphabet under UTF-8 encoding. Since XML only provides information about document structure, fonts are not specified in the XML versions of the dictionaries.
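
Because the data is stored in form C, text coming from other sources (user queries, keyboard input, other corpora) may need to be normalized before it is compared with dictionary entries. The following Python sketch illustrates the idea using the standard unicodedata module. The helper names to_nfc and same_text are hypothetical and are not part of the database or of Toolbox.

```python
import unicodedata


def to_nfc(text):
    """Return the text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def same_text(a, b):
    """Compare two strings after normalizing both to form C."""
    return to_nfc(a) == to_nfc(b)


# The same low-tone letter written two ways:
precomposed = "\u00e0"   # a single code point, a with grave
decomposed = "a\u0300"   # the letter a followed by a combining grave

print(precomposed == decomposed)           # raw comparison of code points
print(same_text(precomposed, decomposed))  # comparison after normalization
```

Normalizing a query once, before lookup, is usually simpler than trying to handle every composed and decomposed variant in the search code itself.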

Displaying non-Western letters: Windows users will need to install and configure their computers for Extended Language Support. To do this, open the Windows Control Panel and click the Regional and Language Options icon. In the Regional and Language Options window that opens, select the Languages pane. Under the Supplemental Language Support section, check both check boxes and click OK. Windows will ask for your install disc and will install the modules needed to properly display complex and non-Western letters. Users who do not have their Windows install disc should contact their local system administrator to install Extended Language Support.


For an example of the data in this database, please review this sample entry (jpg) from the Yoruba-English Lexicon.

