ECI Multilingual Text

Item Name: ECI Multilingual Text
Author(s): Linguistic Data Consortium
LDC Catalog No.: LDC94T5
ISBN: 1-58563-033-0
ISLRN: 511-168-567-582-5
Member Year(s): 1994
DCMI Type(s): Text
Data Source(s): news magazine, journal articles, dictionaries, broadcast news, broadcast conversation, newswire, varied
Application(s): machine translation, language modeling, information retrieval
Language(s): Turkish, Swedish, Slovenian, Russian, Portuguese, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Lithuanian, Latin, Japanese, Scottish Gaelic, French, Estonian, English, Modern Greek (1453-), German, Danish, Bulgarian, Tosk Albanian, Standard Malay, Spanish, Serbian, Northern Uzbek, Mandarin Chinese, Italian, Dutch, Czech, Croatian, Albanian
Language ID(s): tur, swe, slv, rus, por, nor, nob, nno, lit, lat, jpn, gla, fra, est, eng, ell, deu, dan, bul, als, zsm, spa, srp, uzn, cmn, ita, nld, ces, hrv, sqi
License(s): ECI/MCI Agreement, Le Monde Material User Agreement
Online Documentation: LDC94T5 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Linguistic Data Consortium. ECI Multilingual Text LDC94T5. Web Download. Philadelphia: Linguistic Data Consortium, 1994.

The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages, totaling roughly 92 million (lexical) words. The corpora are marked up in TEI P2 conformant SGML (to varying levels of detail), and the source text remains easily accessible without the markup. Twelve of the component corpora are multilingual parallel corpora, each comprising two to nine subcorpora. All corpora in alphabetic scripts (the collection also includes some Japanese and Chinese) are encoded in the ISO 8859 family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on at least UNIX, MS-DOS and Apple systems.
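As a rough illustration of reading one of the 8-bit encoded files and recovering the text without markup, the following Python sketch decodes raw bytes with a caller-supplied ISO 8859 codec and removes anything that looks like an SGML tag. The function name `plain_text`, the sample strings, and the `<p>` and `<s>` elements are assumptions made for this example; they are not taken from the corpus documentation. The regular expression is a crude stand-in for a real SGML parser and does not expand entity references.

```python
import re

# Crude tag matcher: anything between "<" and ">" (assumption: no ">" inside attribute values).
TAG = re.compile(r"<[^>]*>")


def plain_text(raw: bytes, encoding: str) -> str:
    """Decode 8-bit text with the given codec and drop SGML-style tags."""
    text = raw.decode(encoding)
    return TAG.sub("", text)


# Hypothetical sample: a Greek word encoded in ISO 8859-7, wrapped in a <p> element.
sample = "<p>Αθήνα</p>".encode("iso-8859-7")
print(plain_text(sample, "iso-8859-7"))  # prints "Αθήνα"
```

The codec is passed in explicitly because the choice among ISO 8859-1, -5 and -7 depends on the subcorpus and should be confirmed against the corpus documentation.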

The amount of material per language varies from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table below lists the languages included, the subcorpus numbers for each language (in parentheses) and the amount of data per language in thousands of lexical words.

Language: (Subcorpus #) Kwords, ...; total Kwords
German: (70) 34291, (09) 191, (65) 20, (28) 187, (29) 59, (30) 76, (47) 24, (59) 50, (71) 21, (70A) 999; total 35918
French: (31) 4775, (04) 4121, (28) 187, (29) 59, (30) 76, (47) 24, (51) 6, (59) 50, (71) 21, (32) 1667; total 10986
Spanish: (31) 4500, (13) 830, (14) 1041, (15) 447, (47) 24, (32) 1667, (59) 50, (71) 21; total 8580
English: (31) 4222, (36) 1141, (74) 95, (28) 187, (47) 24, (51) 6, (56) 97, (59) 50, (71) 21, (32) 1667; total 7510
Dutch: (03) 5500, (02) 600, (47) 24, (71) 21; total 6145
Czech: (44) 4726; total 4726
Italian: (11) 3518, (42) 303, (58) 13, (29) 59, (30) 76, (47) 24, (71) 21; total 4014
Chinese: (78) 2895; total 2895
Greek: (10) 2515, (47) 24, (59) 50, (71) 21; total 2610
Norwegian: (41) 2226; total 2226
Swedish: (37) 1718; total 1718
Serb/Croat/Slov: (24) 700, (56) 289; total 989
Tibetan: (76) 834; total 834
Portuguese: (60) 675, (47) 24, (71) 21; total 720
Malay: (80) 563; total 563
Russian: (73) 364; total 364
Japanese: (57) 203; total 203
Turkish: (20) 173, (20A) 110; total 283
Albanian: (82) 205; total 205
Gaelic: (55) 141; total 141
Estonian: (39) 100; total 100
Uzbek: (81) 88; total 88
Latin: (74) 75; total 75
Danish: (47) 24, (71) 21; total 45
Lithuanian: (89) 20; total 20
Bulgarian: (84) 5; total 5
Total: 91969
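
Each language row above pairs subcorpus numbers (in parentheses) with Kword counts and ends with a per-language total, so the row arithmetic can be checked mechanically. The sketch below is one possible way to do that in Python; the helper names `parse_row` and `row_is_consistent` are invented for this example, and the parsing assumes that the last number on a row is its stated total.

```python
import re

# A "(subcorpus) kwords" pair, e.g. "(70A) 999".
PAIR = re.compile(r"\((\w+)\)\s*(\d+)")


def parse_row(line):
    """Split one table row into (language, [(subcorpus, kwords), ...], stated_total)."""
    name = line.split("(", 1)[0].strip(" :")
    pairs = [(sub, int(kw)) for sub, kw in PAIR.findall(line)]
    # Assumption: the final number on the row is the per-language total.
    stated_total = int(re.findall(r"\d+", line)[-1])
    return name, pairs, stated_total


def row_is_consistent(line):
    """Return True when the subcorpus Kwords add up to the stated total."""
    _, pairs, total = parse_row(line)
    return sum(kw for _, kw in pairs) == total


rows = [
    "Dutch (03) 5500 (02) 600 (47) 24 (71) 21 6145",
    "Turkish (20) 173 (20A) 110 283",
    "Danish (47) 24 (71) 21 45",
]
for row in rows:
    name, pairs, total = parse_row(row)
    print(name, sum(kw for _, kw in pairs), total)
# prints:
# Dutch 6145 6145
# Turkish 283 283
# Danish 45 45
```

Several subcorpus numbers, such as (47) and (71), appear under more than one language, so the per-language figures count shared multilingual subcorpora once for each language they contain.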
