ECI Multilingual Text


Item Name: ECI Multilingual Text
Authors: LDC
LDC Catalog No.: LDC94T5
ISBN: 1-58563-033-0
Data Type: text
Data Source(s): broadcast conversation, broadcast news, dictionaries, journal articles, news magazine, newswire, varied
Application(s): information retrieval, language modeling, machine translation
Language(s): Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin Chinese, Modern Greek, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Russian, Serbian, Slovenian, Spanish, Standard Malay, Swedish, Turkish
Language ID(s): als, bul, dan, deu, ell, eng, est, fra, gla, jpn, lat, lit, nno, nob, nor, por, rus, slv, swe, tur
Distribution: 1 DVD, Web Download
Member fee: $0 for 1994 members
Non-member Fee: US $75.00
Reduced-License Fee: US $75.00
Extra-Copy Fee: US $75.00
Non-member License: yes
Member License: yes
Readme File: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: LDC. 1994. ECI Multilingual Text. Linguistic Data Consortium, Philadelphia.

The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), contains 46 subcorpora in 27 (mainly European) languages, totaling roughly 92 million lexical words. The corpora are marked up in TEI P2 conformant SGML, at varying levels of detail, and the source text without markup remains easily accessible. Twelve of the component corpora are multilingual parallel corpora, each comprising two to nine subcorpora. Apart from the Japanese and Chinese material, all corpora are alphabetic and are encoded in the ISO 8859 family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660) and is readable on at least UNIX, MS-DOS and Apple systems.
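As an illustration of working with text in these encodings, the following is a minimal Python sketch that decodes raw bytes from one of the three ISO 8859 character sets and drops SGML tags. The script labels, the function names, and the regex-based tag stripping are assumptions made for this example and are not part of the corpus distribution; real TEI P2 SGML may use entities or tag minimization that a simple regex does not handle.

```python
import re

# Hypothetical labels mapped to Python codec names for the three
# character sets named above (ISO 8859-1, -5 and -7).
CHARSETS = {
    "latin": "iso8859_1",
    "cyrillic": "iso8859_5",
    "greek": "iso8859_7",
}

# Naive pattern for anything between angle brackets (an assumption:
# it ignores SGML entities, comments and minimized tags).
TAG = re.compile(r"<[^>]*>")


def decode_text(raw: bytes, script: str) -> str:
    """Decode raw 8-bit text using the codec chosen for the script label."""
    return raw.decode(CHARSETS[script])


def strip_markup(text: str) -> str:
    """Remove angle-bracket tags and keep the text between them."""
    return TAG.sub("", text)


if __name__ == "__main__":
    sample = b"<p>Gr\xfc\xdfe</p>"  # made-up ISO 8859-1 sample bytes
    print(strip_markup(decode_text(sample, "latin")))
```

Which character set applies to which subcorpus is not stated here, so the label passed to decode_text has to come from the documentation shipped with the corpus.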

The amount of material per language varies from about 36 million words (German) to about 5 thousand words (Bulgarian). Most sources are journalistic (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table below lists the languages included, the subcorpus numbers for each language (in parentheses), and the amount of data per language in thousands of lexical words.

Language         Subcorpus # (Kwords)                                        Totals
German           (70) 34291  (09) 191    (65) 20     (28) 187    (29) 59      35918
                 (30) 76     (47) 24     (59) 50     (71) 21     (70A) 999
French           (31) 4775   (04) 4121   (28) 187    (29) 59     (30) 76      10986
                 (47) 24     (51) 6      (59) 50     (71) 21     (32) 1667
Spanish          (31) 4500   (13) 830    (14) 1041   (15) 447    (47) 24       8580
                 (32) 1667   (59) 50     (71) 21
English          (31) 4222   (36) 1141   (74) 95     (28) 187    (47) 24       7510
                 (51) 6      (56) 97     (59) 50     (71) 21     (32) 1667
Dutch            (03) 5500   (02) 600    (47) 24     (71) 21                   6145
Czech            (44) 4726                                                     4726
Italian          (11) 3518   (42) 303    (58) 13     (29) 59     (30) 76       4014
                 (47) 24     (71) 21
Chinese          (78) 2895                                                     2895
Greek            (10) 2515   (47) 24     (59) 50     (71) 21                   2610
Norwegian        (41) 2226                                                     2226
Swedish          (37) 1718                                                     1718
Serb/Croat/Slov  (24) 700    (56) 289                                           989
Tibetan          (76) 834                                                       834
Portuguese       (60) 675    (47) 24     (71) 21                                720
Malay            (80) 563                                                       563
Russian          (73) 364                                                       364
Japanese         (57) 203                                                       203
Turkish          (20) 173    (20A) 110                                          283
Albanian         (82) 205                                                       205
Gaelic           (55) 141                                                       141
Estonian         (39) 100                                                       100
Uzbek            (81) 88                                                         88
Latin            (74) 75                                                         75
Danish           (47) 24     (71) 21                                             45
Lithuanian       (89) 20                                                         20
Bulgarian        (84) 5                                                           5
Total                                                                         91969
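Each per-language total in the table is the sum of that language's subcorpus counts. The following minimal Python sketch checks this for three rows transcribed from the table (German, Dutch, Turkish); the ROWS structure and the function name are assumptions made for this illustration only.

```python
# Subcorpus counts (in Kwords) and the stated total for a few rows,
# transcribed from the table. Keys are subcorpus numbers as strings,
# since some carry a letter suffix (e.g. "70A").
ROWS = {
    "German": (
        {"70": 34291, "09": 191, "65": 20, "28": 187, "29": 59,
         "30": 76, "47": 24, "59": 50, "71": 21, "70A": 999},
        35918,
    ),
    "Dutch": ({"03": 5500, "02": 600, "47": 24, "71": 21}, 6145),
    "Turkish": ({"20": 173, "20A": 110}, 283),
}


def find_mismatches(rows):
    """Return {language: (computed_sum, stated_total)} for rows that disagree."""
    result = {}
    for language, (parts, stated_total) in rows.items():
        computed = sum(parts.values())
        if computed != stated_total:
            result[language] = (computed, stated_total)
    return result


if __name__ == "__main__":
    # An empty dict means every listed row adds up to its stated total.
    print(find_mismatches(ROWS))
```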

Content Copyright

Portions © 1994 Trustees of the University of Pennsylvania