ECI Multilingual Text
Item Name: | ECI Multilingual Text |
Author(s): | Linguistic Data Consortium |
LDC Catalog No.: | LDC94T5 |
ISBN: | 1-58563-033-0 |
ISLRN: | 511-168-567-582-5 |
DOI: | https://doi.org/10.35111/h2vd-p896 |
Member Year(s): | 1994 |
DCMI Type(s): | Text |
Data Source(s): | news magazine, journal articles, dictionaries, broadcast news, broadcast conversation, newswire, varied |
Application(s): | machine translation, language modeling, information retrieval |
Language(s): | Swedish, Slovenian, Russian, Portuguese, Norwegian Bokmål, Norwegian Nynorsk, Lithuanian, Latin, Japanese, Scottish Gaelic, French, Estonian, English, Modern Greek (1453-), German, Danish, Bulgarian, Tosk Albanian, Spanish, Serbian, Mandarin Chinese, Italian, Dutch, Czech, Croatian, Albanian, Uzbek, Malay |
Language ID(s): | swe, slv, rus, por, nob, nno, lit, lat, jpn, gla, fra, est, eng, ell, deu, dan, bul, als, spa, srp, cmn, ita, nld, ces, hrv, sqi, uzb, msa |
License(s): | ECI/MCI Agreement; Le Monde Material User Agreement |
Online Documentation: | LDC94T5 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Linguistic Data Consortium. ECI Multilingual Text LDC94T5. Web Download. Philadelphia: Linguistic Data Consortium, 1994. |
The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages, totaling roughly 92 million (lexical) words. The corpora are marked up in TEI P2-conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora, each with two to nine subcorpora. All the alphabetic corpora (there is also some Japanese and Chinese) are encoded in the ISO 8859 family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on at least UNIX, MS-DOS and Apple systems.
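As a rough illustration of what the encoding and markup choices mean in practice, the following Python sketch decodes a byte string in one of the three ISO 8859 parts named above and drops anything that looks like an SGML tag. The function name `plain_text`, the `script` keys and the sample markup are assumptions made for this example, not part of the corpus documentation; the regular expression is a crude tag matcher, not a TEI-aware SGML parser, and it does not resolve SGML entities.

```python
import re

# Python codec names for the three ISO 8859 parts named in the description.
# The dictionary keys are labels invented for this sketch.
ISO_CODECS = {
    "latin": "iso8859_1",     # ISO 8859-1
    "cyrillic": "iso8859_5",  # ISO 8859-5
    "greek": "iso8859_7",     # ISO 8859-7
}

# Crude matcher for SGML start and end tags (assumption: no ">" inside a tag).
TAG = re.compile(r"<[^>]*>")


def plain_text(raw: bytes, script: str) -> str:
    """Decode raw corpus bytes and remove anything that looks like a tag."""
    text = raw.decode(ISO_CODECS[script])
    return TAG.sub("", text)


# Hypothetical sample; real corpus files carry richer TEI P2 markup.
sample = "<p>Ein kurzes Beispiel für Text.</p>".encode("iso8859_1")
print(plain_text(sample, "latin"))
```

The right codec has to be chosen per subcorpus; decoding Greek or Cyrillic bytes as ISO 8859-1 would not raise an error but would silently produce the wrong characters.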
The amount of material per language varies from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table below lists the languages included, the subcorpus numbers for each language (in parentheses) and the amount of data per language in thousands of lexical words.
Language | (Subcorpus #) Kwords | Total
German | (70) 34291, (09) 191, (65) 20, (28) 187, (29) 59, (30) 76, (47) 24, (59) 50, (71) 21, (70A) 999 | 35918
French | (31) 4775, (04) 4121, (28) 187, (29) 59, (30) 76, (47) 24, (51) 6, (59) 50, (71) 21, (32) 1667 | 10986
Spanish | (31) 4500, (13) 830, (14) 1041, (15) 447, (47) 24, (32) 1667, (59) 50, (71) 21 | 8580
English | (31) 4222, (36) 1141, (74) 95, (28) 187, (47) 24, (51) 6, (56) 97, (59) 50, (71) 21, (32) 1667 | 7510
Dutch | (03) 5500, (02) 600, (47) 24, (71) 21 | 6145
Czech | (44) 4726 | 4726
Italian | (11) 3518, (42) 303, (58) 13, (29) 59, (30) 76, (47) 24, (71) 21 | 4014
Chinese | (78) 2895 | 2895
Greek | (10) 2515, (47) 24, (59) 50, (71) 21 | 2610
Norwegian | (41) 2226 | 2226
Swedish | (37) 1718 | 1718
Serb/Croat/Slov | (24) 700, (56) 289 | 989
Tibetan | (76) 834 | 834
Portuguese | (60) 675, (47) 24, (71) 21 | 720
Malay | (80) 563 | 563
Russian | (73) 364 | 364
Japanese | (57) 203 | 203
Turkish | (20) 173, (20A) 110 | 283
Albanian | (82) 205 | 205
Gaelic | (55) 141 | 141
Estonian | (39) 100 | 100
Uzbek | (81) 88 | 88
Latin | (74) 75 | 75
Danish | (47) 24, (71) 21 | 45
Lithuanian | (89) 20 | 20
Bulgarian | (84) 5 | 5
Total | | 91969
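
Each row total is the sum of that language's per-subcorpus Kword figures, which can be checked mechanically. The Python sketch below is one way to do that under simple assumptions: every figure is written as "(subcorpus) Kwords" and the last integer on the line is the stated total. The function name `check_row` is invented for this example, and the two sample rows are copied from the table.

```python
import re

# Matches one "(subcorpus) Kwords" pair, e.g. "(70A) 999".
PAIR = re.compile(r"\(([^)]+)\)\s+(\d+)")


def check_row(row: str) -> tuple[int, int]:
    """Return (sum of per-subcorpus Kwords, stated total) for one table row."""
    parts = sum(int(kwords) for _, kwords in PAIR.findall(row))
    stated = int(re.findall(r"\d+", row)[-1])  # last integer on the line
    return parts, stated


rows = [
    "Dutch (03) 5500 (02) 600 (47) 24 (71) 21 6145",
    "Turkish (20) 173 (20A) 110 283",
]
for row in rows:
    parts, stated = check_row(row)
    print(row.split()[0], parts, stated, parts == stated)
```

Because the pattern ignores separators between pairs, the same check works whether the figures in a row are separated by spaces, commas or pipes.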