CETEMPúblico:

version 1.7, distributed by the Linguistic Data Consortium (LDC)

Last update: 6 August 2001

CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) is a corpus of approximately 180 million words of newspaper text from the Portuguese daily newspaper PÚBLICO. It has been compiled for research and development in natural language processing (NLP) by the project Computational Processing of Portuguese, under an agreement signed by PÚBLICO and the Portuguese Ministry of Science and Technology (MCT) in April 2000.

Apart from the LDC distribution, CETEMPúblico is made available in the following ways:

online on the Web, through the project site (http://www.portugues.mct.pt/). Address for querying the corpus: http://corpora.portugues.mct.pt/. Page for updated information: http://www.linguateca.pt/cetempublico/informacoes.html.
on one CD in text format (version 1.0), sent by mail free of charge to whoever registers on the information page mentioned above
on two CDs in CQP format for use with the IMS-CWB corpus processing system (Corpus Workbench from the Institut für Maschinelle Sprachverarbeitung, University of Stuttgart)

The present distribution contains version 1.7 of CETEMPúblico, created in Oslo on 6 August 2001 in order to conform to SGML encoding.

The corpus is distributed as 196 compressed text files with names of the form cetemXXX.gz, from cetem001.gz to cetem196.gz.
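
A minimal sketch of how these files might be read in Python. It assumes the files sit in a local directory (the corpus_dir argument is hypothetical) and that the text is ISO-8859-1, the encoding named in the character normalization answer of this FAQ:

```python
import gzip
from pathlib import Path


def cetem_file_names(first=1, last=196):
    """Return the file names cetem001.gz .. cetem196.gz."""
    return [f"cetem{i:03d}.gz" for i in range(first, last + 1)]


def iter_corpus_lines(corpus_dir):
    """Yield every line of every corpus file found in corpus_dir, in order.

    corpus_dir is a hypothetical local path; ISO-8859-1 is assumed.
    """
    for name in cetem_file_names():
        path = Path(corpus_dir) / name
        if not path.exists():
            continue  # skip files that are not present locally
        with gzip.open(path, mode="rt", encoding="iso-8859-1") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Reading the files lazily, one line at a time, avoids holding the whole corpus in memory.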

FAQ - Frequently Asked Questions

Who are the envisaged users of CETEMPúblico?

This corpus is aimed mainly at those who develop computer programs that process the Portuguese language and who need raw material for their work. The text versions on CD were conceived for this kind of user.

On the other hand, we want the corpus to be useful to everyone who studies the Portuguese language and wishes to check their hypotheses in previously organized text material. The online and the CQP versions are meant for such users, who are, in any case, also welcome to get it on CD in order to process the corpus locally, possibly by means of the corpus processing system of their choice.

What is PÚBLICO?

PÚBLICO is a widely read daily Portuguese newspaper. It was founded in 1991 and was the first newspaper in Portugal to make available an online edition on the Web, Publico.pt.

Are there any restrictions to the use of CETEMPúblico?

As stated in the User Conditions file, CETEMPúblico can be used for research and technological development. Only its direct commercial exploitation is not allowed.

What are my duties as a user of CETEMPúblico?

The PÚBLICO newspaper should always be acknowledged as the source of the material in any presentation of work that makes use of CETEMPúblico, such as articles, theses and talks.

A free copy of any commercial products emerging from R&D projects using CETEMPúblico should be given to the PÚBLICO newspaper.

Am I allowed to reconstruct the full newspaper articles?

No. The agreement signed between MCT and PÚBLICO required us to chop the articles into extracts and shuffle them so that no reconstruction would be possible. The corpus is not meant to replace the newspaper's archives.

Does CETEMPúblico include all the text published by PÚBLICO?

No. On the one hand, several editions were missing from the material provided by the newspaper, and we excluded newspaper sections not considered relevant to the goals of the corpus, such as quotations from other Portuguese newspapers ("Diz-se"), the errata section ("O PÚBLICO errou"), and sports results in table format (classifications, rankings, results, etc.). On the other hand, CETEMPúblico includes a large number of articles that were never actually published, for lack of space or opportunity.

Is the language of CETEMPúblico exclusively European Portuguese?

The vast majority is Portuguese from Portugal, although there are a few texts by Brazilian and African writers.

What is included in CETEMPúblico?

The corpus includes the text of around 2,600 editions of PÚBLICO, written between 1991 and 1998, amounting to approximately 180 million words.

CETEMPúblico 1.7 contains 1,504,258 extracts (CETEMPúblico 1.0 had 1,567,625), each labelled with its section of origin and semester. Each extract is divided into paragraphs and sentences, and titles and authors are marked as such.

How were the words counted?

Tokens containing at least one letter or digit were considered words. Punctuation marks were not considered words.
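
The counting rule above can be sketched in Python as follows. This is only an illustration, not the script used to produce the figures in this document: it assumes the text has already been split into tokens, and it assumes that types are case-sensitive, which this document does not specify.

```python
from collections import Counter


def is_word(token):
    """True if the token has at least one letter or digit (the rule above)."""
    return any(ch.isalnum() for ch in token)


def count_words(tokens):
    """Return (word tokens, word types) for an iterable of token strings.

    Types are counted case-sensitively; this is an assumption.
    """
    counts = Counter(tok for tok in tokens if is_word(tok))
    return sum(counts.values()), len(counts)


tokens = ["O", "jornal", ",", "o", "jornal", "--", "1991", "."]
print(count_words(tokens))  # (5, 4)
```

In the example, the comma, the double hyphen and the full stop are not counted, and "O" and "o" count as two distinct types.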

Some approximate numbers:

              Tokens        Types
Units         229,038,019   1,033,041
Words         191,687,833   999,059
Punctuation   13,065,151    33,982

"Punctuation" includes tokens with punctuation marks, such as (1993), a) or 17:53.

Structure             Number
Extracts <ext>        1,504,258
Paragraphs <p>        2,571,735
Sentences <s>         7,082,094
Titles <t>            655,059
Authors <a>           247,392
List elements <li>    80,060

What is the corpus structure?

We specify the corpus structure with the help of a small BNF grammar. Terminals appear in bold:

corpus = <corpus> extract+ </corpus>
extract = extract_id extract_contents </ext>
extract_contents = paragraph+
paragraph = title | author_id | <p> sentence+ </p> | list_element
title = <t> token+ </t>
author_id = <a> token+ </a>
list_element = <li> token+ </li>
sentence = ( <s> | <s tipo=frag> ) token+ </s>
token = | palavra | sinal_pontuação | identificador
X = ( *+ ) | *+
extract_id = <ext n=number sec=sec_id sem=semester >
number = [0-9]+
sec_id = soc | pol | clt | des | opi | eco | com | clt-soc | pol-soc | nd
semester = 91a | 91b | 92a | 92b | 93a | 93b | 94a | 94b | 95a | 95b | 96a | 96b | 97a | 97b | 98a | 98b

Notes:

The parentheses and the * in the definition of X are terminals (as opposed to all other occurrences).
number ranges from 1 to 1567625 and is unique (some numbers no longer exist).
palavra (word), sinal_pontuação (punctuation mark) and identificador (identifier) in the above grammar are not further analysed (this is left to a Portuguese tokenizer).
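
As an illustration of the extract_id production, the sketch below parses an <ext ...> tag with a regular expression and groups the lines of each extract. The function names are ours, and the code assumes that each <ext ...> and </ext> tag sits on a line of its own, which the grammar itself does not guarantee.

```python
import re

# Pattern for extract_id: <ext n=number sec=sec_id sem=semester >
EXT_TAG = re.compile(r"<ext\s+n=(\d+)\s+sec=([a-z-]+)\s+sem=(\d{2}[ab])\s*>")


def parse_ext_tag(line):
    """Return (number, section, semester) for an <ext ...> tag, else None."""
    match = EXT_TAG.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)


def iter_extracts(lines):
    """Yield (header, body_lines) for each <ext ...> ... </ext> block.

    Assumes each <ext ...> and </ext> tag sits on a line of its own.
    """
    header, body = None, []
    for line in lines:
        parsed = parse_ext_tag(line)
        if parsed is not None:
            header, body = parsed, []  # a new extract starts here
        elif line.strip() == "</ext>":
            if header is not None:
                yield header, body
            header, body = None, []
        elif header is not None:
            body.append(line)
```

For example, parse_ext_tag("<ext n=12 sec=clt-soc sem=93b>") returns (12, "clt-soc", "93b"). For strict validation, an SGML parser with the DTD is the safer route.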

Alternatively, we provide a DTD for SGML parsers.

Is CETEMPúblico going to be tagged and/or parsed?

We are currently working on this, and plan to grant access to the parsed version through the AC/DC project.

Do the characters strictly reflect newspaper usage?

In some cases we made normalization decisions (the original material was encoded in Macintosh characters, while we chose the ISO-8859-1 character encoding standard). Some of the changes performed are:

Long dash was transformed into "--" (a sequence of two hyphens).
All quotes are encoded as « or ».
The "oe ligature" character was transformed into the sequence of the letters O and E as usual in ISO-8859-1 encoding.
The decimal character 127 (hexadecimal 7F) was replaced by hyphen.
The (few) cases of << and >> were transformed into their one-character equivalents, namely « and ».
The characters &, < and > were translated into the corresponding SGML entities, namely &amp;, &lt; and &gt;.
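
Some of the changes listed above can be sketched in Python as an ordered list of string replacements. This is an illustration only, not the conversion script actually used: it assumes the input has already been decoded to Unicode, that the long dash is U+2014, and that the function runs on raw text before any markup is added. Quote normalization is left out because the mapping from each original quote character to « or » is not specified here.

```python
# Replacements are applied in order. << and >> must be handled before the
# single < and > are turned into entities.
REPLACEMENTS = [
    ("\u2014", "--"),    # long dash (assumed to be U+2014) -> two hyphens
    ("\u0153", "oe"),    # lowercase oe ligature -> letters o and e
    ("\u0152", "OE"),    # uppercase OE ligature -> letters O and E
    ("\x7f", "-"),       # decimal 127 (hexadecimal 7F) -> hyphen
    ("<<", "\u00ab"),    # << -> left guillemet
    (">>", "\u00bb"),    # >> -> right guillemet
    ("&", "&amp;"),      # SGML entities for the markup characters
    ("<", "&lt;"),
    (">", "&gt;"),
]


def normalize(text):
    """Apply the replacements above, in order, to a raw text string."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


print(normalize("A & B < C"))  # A &amp; B &lt; C
```

The ampersand is escaped before < and > so that the entities produced for those two characters are not escaped a second time.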

Is there more information about CETEMPúblico?

You can read more about this corpus in two articles, available in electronic form:

Paulo Rocha & Diana Santos. "CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa", in Maria das Graças Volpe Nunes (ed.), Actas do V Encontro para o processamento computacional da língua portuguesa escrita e falada, PROPOR'2000 (Atibaia, São Paulo, Brasil, 19 a 22 de Novembro de 2000), pp. 131-140.
Diana Santos & Paulo Rocha. "Evaluating CETEMPúblico, a free resource for Portuguese", in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, ACL'2001 (Toulouse, 9-11 July 2001), pp. 442-449.

Is all material included in CETEMPúblico in a valid format?

Although this was not the case for previous versions, we have checked that it holds for version 1.7.

Are there other known problems in CETEMPúblico?

There are some repeated articles (and consequently repeated extracts). Although from version 1.2 on we have tried to eliminate duplicated extracts (keeping just the first), the corpus still contains slightly different articles, which we take to be different versions of the same text.
Paragraphs identified as titles or authors were always joined to the previous paragraph before extract separation. The articles were divided on the criterion of "two paragraphs", and some articles included several short news items ("Breves"). Therefore, not only are some (sub)titles separated from the news they refer to, but in some cases they were joined to a completely different piece of information.

See also our ACL'2001 paper for precision and recall on structural markup concerning titles, author identification and sentence separation.

How can I remain updated about future CETEMPúblico changes?

Whenever we learn about new problems with the corpus, we try to create patches to solve them. They will be available from CETEMPúblico's page. We will also update the corpus version to which we give access on the Web. So far (for users of version 1.0), we have made available 6 patches in Perl, named patch_cetempublico_1.0.x.pl that may be downloaded from the information page.

To stay informed about the corpus progress, you can also subscribe to the CETEMPúblico mailing list by sending an e-mail to projecto@informatics.sintef.no.

Acknowledgements

At PÚBLICO, we heartily thank José Vítor Malheiros, director of the electronic version, without whom the corpus would not exist, and Paulo Almeida for technical support concerning the newspaper files.
We are grateful to Stefan Evert and Arne Fitschen (University of Stuttgart) for help and support as far as the IMS Corpus Workbench is concerned.
We thank Pedro Veiga for starting the whole project from the MCT side, as well as providing administrative facilities for the burning and distribution of the first batch of CDs.
We thank Miguel Andrade for having carried out the legal work necessary for the project.
We thank José João Dias de Almeida for valuable suggestions to handle the repeated extract problem.
And finally, we thank Andrew Cole at LDC for help in validating the present version.

Contact the CETEMPúblico compilers at projecto@informatics.sintef.no
