CETEMPúblico:

version 1.7, distributed by the Linguistic Data Consortium (LDC)

Computational processing of Portuguese

Last update: 6 August 2001


CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) is a corpus of approximately 180 million words of newspaper text from the Portuguese daily newspaper PÚBLICO. It has been compiled for research and development in natural language processing (NLP) by the project Computational processing of Portuguese, under an agreement signed by PÚBLICO and the Portuguese Ministry of Science and Technology (MCT) in April 2000.

Apart from the LDC distribution, CETEMPúblico is made available in the following ways:

  1. online on the Web, through the project site (http://www.portugues.mct.pt/). Address for querying the corpus: http://corpora.portugues.mct.pt/. Page for updated information: http://www.linguateca.pt/cetempublico/informacoes.html.
  2. on one CD in text format (version 1.0), sent by mail free of charge to whoever registers on the information page mentioned above
  3. on two CDs in CQP format for use with the IMS-CWB corpus processing system (Corpus Workbench from the Institut für Maschinelle Sprachverarbeitung, University of Stuttgart)

The present distribution contains version 1.7 of CETEMPúblico, created in Oslo on 6 August 2001 in order to conform to SGML encoding.

The corpus is in 196 compressed text files, with a name of the form cetemXXX.gz, from cetem001.gz to cetem196.gz.
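As an illustration of the naming scheme above, here is a minimal Python sketch that lists the 196 file names and reads one compressed file line by line. The function names corpus_file_names and iter_lines are hypothetical, not part of the distribution. Decoding as ISO-8859-1 is an assumption based on the character encoding mentioned later in this document.

```python
import gzip


def corpus_file_names(first=1, last=196):
    """Return the names cetem001.gz ... cetem196.gz in order."""
    return [f"cetem{i:03d}.gz" for i in range(first, last + 1)]


def iter_lines(path, encoding="iso-8859-1"):
    """Yield the decoded lines of one compressed corpus file.

    The encoding default is an assumption, not something the
    distribution guarantees; pass another value if needed.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Reading lazily keeps memory use low, since each file is decompressed as it is consumed rather than loaded whole.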


FAQ - Frequently Asked Questions

Who are the envisaged users of CETEMPúblico?

This corpus is mainly aimed at those who develop computer programs that process the Portuguese language and who need raw material for their work. The text versions on CD were conceived for this kind of user.

On the other hand, we want the corpus to be useful to everyone who studies the Portuguese language and wishes to check their hypotheses against previously organized text material. The online and the CQP versions are meant for such users, who are, in any case, also welcome to get the corpus on CD in order to process it locally, possibly by means of the corpus processing system of their choice.

What is PÚBLICO?

PÚBLICO is a widely read daily Portuguese newspaper. It was founded in 1991 and was the first newspaper in Portugal to make available an online edition on the Web, Publico.pt.

Are there any restrictions to the use of CETEMPúblico?

As stated in the User Conditions file, CETEMPúblico can be used for research and technological development. Only its direct commercial exploitation is not allowed.

What are my duties as a user of CETEMPúblico?

The PÚBLICO newspaper should always be acknowledged as the source of the material in any presentation of work that makes use of CETEMPúblico, such as articles, theses and talks.

A free copy of any commercial products emerging from R&D projects using CETEMPúblico should be given to the PÚBLICO newspaper.

Am I allowed to reconstruct the full newspaper articles?

No. The agreement signed between MCT and PÚBLICO required us to chop up the articles into extracts and shuffle them so that no reconstruction would be possible. The corpus is not supposed to replace the newspaper's archives.

Does CETEMPúblico include all the text published by PÚBLICO?

No. On the one hand, several editions were missing from the material provided by the newspaper, and we excluded newspaper sections not considered relevant for the goals of the corpus, such as quotations from other Portuguese newspapers ("Diz-se"), the errata section ("O PÚBLICO errou"), and sports results in table format (classifications, rankings, results, etc.). On the other hand, CETEMPúblico includes a large number of articles that were never actually published for lack of space or opportunity.

Is the language of CETEMPúblico exclusively European Portuguese?

The vast majority is Portuguese from Portugal, although there are a few texts by Brazilian and African writers.

What is included in CETEMPúblico?

The corpus includes the text of around 2,600 editions of PÚBLICO, written between 1991 and 1998, amounting to approximately 180 million words.

CETEMPúblico 1.7 contains 1,504,258 extracts (CETEMPúblico 1.0 had 1,567,625), each labeled with its section of origin and semester. Each extract is divided into paragraphs and sentences, and titles and authors are marked as such. See some examples of extracts.

How were the words counted?

Tokens containing at least one letter or digit were considered words. Punctuation marks were not considered words.
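The rule above can be sketched in Python as follows. This is a literal reading of the sentence, with the hypothetical helper names is_word and count_words, and it is not the compilers' actual counting program. It is not expected to reproduce the figures below, since under a literal reading a token such as (1993) counts as a word.

```python
def is_word(token):
    """A token is a word if it has at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


def count_words(tokens):
    """Count the tokens that qualify as words under is_word."""
    return sum(1 for token in tokens if is_word(token))
```

For example, in the token list O, PÚBLICO, ",", 1991, "." only the two punctuation marks fail the test.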

Some approximate numbers:

              Tokens        Types
Units         229,038,019   1,033,041
Words         191,687,833     999,059
Punctuation    13,065,151      33,982

"Punctuation" includes tokens with punctuation marks, such as (1993), a) or 17:53.

Structure             Number
Extracts <ext>        1,504,258
Paragraphs <p>        2,571,735
Sentences <s>         7,082,094
Titles <t>              655,059
Authors <a>             247,392
List elements <li>       80,060

What is the corpus structure?

We specify the corpus structure with the help of a small BNF grammar. Terminals appear in bold:

corpus = <corpus> extract+ </corpus>
extract = extract_id extract_contents </ext>
extract_contents = paragraph+
paragraph = title | author_id | <p> sentence+ </p> | list_element
title = <t> token+ </t>
author_id = <a> token+ </a>
list_element = <li> token+ </li>
sentence = ( <s> | <s tipo=frag> ) token+ </s>
token = | palavra | sinal_pontuação | identificador
X = ( *+ ) | *+
extract_id = <ext n=number sec=sec_id sem=semester >
number = [0-9]+
sec_id = soc | pol | clt | des | opi | eco | com | clt-soc | pol-soc | nd
semester = 91a | 91b | 92a | 92b | 93a | 93b | 94a | 94b | 95a | 95b | 96a | 96b | 97a | 97b | 98a | 98b
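As an illustration of the extract_id rule, here is a minimal Python sketch that parses an opening <ext ...> tag and checks the section and semester against the values listed in the grammar. The names EXT_OPEN, SECTIONS and parse_extract_id are hypothetical. The sketch assumes the attributes appear unquoted and in the order n, sec, sem, exactly as the grammar shows; the actual files may differ.

```python
import re

# Section identifiers listed in the grammar's sec_id rule.
SECTIONS = {
    "soc", "pol", "clt", "des", "opi", "eco", "com",
    "clt-soc", "pol-soc", "nd",
}

# Assumed layout: <ext n=NUMBER sec=SEC_ID sem=SEMESTER >
# Semesters run from 91a to 98b, as in the semester rule.
EXT_OPEN = re.compile(
    r"<ext\s+n=(\d+)\s+sec=([a-z-]+)\s+sem=(9[1-8][ab])\s*>"
)


def parse_extract_id(line):
    """Return a dict with n, sec and sem, or None if the line
    is not a well-formed extract_id under the assumptions above."""
    match = EXT_OPEN.match(line.strip())
    if match is None:
        return None
    number, sec, sem = match.groups()
    if sec not in SECTIONS:
        return None
    return {"n": int(number), "sec": sec, "sem": sem}
```

Returning None instead of raising lets a caller scan every line of a file and keep only those that open an extract.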

Notes:

Alternatively, we provide a DTD for SGML parsers.

Is CETEMPúblico going to be tagged and/or parsed?

We are currently working on this, and plan to grant access to the parsed version through the AC/DC project.

Do the characters strictly reflect newspaper usage?

In some cases we made normalization decisions (the original material was encoded in Macintosh characters, while we chose the ISO-8859-1 character encoding standard). Some of the changes performed are:

Is there more information about CETEMPúblico?

You can read more about this corpus in two articles, available here in electronic form:

Is all material included in CETEMPúblico in a valid format?

This was not the case for the previous versions, but we have verified that it holds for version 1.7.

Are there other known problems in CETEMPúblico?

See also our ACL'2001 paper for precision and recall on structural markup concerning titles, author identification and sentence separation.

How can I remain updated about future CETEMPúblico changes?

Whenever we learn about new problems with the corpus, we try to create patches to solve them. They will be available from CETEMPúblico's page. We will also update the corpus version to which we give access on the Web. So far (for users of version 1.0), we have made available 6 patches in Perl, named patch_cetempublico_1.0.x.pl, which may be downloaded from the information page.

To stay informed about the corpus's progress, you can also subscribe to the CETEMPúblico mailing list by sending an email to projecto@informatics.sintef.no.


Acknowledgements


Contact the CETEMPúblico compilers at projecto@informatics.sintef.no