Documentation for TREC Spanish

Introduction

This publication contains the TREC Spanish Corpus produced by the Linguistic Data Consortium (LDC), catalog number LDC2000T51, ISBN 1-58563-177-9. It is the set of documents used for the Spanish task in TRECs 3-5. It consists of approximately 250 megabytes of text from the Mexican newspaper El Norte and 300 megabytes of Agence France Presse 1994 newswire text, both formatted to include TREC document IDs. The El Norte documents were used for TRECs 3-4, and the Agence France Presse documents for TREC 5. The topics (questions) and relevance judgments (right answers) that complete the test collections are not part of this publication; they can be downloaded from the Data/Non-English section of the TREC web site (http://trec.nist.gov).

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus: http://www.ldc.upenn.edu/Catalog/LDC2000T51.html.

See file.tbl for the directory structure of this publication and a complete list of its files.

Data Structure

The files in the afp_text and infosel_data subdirectories are ASCII-encoded SGML files. They conform to the afp_trec.dtd and infosel.dtd DTDs, respectively, both of which are found in the doc subdirectory. Example files are afp_text/af940512 and infosel_data/ism_001.
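
As a rough illustration of reading these files, the following Python sketch splits one SGML file into documents and pulls out each document id. It is not part of the corpus tooling. It assumes the conventional TREC wrapper tags <DOC> and <DOCNO>, which this documentation does not confirm; check afp_trec.dtd and infosel.dtd for the actual element names before relying on it. The function names iter_trec_docs and read_trec_file are hypothetical, and the latin-1 codec is used only because it is a superset of ASCII that never fails on stray bytes.

```python
import re
from typing import Iterator, List, Tuple

# ASSUMPTION: each document is wrapped in <DOC>...</DOC> and carries its id in
# <DOCNO>...</DOCNO>, as in most TREC collections. Verify against the DTDs in
# the doc subdirectory; the element names are not stated in this README.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL | re.IGNORECASE)


def iter_trec_docs(sgml_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (docno, raw_body) pairs for every <DOC> element in the text."""
    for match in DOC_RE.finditer(sgml_text):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            # Skip documents without an id rather than guessing one.
            continue
        yield docno.group(1), body


def read_trec_file(path: str) -> List[Tuple[str, str]]:
    """Read one corpus file and return its (docno, raw_body) pairs."""
    # latin-1 decodes any byte sequence, so plain ASCII files read unchanged.
    with open(path, encoding="latin-1") as handle:
        return list(iter_trec_docs(handle.read()))
```

For example, read_trec_file("afp_text/af940512") would return one pair per document in that file, provided the tag assumption holds. A regular-expression split is used here instead of a full SGML parser to keep the sketch dependency-free; it ignores the DTDs and does not validate the markup.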

Content Copyright

Portions © 1994-1996 Agence France Presse
Portions © 1994 El Norte


Contact: ldc@ldc.upenn.edu
© 1996-2000 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.