File: afp.doc
--------------

Data Type: Text
Text Type: Journalistic (newswire services)
Domain: International news
Language: Portuguese (Brazil)

General Description:

The Agence France Presse (AFP) newswire service provides articles in six languages (French, German, Spanish, Portuguese, Arabic and English), which are supplied on six separate data streams collected via a Dateno MKII satellite receiver and associated equipment at the Linguistic Data Consortium of the University of Pennsylvania. The AFP Portuguese text data included in this corpus was processed by Henry Thompson of HCRC at the University of Edinburgh.

The Portuguese language news service actually includes some data in Spanish. Henry Thompson, who developed the software to transform the AFP data from transmission format to SGML/Latin1 format, incorporated a rudimentary check of language content into the process, and has applied an SGML tagging approach to identify the language being used on an article-by-article basis. On the basis of this tagging, the Spanish articles have been filtered out of the collection presented on this CD-ROM. However, it is possible that the language identification logic may have erred in some circumstances, leading to the mistaken inclusion of some Spanish text data.

There was also a general difficulty associated with AFP data involving intermittent transmission noise in each of the data streams, resulting in corruption of the text content. Many of the symptoms associated with this corruption were identified and eliminated from the collection, but some forms of corruption may have gone undetected, such as random loss of characters from the stream or garbling of portions within articles, yielding "printable" but nonsensical content. We are reasonably confident that these less detectable forms of corruption typically occurred in combination with the identifiable symptoms, so that having filtered out those symptoms, most if not all the data corruption has been removed.

Institution of Origin: Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA 19104

Publisher and Place of Publication: Agence France Presse, 13 place de la Bourse, 75002 Paris, France

Collection Time Span: 1994-1998

File organization: One file per day. Because of occasional reception problems, a file may contain several days of material, in which case the files for nearby dates are shortened or missing. Also, the "day" does not always start precisely at midnight. The TRAILER fields should nevertheless indicate transmission time fairly reliably.

Total Size: 254 MB (Portuguese)

Contact for Questions or to Report Errors: ldc@ldc.upenn.edu