Data Type: Text
Text Type: Journalistic (newswire service)
Domain: International news
Languages: German

General Description: The dpa_ger/ directory contains Deutsche Presse-Agentur (German Press Agency) newswire articles. The text is collected via an AP Datafeatures telephone line installed at the Linguistic Data Consortium of the University of Pennsylvania. The format uses a labeled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). We use the ISO 8859-1 (Latin1) character set for special characters such as letters with umlauts.

Availability: CD-ROM
Related Corpora:
Institution of Origin: Linguistic Data Consortium
Publisher and Place of Publication: Deutsche Presse-Agentur GmbH, Mittelweg 38, 20148 Hamburg
Collection Time Span: 1993-1996
File organization: One file per day. Due to occasional reception problems, a file may contain several days of material, shrinking or replacing files from nearby dates. Also, the "day" does not always start precisely at midnight. The TRAILER fields should indicate transmission time fairly reliably, however.
Total size (compressed): 91MB

German Tagging Description:

The philosophy in the formatting has been to preserve as much of the original structure as possible, but to provide enough consistency to allow simple decoding of the data. Although there are some regularities in some of the header information (author, headline, etc.) that could be exploited, such information may not be consistent. Thus we have only attempted to show structure that could be extracted with reasonable consistency.

The standard data structures are illustrated in the sample below.

dpger960701.0052 x0061 &D3; &D1; BC-RUSSLAND
0062 07-01 0085
&UR; BC-RUSSLAND, 0062 &QL; &UR; Jelzin traf mit Tschernomyrdin zusammen &QC;

Moskau (dpa) - Der russische Präsident Boris Jelzin, über dessen Gesundheitszustand es Spekulationen gibt, ist am Montag mit Regierungschef Viktor Tschernomyrdin zusammengetroffen. Dies teilte Kremlsprecher Sergej Medwedjew mit.

Tschernomyrdin habe Jelzin über den G-7-Wirtschaftsgipfel in Lyon informiert. Die beiden Politiker hätten auch die Lage in Rußland vor der Stichwahl um das Präsidentenamt am kommenden Mittwoch erörtert.
dpa ba ln AP-NY-07-01-96 0502EDT &QL;

Every document is bracketed by <DOC> </DOC> tags and has a unique document number, bracketed by <DOCNO> </DOCNO> tags. Each beginning tag starts as the first character of a new line, but the ending tags could be on the same line or on later lines.

The next several fields are extracted from the standard newswire service header:

The STORYID is a code assigned by the newswire service; the same code appearing more than once within the same day may indicate repetitions, continuations, or follow-up articles. The "cat", "pri", and "sel" attributes correspond to the wire service's encoding of "category" and "priority", and a general "selector code".

The FORMAT code is usually "&D3; &D1;" as above, but specially-formatted articles may have different FORMAT codes.

The "SLUG" is supposed to be a quick keyword summary with no spaces; several articles on the same subject may have the same slug. However, especially in earlier material, we detected many malformed slugs; in these cases, we omit the SLUG field and include the suspicious "slug" as the first word of the HEADER field.

The HEADER field may contain several optional fields; this newswire does not appear to use them very consistently. The HEADER may therefore contain: the first two or three words of the headline, one or more approximate wordcounts (generally a 4-digit number), and one or more repetitions of the date (usually number-hyphen-number, e.g. "5-1", "10-31"), in addition to other information such as version or reference fields.

The PREAMBLE is simply the first several lines of the body text, which often contain a repeat of the slug and word-count, a headline, and occasional bylines.
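As an illustration of how the loosely structured HEADER and TRAILER contents described above might be decoded, here is a minimal Python sketch. The function names parse_header and parse_trailer are hypothetical, and the patterns are assumptions inferred only from the sample and the description (4-digit wordcounts, number-hyphen-number dates, and a trailer timestamp shaped like "07-01-96 0502EDT"); real data will likely contain variants that these patterns miss.

```python
import re
from datetime import datetime

# Assumption: wordcounts are standalone 4-digit tokens and dates are
# number-hyphen-number tokens, as suggested by the HEADER description.
WORDCOUNT_RE = re.compile(r"^\d{4}$")
DATE_RE = re.compile(r"^(\d{1,2})-(\d{1,2})$")
# Assumption: the trailer timestamp looks like "07-01-96 0502EDT".
TRAILER_RE = re.compile(r"(\d{2})-(\d{2})-(\d{2}) (\d{2})(\d{2})([A-Z]{3})")


def parse_header(header):
    """Sort HEADER tokens into wordcounts, dates, and everything else."""
    counts, dates, other = [], [], []
    for token in header.split():
        date_match = DATE_RE.match(token)
        if WORDCOUNT_RE.match(token):
            counts.append(int(token))
        elif date_match:
            dates.append((int(date_match.group(1)), int(date_match.group(2))))
        else:
            other.append(token)
    return {"wordcounts": counts, "dates": dates, "other": other}


def parse_trailer(trailer):
    """Return (naive datetime, zone abbreviation), or None if not found."""
    match = TRAILER_RE.search(trailer)
    if match is None:
        return None
    month, day, year, hour, minute = (int(g) for g in match.groups()[:5])
    try:
        # Assumption: two-digit years are in the 1900s (collection 1993-1996).
        stamp = datetime(1900 + year, month, day, hour, minute)
    except ValueError:
        # Transmission noise can produce impossible dates or times.
        return None
    return stamp, match.group(6)
```

Applied to the sample above, parse_header("0062 07-01 0085") yields the wordcounts 62 and 85 and the date (7, 1), and parse_trailer on the sample trailer yields 1 July 1996, 05:02, with the zone abbreviation "EDT". The time zone is returned as a plain string rather than being converted, since the description does not say which zones occur.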
Detection of the boundary between this preliminary information and the actual body of the article (TEXT) is not straightforward, so users should expect some errors. In particular, the PREAMBLE may occasionally contain a daily news summary, with little or no interesting TEXT afterward, because such articles tend to use non-standard formatting that makes it difficult to correctly identify the start of the main body text.

Both in the PREAMBLE and in the TEXT, we have left intact several formatting instructions, in the belief that they may indicate useful information:

&QL; &QC; &QR;   special line enders
                 All mean "hard line break" plus some formatting information (L=flush Left, C=Center, R=flush Right)

&TL;             preformatted line
                 Tabular or other preformatted material. Line break at end of line should be kept.

&UR;             "Upper Rail": boldface
                 Appears in body text as well as headers.

&LR;             end boldface
                 Seems to be used only in body text. In headers, perhaps bold ends at "&Q[LCR];".

For instance, headlines often, but not always, start with &UR; and end with &QC;, while a repeated slug might end with &QL;. In the body text, emphasized text may start with &UR; and end with &LR;.

Both as part of the philosophy of leaving the data as close to the original as possible, and because it is impossible to check all the data manually, there are many "errors" in the data. These range from errors in the original data, such as noise in the newswire transmission or other typographical errors, to potential errors in the reformatting done at the LDC. The error-checking has concentrated on allowing readability of the data rather than on correcting content. This means that there have been automated checks for control characters and other easy indicators of noise, for correct matching of the beginning and end tags, and for complete DOC and DOCNO fields.
The types of "errors" remaining include fragment sentences, strange formatting around tables or other "non-textual" items, misspellings, mismatched or variant quotation marks, missing fields (generally because they are missing from the original data), etc. Note that in many instances, lines with tabular material begin with "&TL;" as described above, but this may not always be the case.
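Users who want plain text may wish to resolve the formatting instructions described above. The following Python sketch shows one possible treatment: &QL;, &QC; and &QR; become line breaks, &UR; and &LR; are dropped, and lines starting with &TL; keep their internal spacing. The function name to_plain_text is hypothetical, and discarding the alignment and boldface information is a design assumption of this sketch, not something the corpus prescribes. Tabular lines that lack the &TL; marker will be treated as ordinary text.

```python
import re

# Line enders: each means "hard line break" plus an alignment hint.
LINE_END_RE = re.compile(r"&Q[LCR];")
# Boldface start ("Upper Rail") and end markers.
BOLD_RE = re.compile(r"&(?:UR|LR);")


def to_plain_text(text):
    """Resolve formatting codes; alignment and boldface are discarded."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            # Keep blank lines so paragraph breaks survive.
            out.append("")
            continue
        if line.lstrip().startswith("&TL;"):
            # Preformatted line: keep internal spacing and the line break.
            out.append(line.replace("&TL;", "", 1).rstrip())
            continue
        line = BOLD_RE.sub("", line)
        # Every line ender splits the line at a hard break.
        for piece in LINE_END_RE.split(line):
            cleaned = " ".join(piece.split())
            if cleaned:
                out.append(cleaned)
    return "\n".join(out)
```

For the preamble line of the sample document, this produces two lines: "BC-RUSSLAND, 0062" followed by "Jelzin traf mit Tschernomyrdin zusammen". Files should be opened with the ISO 8859-1 encoding mentioned in the general description before any such processing.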