Data Type: Text
Text Type: Journalistic (newswire service)
Domain: International news
Languages: German


General Description:

The dpa_ger/ directory contains Deutsche Presse-Agentur (German Press
Agency) newswire articles. The text is collected via an AP Datafeatures
telephone line installed at the Linguistic Data Consortium of the
University of Pennsylvania.

The format uses a labeled bracketing, expressed in the style of SGML
(Standard Generalized Markup Language).  We use the ISO 8859-1
(Latin1) character set for special characters such as letters with
umlauts.


Availability: CD-ROM

Related Corpora:

Institution of Origin: Linguistic Data Consortium

Publisher and Place of Publication: Deutsche Presse-Agentur GmbH,
Mittelweg 38, 20148 Hamburg

Collection Time Span: 1993-1996

File organization: one file per day.
   Because of occasional reception problems, a file may contain several
days of material, in which case files for nearby dates are shorter or missing.
Also, the "day" does not always start precisely at midnight.  The TRAILER
fields should indicate transmission time fairly reliably, however.
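
As a rough illustration of how the transmission time might be recovered, the
following Python sketch parses a TRAILER line of the form shown in the sample
under "Tagging Description" ("AP-NY-07-01-96 0502EDT").  The function name
parse_trailer is hypothetical, and the field layout (month-day-year, then a
4-digit time and a time zone abbreviation) is an assumption inferred from
that one sample; other trailers may differ.

```python
import re
from datetime import datetime

# Assumed layout: MM-DD-YY, whitespace, HHMM, zone abbreviation (e.g. EDT).
TRAILER_RE = re.compile(r"(\d{2})-(\d{2})-(\d{2})\s+(\d{2})(\d{2})([A-Z]{2,4})")

def parse_trailer(trailer):
    """Return (naive datetime, zone abbreviation), or None if unparseable."""
    match = TRAILER_RE.search(trailer)
    if match is None:
        return None
    month, day, year, hour, minute = (int(g) for g in match.groups()[:5])
    try:
        # The collection spans 1993-1996, so two-digit years are read as 19YY.
        stamp = datetime(1900 + year, month, day, hour, minute)
    except ValueError:
        # Transmission noise can produce impossible dates or times.
        return None
    return stamp, match.group(6)
```

The zone abbreviation is returned as a plain string rather than converted,
since the sketch makes no assumption about which zones occur in the data.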

Total size (compressed): 91MB German


Tagging Description:

The philosophy in the formatting has been to preserve as much of the
original structure as possible, but to provide enough consistency to allow
simple decoding of the data.  Although there are some regularities in some
of the header information (author, headline, etc.) that could be exploited,
such information may not be consistent.  Thus we have only attempted to
show structure that could be extracted with reasonable consistency.

The standard data structures are illustrated in the sample below.

<DOC>
<DOCID> dpger960701.0052 </DOCID>
<STORYID cat=c pri=u sel=dparo> x0061 </STORYID>
<FORMAT> &D3; &D1; </FORMAT>
<SLUG> BC-RUSSLAND </SLUG>
<HEADER> 0062   07-01 0085 </HEADER>
<PREAMBLE>
 &UR; BC-RUSSLAND, 0062 &QL; 
 &UR; Jelzin traf mit Tschernomyrdin zusammen  &QC; 
</PREAMBLE>
<TEXT>
<p>
   Moskau (dpa) - Der russische Präsident Boris Jelzin, über dessen
Gesundheitszustand es Spekulationen gibt, ist am Montag mit
Regierungschef Viktor Tschernomyrdin zusammengetroffen. Dies teilte
Kremlsprecher Sergej Medwedjew mit.
<p>
   Tschernomyrdin habe Jelzin über den G-7-Wirtschaftsgipfel in Lyon
informiert. Die beiden Politiker hätten auch die Lage in Rußland vor
der Stichwahl um das Präsidentenamt am kommenden Mittwoch erörtert.
dpa ba ln
</TEXT>
<TRAILER>
AP-NY-07-01-96 0502EDT &QL; 
</TRAILER>
</DOC>

Every document is bracketed by <DOC> </DOC> tags and has a unique document
number, bracketed by <DOCID> </DOCID> tags.  Each beginning tag starts as
the first character of a new line, but the ending tags could be on the same
line or on later lines.
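
A minimal Python sketch of how a file might be split into documents, relying
only on the conventions just described (tags at the start of a line, one
DOCID per DOC).  The names iter_docs and read_file are hypothetical and are
not part of any tool distributed with the corpus.

```python
import re

# Assumes <DOC> and </DOC> each begin a line, as described above.
DOC_RE = re.compile(r"^<DOC>[ \t]*\n(.*?)^</DOC>", re.MULTILINE | re.DOTALL)
# The closing </DOCID> may be on the same line or a later one.
DOCID_RE = re.compile(r"<DOCID>\s*(\S+)\s*</DOCID>")

def iter_docs(text):
    """Yield (docid, body) for each <DOC> ... </DOC> block in text."""
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docid = DOCID_RE.search(body)
        yield (docid.group(1) if docid else None, body)

def read_file(path):
    """Read one corpus file, decoding it as ISO 8859-1 (Latin1)."""
    with open(path, encoding="latin-1") as handle:
        return list(iter_docs(handle.read()))
```

Regular expressions are used here instead of an SGML or XML parser because
the <p> tags in the TEXT field have no closing tags.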

The next several fields are extracted from the standard newswire service
header:
   The STORYID is a code assigned by the newswire service; a STORYID that
recurs within the same day may indicate repeated, continued, or follow-up
articles.  The "cat", "pri", and "sel" attributes correspond to the wire
service's encoding of "category" and "priority", and a general "selector
code".
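
The STORYID attributes are written without quotation marks (cat=c, not
cat="c"), which a strict XML parser would reject.  The following Python
sketch shows one way to read them; parse_storyid is a hypothetical helper,
and the attribute syntax is assumed to match the sample above.

```python
import re

STORYID_RE = re.compile(r"<STORYID([^>]*)>\s*(.*?)\s*</STORYID>", re.DOTALL)
# Unquoted name=value pairs, e.g. cat=c pri=u sel=dparo.
ATTR_RE = re.compile(r"(\w+)=(\S+)")

def parse_storyid(doc_body):
    """Return a dict with the story id and its attributes, or None."""
    match = STORYID_RE.search(doc_body)
    if match is None:
        return None
    fields = dict(ATTR_RE.findall(match.group(1)))
    fields["id"] = match.group(2)
    return fields
```

Grouping the documents of one day by the returned "id" value is one way to
look for the repeated or continued articles mentioned above.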
   The FORMAT code is usually "&D3; &D1;" as above, but specially-formatted
articles may have different FORMAT codes.
   The "SLUG" is supposed to be a quick keyword summary with no spaces;
several articles on the same subject may have the same slug.  However,
especially in earlier material, we detected many malformed slugs; in these
cases, we omit the SLUG field and include the suspicious "slug" as the
first word of the HEADER field.
   The HEADER field contains several optional fields; this newswire does
not appear to use them very consistently.  The HEADER may therefore
contain: the first two or three words of the headline, one or more
approximate wordcounts (generally a 4-digit number), and one or more
repetitions of the date (usually number-hyphen-number, e.g. "5-1",
"10-31"), in addition to other information such as version or reference
fields.
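
Since the HEADER contents vary, any decoding is heuristic.  The Python sketch
below sorts the whitespace-separated tokens of a HEADER into wordcounts,
dates, and everything else, using only the patterns named above.  The
function name classify_header is hypothetical, and reading the date as
month-day is an assumption based on the sample ("07-01" in a document dated
960701).

```python
import re

WORDCOUNT_RE = re.compile(r"\d{4}")          # e.g. 0062
DATE_RE = re.compile(r"\d{1,2}-\d{1,2}")     # e.g. 5-1, 10-31, 07-01

def classify_header(header):
    """Sort HEADER tokens into wordcounts, dates, and other material."""
    fields = {"wordcounts": [], "dates": [], "other": []}
    for token in header.split():
        if DATE_RE.fullmatch(token):
            month, day = token.split("-")    # assumed month-day order
            fields["dates"].append((int(month), int(day)))
        elif WORDCOUNT_RE.fullmatch(token):
            fields["wordcounts"].append(int(token))
        else:
            # Headline words, malformed slugs, version or reference fields.
            fields["other"].append(token)
    return fields
```

Any 4-digit number is treated as a wordcount here, so other 4-digit fields
would be misclassified; the "other" list keeps whatever the two patterns do
not cover.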

The PREAMBLE is simply the first several lines of the body text, which
often contains a repeat of the slug and word-count, a headline, and
occasional bylines.  Detection of the boundary between this preliminary
information and the actual body of the article (TEXT) is not
straightforward, so users should expect some errors.  In particular, the
"PREAMBLE" occasionally contains a daily news summary, with little or no
interesting "TEXT" afterward: such articles tend to use non-standard
formatting, which makes it difficult to correctly identify the start of
the main body text.

Both in the PREAMBLE and in the TEXT, we have left intact several
formatting instructions, in the belief that they may indicate useful
information:
  &QL; &QC; &QR; special line enders
        All mean "hard line break" plus some formatting information
        (L=flush Left, C=Center, R=flush Right)
  &TL;  preformatted line
        Tabular or other preformatted material.  Line break at end of line
        should be kept.
  &UR;  "Upper Rail": boldface
        Appears in body text as well as headers.
  &LR;  end boldface
        Seems to be used only in body text.  In headers, perhaps bold ends
        at "&Q[LCR];".
For instance, headlines often, but not always, start with &UR; and end with
&QC;, while a repeated slug might end with &QL;.  In the body text,
emphasized text may start with &UR; and end with &LR;.
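
Users who want plain text can remove these codes.  The Python sketch below
strips the codes listed above from a line and picks out a likely headline
from a PREAMBLE by looking for a line that ends with &QC;.  Both function
names are hypothetical, and the headline rule is only the "often, but not
always" tendency just described, so it will miss some headlines and return
some lines that are not headlines.

```python
import re

# Special line enders: &QL; &QC; &QR; with any surrounding blanks.
LINE_END_RE = re.compile(r"[ \t]*&Q[LCR];[ \t]*")
# Boldface on/off and preformatted-line markers.
OTHER_CODE_RE = re.compile(r"&(?:UR|LR|TL);[ \t]*")

def strip_codes(line):
    """Remove the formatting codes listed above from one line."""
    line = LINE_END_RE.sub("", line)
    line = OTHER_CODE_RE.sub("", line)
    return line.strip()

def find_headline(preamble):
    """Return the first PREAMBLE line ending in &QC;, without codes."""
    for line in preamble.splitlines():
        if line.strip().endswith("&QC;"):
            return strip_codes(line)
    return None
```

Codes not listed above, such as those in the FORMAT field, are left
untouched.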

Both as part of the philosophy of leaving the data as close to the original
as possible, and because it is impossible to check all the data manually,
there are many "errors" in the data.  These range from errors in the
original data, such as noise in the newswire transmission, or other
typographical errors, to potential errors in the reformatting done at the
LDC.

The error-checking has concentrated on allowing readability of the data
rather than on correcting content.  This means that there have been
automated checks for control characters and other easy indicators of noise,
for correct matching of the beginning and end tags, and for complete DOC
and DOCID fields.  The types of "errors" remaining include fragment
sentences, strange formatting around tables or other "non-textual" items,
misspellings, mismatched or variant quotation marks, missing fields (which
are generally also missing from the source data), etc.
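
Users who modify the files may want to repeat checks of this kind.  The
Python sketch below looks for control characters and for mismatched begin
and end tags in one document.  It is an illustration only, not the procedure
used at the LDC; the function name check_doc and the choice of which
characters count as noise are assumptions.

```python
import re

# Control characters other than tab, newline, and carriage return.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Upper-case tags only, so the unclosed <p> tags in TEXT are ignored.
TAG_RE = re.compile(r"<(/?)([A-Z]+)[^>]*>")

def check_doc(text):
    """Return a list of problems found in one <DOC> ... </DOC> block."""
    problems = []
    if CONTROL_RE.search(text):
        problems.append("control character")
    stack = []
    for closing, name in TAG_RE.findall(text):
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected </{name}>")
    problems.extend(f"unclosed <{name}>" for name in stack)
    return problems
```

An empty list means only that these two checks passed; it says nothing about
the content errors described in this section.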

Note that in many instances, lines with tabular material begin with "&TL;"
as described above, but this may not always be the case.