Data and meta-data relevant to understanding the files in the Penn TreeBank (LDC Catalog entry LDC99T42) and the Penn Discourse TreeBank (LDC Catalog entry LDC2008T05) as texts can be found in the TIPSTER WSJ corpus (LDC Catalog entry LDC93T3A). (The same information can be found in the ACL/DCI corpus, LDC Catalog entry LDC93T1.) This information can be accessed indirectly using map files that pair each Penn TreeBank file name (e.g., wsj_0005) with its corresponding index in the TIPSTER WSJ corpus (e.g., 891031-0011). While these map files are publicly available via the LDC catalog entry for the Penn TreeBank, the LDC is now permitting us to make both the meta-data and the data directly available to current and future license holders of the Penn Discourse TreeBank and the Penn TreeBank.
The meta-data for a Penn TreeBank file comprise the set of header fields from its TIPSTER entry, explained as follows in the TIPSTER WSJ sample text:
Where the HL field goes over multiple lines in the original TIPSTER entry,
we have removed the line breaks. So where the TIPSTER file has:
<HL> Technology & Health:
@ Asbestos Once Used in Kent Filters Led
@ To Workers' Cancer Deaths, Group Says
@ ----
@ By Anne Newman
@ Staff Reporter of The Wall Street Journal </HL>
the HL field for file wsj_0003.meta appears as:
There are two sorts of data related to text structure:
Both the meta-data and the data can be of value to discourse researchers. The meta-data can, for example, enable the texts to be distinguished by genre (news reports, editorials, etc.) [Webber, 2009] or by topic [Petrenz and Webber, 2011]. These distinctions can then be used, for example, in text segmentation and text summarization, or in testing hypotheses about domain adaptation [Plank and van Noord, 2011]. The data, on the other hand, can allow researchers to distinguish separate texts within a single file (e.g., the four separate letters to the editor in file wsj_0105, or the two separate TIPSTER articles, each with its own meta-data, that were included in error in the same Penn TreeBank file, e.g., wsj_0814) and thereby avoid, for example, attempting to produce one summary for the entire file.
This new resource created from the TIPSTER files employs the same file structure and conventions used in the Penn TreeBank and the PDTB 2.0. The meta-data and data for a single Penn TreeBank / PDTB file (wsj_XXXX) reside in a corresponding file (wsj_XXXX.meta). All the files corresponding to a given section (XX) are in sub-directory XX. The tarball distribution contains the 25 sub-directories 00 through 24.
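The directory layout described above can be sketched in a few lines of Python. This is only an illustration: the function name `meta_path` and the root directory name are hypothetical, and the sketch assumes that the first two digits of the file number give the section sub-directory, as in the Penn TreeBank.

```python
from pathlib import Path


def meta_path(root, file_id):
    """Return the expected location of the .meta file for an id such as 'wsj_0105'.

    Assumption: the section sub-directory is the first two digits of the
    four-digit file number (00 through 24).
    """
    prefix, _, number = file_id.partition("_")
    if prefix != "wsj" or len(number) != 4 or not number.isdigit():
        raise ValueError(f"not a Penn TreeBank file id: {file_id!r}")
    section = number[:2]
    if int(section) > 24:
        raise ValueError(f"section out of range 00..24: {section}")
    return Path(root) / section / f"{file_id}.meta"


# Hypothetical root directory name, used here only for illustration.
print(meta_path("ptb_meta", "wsj_0105"))
```

The function only builds a path; it does not check that the file exists in a given copy of the distribution.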
Each individual file starts with the meta-data from its corresponding
article in the TIPSTER corpus, followed by a list headed SBREAKS
of the byte positions of section breaks present in the file. For
example:
DOCNO : 891102-0087.
DD : = 891102
AN : 891102-0087.
HL : Letters to the Editor:@ Brutal World of Life on the Streets
DD : 11/02/89
SO : WALL STREET JOURNAL (J)
SBREAKS : 1988..1989;2857..2858;3536..3537;4077..4078
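A minimal sketch of how such a header might be read is given below. It assumes that each field occupies one line of the form `NAME : value` and that SBREAKS holds semicolon-separated `start..end` byte ranges, as in the example above. The function names are hypothetical. Fields are kept as an ordered list of pairs rather than a dictionary, because a field name (here DD) can occur more than once.

```python
SAMPLE = """\
DOCNO : 891102-0087.
DD : = 891102
AN : 891102-0087.
HL : Letters to the Editor:@ Brutal World of Life on the Streets
DD : 11/02/89
SO : WALL STREET JOURNAL (J)
SBREAKS : 1988..1989;2857..2858;3536..3537;4077..4078
"""


def parse_fields(text):
    """Return (name, value) pairs for every 'NAME : value' line, in order."""
    fields = []
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:  # skip lines that do not look like a field
            fields.append((name.strip(), value.strip()))
    return fields


def parse_spans(value):
    """Turn '1988..1989;2857..2858' into [(1988, 1989), (2857, 2858)]."""
    spans = []
    for item in value.split(";"):
        if item.strip():
            start, _, end = item.partition("..")
            spans.append((int(start), int(end)))
    return spans


fields = parse_fields(SAMPLE)
section_breaks = [parse_spans(v) for name, v in fields if name == "SBREAKS"]
print(section_breaks[0])
```

How the byte ranges are to be applied to the text (for instance, whether the end position is inclusive) is not stated here, so the sketch only extracts the numbers.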
Those files that, in error, contain more than one article from the TIPSTER
corpus have two copies of the above data separated by ARTICLEBREAK, as for
example wsj_0545:
DOCNO : 891030-0156.
DD : = 891030
AN : 891030-0156.
HL : Canadian Pig Herd Shrinks
DD : 10/30/89
SO : WALL STREET JOURNAL (J)
CO : CANDA
DATELINE : OTTAWA
ARTICLEBREAK : 212..213
DOCNO : 891030-0155.
DD : = 891030
AN : 891030-0155.
HL : Who's News:@ American Federal Savings Bank of Duval County
DD : 10/30/89
SO : WALL STREET JOURNAL (J)
CO : AMJX WNEWS
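Files of this kind could be split into per-article headers along the following lines. This is a sketch under assumptions: each field is taken to sit on its own line, the ARTICLEBREAK line is treated as a separator belonging to neither article, and the function name `split_articles` is hypothetical.

```python
def split_articles(lines):
    """Group header lines into one list per article, split at ARTICLEBREAK.

    Returns (articles, breaks): a list of line lists, and the raw values of
    the ARTICLEBREAK fields that separated them.
    """
    articles = [[]]
    breaks = []
    for line in lines:
        name, sep, value = line.partition(" : ")
        if sep and name.strip() == "ARTICLEBREAK":
            breaks.append(value.strip())
            articles.append([])  # start collecting the next article
        elif line.strip():
            articles[-1].append(line.strip())
    return articles, breaks


# Abbreviated header lines taken from the wsj_0545 example.
example = [
    "DOCNO : 891030-0156.",
    "HL : Canadian Pig Herd Shrinks",
    "ARTICLEBREAK : 212..213",
    "DOCNO : 891030-0155.",
    "HL : Who's News:@ American Federal Savings Bank of Duval County",
]
articles, breaks = split_articles(example)
print(len(articles), breaks)
```

A file with no ARTICLEBREAK line yields a single article and an empty list of breaks, so the same function can be applied to every file.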