CONTENT AND PROPERTIES OF "SET_B" DATA (the Bellcore archive, part 1)
---------------------------------------------------------------------
This data set was received by the LDC in a "run-off" format,
originally intended as input to a common typesetting process
(i.e. something like "nroff" or "troff"). Each individual file stores
material in one language from one day of parliamentary proceedings.
The file name reflects the date and language of the content (with date
rendered as "YYMMDD", e.g. 880901_f for the French version of
proceedings on September 1, 1988).
The text files are partitioned into sub-directories according to the
two-digit year ("86", "87", "88"). There are two additional
directories, which provide mappings for parallel content in the text
data: "tokn_map" and "para_map". These will be explained below.
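
As a minimal sketch of the naming and directory conventions described
above, the following Python splits a file name into its date and
language code and derives the year sub-directory. The helper names
(parse_name, relative_path) are hypothetical, and the "e" code for
English is an assumption; only "f" is attested in this document.

```python
from datetime import date, datetime
from pathlib import PurePosixPath

# "f" appears in the example above; "e" for English is an assumption.
LANGUAGES = {"f": "French", "e": "English"}


def parse_name(name):
    """Split a name like '880901_f' into (date, language code)."""
    stamp, sep, lang = name.partition("_")
    if not sep or len(stamp) != 6 or not stamp.isdigit():
        raise ValueError(f"unexpected file name: {name!r}")
    # %y maps 69-99 to 1969-1999, so "88" is read as 1988.
    day = datetime.strptime(stamp, "%y%m%d").date()
    return day, lang


def relative_path(name):
    """Place a file under its two-digit year sub-directory."""
    return PurePosixPath(name[:2]) / name


day, lang = parse_name("880901_f")
print(day.isoformat(), LANGUAGES.get(lang, lang), relative_path("880901_f"))
# prints: 1988-09-01 French 88/880901_f
```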
The conversion to SGML format retained information about paragraph
boundaries and about the source language of various portions of text.
In addition, each paragraph within a file was assigned a sequential
index. The resulting SGML form is different from that of the "set_a"
data in two regards:
1. there are some lines in each file that contain only an SGML tag,
and no text data; these lines are all of the form:
where "#" represents the one- to three-digit sequence number
assigned to the paragraph; the first paragraph of every file is
identified by sequence number "1" -- that is, the paragraph index
values are NOT unique across files. Also, the paragraph id numbers
are not strictly sequential: there are occasional gaps in the
numbering of the paragraphs.
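
The two numbering properties just noted (indices restart in every
file, and gaps occur) matter to any program that consumes the
paragraph ids. A minimal sketch follows, assuming the ids of one file
have already been extracted into a list of integers; the function
names are hypothetical, and tag parsing is deliberately left out.

```python
def numbering_gaps(ids):
    """Return (previous, next) pairs where the sequence skips numbers."""
    return [(a, b) for a, b in zip(ids, ids[1:]) if b - a > 1]


def global_key(file_name, para_id):
    """Ids restart at 1 in every file, so pair each id with its file."""
    return (file_name, para_id)


print(numbering_gaps([1, 2, 3, 5, 6, 9]))
# prints: [(3, 5), (6, 9)]
```

Code that assumes consecutive ids (for example, using the id as a list
index) will misbehave on files with gaps; keying on the id itself, or
on the (file, id) pair, avoids that.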
With regard to the " ", and to a French word that is
centered between bytes 362 and 363 in the line that begins with the
tag " "; these two words establish a correspondence in the
translation. Byte offsets into the paragraph are counted from zero: the
first character of the line (the open-angle bracket "<") is at offset 0.
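
To illustrate the offset convention, here is a minimal sketch that
recovers the word lying around a given pair of adjacent byte offsets
in a paragraph line. It assumes words are whitespace-delimited, which
may not match the tokenization actually used for the mapping; the
function name and the sample line (including its placeholder tag) are
invented for the example.

```python
import re


def token_spanning(line, left, right):
    """Return the whitespace-delimited token that covers both byte
    offsets, or None. Offsets are zero-based from the start of the line."""
    for match in re.finditer(rb"\S+", line):
        if match.start() <= left and right < match.end():
            return match.group()
    return None


# Placeholder tag and text; "<" is at offset 0, "tout" occupies 14-17.
line = b"<tag> bonjour tout le monde"
print(token_spanning(line, 15, 16))
# prints: b'tout'
```

The line is handled as bytes rather than decoded text so that the
offsets stay valid whatever the character encoding of the file.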
To make the mapping information easier to use, all the map
files and text data for set_b have been published without compression.
This mapping does not cover all tokens in either language,
but it does serve to establish a large quantity of reference points
for lexical correspondences.
Note that Melamed set the parameters of his token-mapping algorithm
to trade some accuracy for greater execution speed when processing
this data set. A more careful
application of the method (especially with a cleaner version of the
source texts) would likely yield a better set of correspondences.
The token mapping and paragraph alignment processes revealed some
apparent corruptions in a subset of the text files. The origin of the
corruption is not known (it was already present in the materials
received by the LDC), but the symptom was a "mis-filing" of (portions
of) some proceedings. The "mis-filing" showed signs of being due to some
software malfunction, whereby a final portion of one text, starting at
some arbitrary position, was appended to the end of some other text;
sometimes this would result in the same text content appearing in two
files, and sometimes the appended material was from the other
language. Often, the appended material began in mid-sentence.
Surprisingly, there were many cases where both the English and French
files for a given session were found to contain appended material that
was likewise parallel in content, though the starting points of the
appended material were not well aligned.
We have tried to locate these corruptions in the SGML text files, and
to eliminate material that was fragmented, duplicated elsewhere, or
clearly unrelated to material in the corresponding file in the other
language. Some instances of these problems presumably remain.
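
Users who want to screen for residual duplication themselves could
start from something like the sketch below. It is not the procedure
used by the LDC; it simply flags paragraph texts that occur in more
than one file. The input layout (file name mapped to a list of
paragraph strings), the function name, the length threshold, the "_e"
file names, and the sample text are all assumptions made for the
example.

```python
from collections import defaultdict


def shared_paragraphs(files, min_chars=40):
    """Map each paragraph text found in two or more files to the
    sorted list of file names that contain it."""
    seen = defaultdict(set)
    for name, paragraphs in files.items():
        for text in paragraphs:
            key = " ".join(text.split())  # normalise whitespace
            if len(key) >= min_chars:  # skip short formulaic lines
                seen[key].add(name)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# Made-up toy data, not taken from the corpus.
toy = {
    "880901_e": ["This invented paragraph stands in for a long passage.",
                 "Agreed."],
    "880902_e": ["Agreed.",
                 "This invented  paragraph stands in for a long passage."],
}
for text, names in shared_paragraphs(toy).items():
    print(names, text)
# prints: ['880901_e', '880902_e'] This invented paragraph stands in for a long passage.
```

Exact matching of whole paragraphs will miss appended material that
begins in mid-sentence, so a fuller check would also need to compare
partial paragraphs.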
Because the token correspondences were computed before a number of
corruptions in the text files were discovered and corrected, it is
possible that some mapping files will contain ranges of false
correspondences. We have tried to identify and fix or remove faulty
mappings, but some residual errors are likely to have escaped notice.