Top-level README file for the Hansard Corpus
============================================

The Hansard Corpus consists of parallel texts in English and Canadian
French, originating from official records of the proceedings of the
Canadian Parliament.  While the content is therefore limited to
legislative discourse, it nonetheless spans a broad assortment of
topics, and the stylistic range includes spontaneous discussion and
some written correspondence along with legislative propositions and
prepared speeches.

The collection presented here has been assembled by the LDC by way of
archives received from two distinct secondary sources.  The primary
source for these materials was of course the Canadian Parliament, but
material from one time period of parliamentary proceedings was
acquired by the IBM T.J. Watson Research Center, while material from
another period was acquired by Bell Communications Research Inc.
(Bellcore).  The time period covered by the Bellcore archive is
reflected in the names of some data files in the archive, and spans
from April 1986 through December 1988.  The LDC was not given any
explicit indication of the time period covered by the IBM archive, but
a cursory scan of dates mentioned in these texts suggests that they
span a range from the mid-1970's to the mid-1980's, and that they do
not overlap in time with the Bellcore set.

Aside from covering different time periods, the two archives were
found to have quite different organization, and have undergone quite
different amounts and kinds of processing in being prepared as a
parallel language resource.  In addition, the Bellcore set itself
comprised two rather distinct types of data -- one appeared to be the
main parliamentary proceedings while the other consisted of committee
hearings -- and these two subsets required very different kinds of
processing to make them suitable for general use as parallel text.

For these reasons, three sets have been kept distinct in this
publication, and are identified simply as "set_a" (the IBM archive),
"set_b" (the Bellcore main proceedings), and "set_c" (the Bellcore
committee hearings).  Each set is described in greater detail in a
separate documentation file, "set_a.doc", "set_b.doc" and "set_c.doc".

The three sets have the following in common:

 - They are rendered here using the 8-bit ISO-Latin1 character encoding
standard.

 - They use a minimal amount of SGML tagging to identify sentences or
paragraphs (the markup will be described in detail below).

 - All sets are organized using a parallel file structure, in which
the content of a given English text file is matched by the content of
a corresponding French text file.

 - The SGML text files for set_a and set_c are published in compressed
form, by way of the freely available GNU Zip utility (gzip);
the compressed files all have ".gz" at the end of their file names.
(The set_b files are not compressed, for reasons explained in
set_b.doc.)

 - Each set tends to contain some very long lines of text, because the
publication format places one complete sentence, or one complete
paragraph, on a line (i.e. each sentence or paragraph is terminated by
a single line-feed (new-line) character - ASCII 0x0A (012 octal) -
with no line-feed or carriage-return characters internal to a sentence
or paragraph).  A general-purpose "line wrapping" script (written in
Perl) is provided that will break up long lines to a uniform width for
display or printing -- the script is in the file named "linewrap.pl"
in the doc directory.
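
As an illustration of the conventions listed above (gzip compression,
ISO-Latin1 encoding, one sentence or paragraph per line), the following
is a minimal Python sketch for reading one text file.  It is not part
of the published tools; the function name "iter_units" and the file
name in the usage note are hypothetical, chosen only for this example.

```python
import gzip
from typing import Iterator


def iter_units(path: str) -> Iterator[str]:
    """Yield each sentence or paragraph (one per line) from a text file.

    Assumption: files ending in ".gz" are gzip-compressed; all other
    files are read as plain text.  Both are decoded as ISO-Latin1.
    """
    opener = gzip.open if path.endswith(".gz") else open
    # newline="\n" splits on line-feed only, matching the stated format.
    with opener(path, mode="rt", encoding="latin-1", newline="\n") as handle:
        for line in handle:
            # Drop only the terminating line-feed; keep any markup as-is.
            yield line.rstrip("\n")
```

A call such as iter_units("sample_english.gz") (a hypothetical file
name) reads one line at a time, so very long lines do not require
loading the whole file into memory.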

UNIX users beware: line lengths in these files are likely to exceed
the internal buffer limits of some "standard" UNIX utilities such as
"awk", with the result that some lines will be truncated on output, or
the utilities will fail and exit with an error condition.  GNU
versions of these utilities are available -- e.g. "gawk" -- that avoid
this problem; also, the "line wrapping" Perl script mentioned above
can be used to make the data acceptable for programs with line-length
limits.
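
For readers without Perl, the same idea can be sketched with Python's
standard "textwrap" module.  This is not the linewrap.pl script and
makes no claim to match its behavior; the function name "wrap_unit"
and the default width of 72 are assumptions made for this example.

```python
import textwrap
from typing import List


def wrap_unit(line: str, width: int = 72) -> List[str]:
    """Break one long sentence or paragraph into shorter display lines.

    Words are never split, so a single token longer than `width`
    (for example a long markup tag) stays on a line of its own.
    """
    return textwrap.wrap(
        line,
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )
```

Note that wrapping adds line breaks inside a sentence or paragraph,
so wrapped output no longer follows the one-unit-per-line format and
is best kept for display, printing, or tools with line-length limits.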

Additional tools for working with general text and bitext (parallel)
data may be found at Dan Melamed's web page:

	http://www.cis.upenn.edu/~melamed/

together with additional technical reports and papers on bitext
processing and related fields.