Top-level README file for the Hansard Corpus
============================================

The Hansard Corpus consists of parallel texts in English and Canadian
French, originating from official records of the proceedings of the
Canadian Parliament. While the content is therefore limited to
legislative discourse, it nonetheless spans a broad assortment of
topics, and the stylistic range includes spontaneous discussion and
some written correspondence along with legislative propositions and
prepared speeches.

The collection presented here has been assembled by the LDC from
archives received from two distinct secondary sources. The primary
source for these materials was of course the Canadian Parliament, but
material from one time period of parliamentary proceedings was
acquired by the IBM T.J. Watson Research Center, while material from
another period was acquired by Bell Communications Research Inc.
(Bellcore).

The time period covered by the Bellcore archive is reflected in the
names of some data files in the archive, and spans from April 1986
through December 1988. The LDC was not given any explicit indication
of the time period covered by the IBM archive, but a cursory scan of
dates mentioned in these texts suggests that they span a range from
the mid-1970s to the mid-1980s, and that they do not overlap in time
with the Bellcore set.

Aside from covering different time periods, the two archives were
found to have quite different organization, and have undergone quite
different amounts and kinds of processing in being prepared as a
parallel language resource. In addition, the Bellcore set itself
comprised two rather distinct types of data -- one appeared to be the
main parliamentary proceedings while the other consisted of committee
hearings -- and these two subsets required very different kinds of
processing to make them suitable for general use as parallel text.
For these reasons, three sets have been kept distinct in this
publication, and are identified simply as "set_a" (the IBM archive),
"set_b" (the Bellcore main proceedings), and "set_c" (the Bellcore
committee hearings). Each set is described in greater detail in a
separate documentation file, "set_a.doc", "set_b.doc" and "set_c.doc".

In terms of what the three sets have in common:

 - They are rendered here using the 8-bit ISO-Latin1 character
   encoding standard.

 - They use a minimal amount of SGML tagging to identify sentences or
   paragraphs (the markup will be described in detail below).

 - All sets are organized using a parallel file structure, in which
   the content of a given English text file is matched by the content
   of a corresponding French text file.

 - The SGML text files for set_a and set_c are published in
   compressed form, by way of the public-domain (freeware) GNU-Zip
   utility (gzip); the compressed files all have ".gz" at the end of
   their file names. (The set_b files are not compressed, for reasons
   explained in set_b.doc.)

 - Each set tends to contain some very long lines of text, because
   the publication format places one complete sentence, or one
   complete paragraph, on a line (i.e. each sentence or paragraph is
   terminated by a single line-feed (new-line) character - ASCII 0x0A
   (012 octal) - with no line-feed or carriage-return characters
   internal to a sentence or paragraph).

A general-purpose "line wrapping" script (written in Perl) is provided
that will break up long lines to a uniform width for display or
printing -- the script is in the file named "linewrap.pl" in the doc
directory.

UNIX users beware: line lengths in these files are likely to exceed
the internal buffer limits of some "standard" UNIX utilities such as
"awk", with the result that some lines will be truncated on output, or
the utilities will fail and exit with an error condition. GNU versions
of these utilities are available -- e.g.
"gawk" -- that avoid this problem; also, the "line wrapping" Perl script mentioned above can be used to make the data acceptable for programs with line-length limits. Additional tools for working with general text and bitext (parallel) data may be found at Dan Melamed's web page: http://www.cis.upenn.edu/~melamed/ together with additional technical reports and papers on bitext processing and related fields.