CONTENT AND PROPERTIES OF "SET_C" DATA (the Bellcore archive, part2)
--------------------------------------------------------------------

The original form of this data set, drawn from transcripts of various
parliamentary committee meetings, was a set of five directories, each
containing a set of consecutively numbered files.  Pairs of files that
were adjacent in sequence within each directory were found to have
parallel content; the files contained a typesetting format different
from the set_b collection, along with a fairly regular pattern of
introductory material at the beginning of each file, giving some
description about the file content and/or about the committee
personnel and the nature of the committee meeting; also, files
typically ended with a list of the names of persons present at the
meeting.

Within the main text portion of each file, change of speaker was
regularly marked in a typical manner, as illustrated here:

    Mr. Smith: I would like to ask...

    Mr. Jones: I would say that...

When one speaker went on at length without interruption, the text was
typically broken up into separate paragraphs with the speaker's name
appearing only at the start of the first paragraph.

There was never any indication as to the original language being
spoken during the meetings, but more than half of the files were found
to contain alternating segments of English and French -- that is, by
looking at just one file in a pair, one would gather that the people
in the meeting were switching back and forth between languages over
the course of the session.  When looking at the other file in the
pair, the same series of alternating segments was found, but with the
languages inverted.

In order to publish this collection in a manner comparable to the
other data sets, we first renamed the files so the two members of a
pair would have the same file number, followed by "_e" or "_f".
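
The renaming step can be pictured with a short Python sketch.  It is
not the procedure that was actually used.  It assumes, hypothetically,
that the files in a directory are identified by consecutive integers,
that adjacent numbers (1,2), (3,4), ... form the pairs, and that the
caller supplies a function saying which member of a pair should get
the "_e" suffix.  The three-digit zero padding is also an assumption.

```python
def paired_names(file_numbers, first_is_english=lambda number: True):
    """Map consecutively numbered files to shared-number names.

    Hypothetical helper: adjacent numbers are treated as one pair, and
    both members receive the same pair number plus "_e" or "_f".
    """
    ordered = sorted(file_numbers)
    if len(ordered) % 2 != 0:
        raise ValueError("expected an even number of files")

    mapping = {}
    for offset in range(0, len(ordered), 2):
        first, second = ordered[offset], ordered[offset + 1]
        pair_number = offset // 2 + 1
        # The caller decides which member of the pair is labelled English.
        if first_is_english(first):
            suffixes = ("_e", "_f")
        else:
            suffixes = ("_f", "_e")
        mapping[first] = f"{pair_number:03d}{suffixes[0]}"
        mapping[second] = f"{pair_number:03d}{suffixes[1]}"
    return mapping
```

For example, paired_names([1, 2, 3, 4]) assigns files 1 and 2 the pair
number 001 and files 3 and 4 the pair number 002.
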
While we had no indication of the purpose (if any) for the original
division of files into five directories, we chose to retain this
division of the data.

We then filtered out the typesetting markup to produce an SGML format
that is basically equivalent to set_b; that is, the main text content
is tagged at paragraph boundaries, the tag and full text of each
paragraph are presented on a single line, and the paragraphs are all
identified by a sequential index number.  (Note that the term
"paragraph boundary" here includes the notion of speaker turn
boundaries as well.)  There are a few notable differences from the
set_b SGML format:

 - The paragraph numbers are strictly sequential; the numbering was
   assigned at the last stage before publication.

 - There are no tags anywhere in the text, since this information was
   not reliably available.

 - Every file begins with a tag, followed by a tag; the "header"
   portion of each file contains the introductory material that
   preceded the text data in the original format; this section is
   terminated with a tag, followed immediately by the initial
   paragraph tag and text, "...", on the very next line.

 - The last paragraph tag and text in the file is followed by a ""
   tag.

Some of the more arcane (or possibly corrupted) forms of the original
typesetting markup may have been left untreated by this stage of
filtering; some word tokens may be found with digit characters or
outlining indexes ("i","ii",etc) attached.  The lists of names at the
end of each file were not given any special treatment -- they appear
as a series of paragraphs with one name and title per paragraph.

After conversion to SGML, each file was passed through a simple
algorithm to identify the language used in each paragraph.  Lists of
paragraph id's and their corresponding language id's were then checked
by both automatic and manual means to determine which files contained
language cross-overs, and to establish the consistency of cross-over
points in related pairs.  This stage involved some manual corrections
to file contents, some deletion of paragraphs or regions where
parallelism was disrupted, and elimination of some file pairs from the
collection because of their general difficulty or unsuitability.

Once the files with cross-overs were reconciled so that their
cross-over points were reasonably assured to be equivalent and
parallel, the corresponding file pairs were submitted to a
"de-shuffling" process that placed all the English text in one output
file and all the French in another.  As a final step, the paragraph
id's in these output files were reset to be strictly sequential.

Relatively less attention was paid to the files that did not contain
language cross-overs, but all files were checked to determine that the
relative sizes of corresponding pairs were reasonably consistent, and
that there were no duplicated files in the set.

As with the set_b data, there are cases where paragraph divisions are
not made the same way in the two files of a pair, and cases where one
or more paragraphs in one file do not appear in the other file.
Based on the overall results of the "de-shuffling" process, however, it is likely that the great majority of the set_c data will show very consistent alignment at the paragraph level.
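
The language identification and "de-shuffling" steps described above
can be illustrated with a minimal Python sketch.  The algorithm that
was actually used is not documented here, so everything below is an
assumption: the tiny hint-word lists, the function names
guess_language and deshuffle, and the choice to stop with an error
wherever a paragraph pair is ambiguous (standing in for the manual
review mentioned above).  The sketch also assumes the two files of a
pair have already been reconciled to the same number of paragraphs.

```python
import re

# Very small illustrative word lists (assumed, not from the corpus tools).
ENGLISH_HINTS = {"the", "and", "of", "that", "is", "would", "to", "in", "i"}
FRENCH_HINTS = {"le", "la", "les", "et", "des", "que", "est", "je", "une", "dans"}


def guess_language(paragraph):
    """Return "e", "f", or None when the hint-word counts are tied."""
    words = re.findall(r"[a-zà-ÿ]+", paragraph.lower())
    english = sum(word in ENGLISH_HINTS for word in words)
    french = sum(word in FRENCH_HINTS for word in words)
    if english == french:
        return None
    return "e" if english > french else "f"


def deshuffle(file_a, file_b):
    """Split two parallel paragraph lists into English and French lists.

    Each input is a list of paragraph strings in which the languages
    alternate, with the second file carrying the inverted pattern.
    Returns two lists of (paragraph_id, text) with ids reset to 1..n.
    """
    if len(file_a) != len(file_b):
        raise ValueError("files must have the same number of paragraphs")

    english, french = [], []
    for index, (para_a, para_b) in enumerate(zip(file_a, file_b), start=1):
        lang_a, lang_b = guess_language(para_a), guess_language(para_b)
        # Ambiguous or same-language pairs are left for manual review.
        if lang_a is None or lang_b is None or lang_a == lang_b:
            raise ValueError(f"paragraph pair {index} needs manual review")
        if lang_a == "e":
            english.append(para_a)
            french.append(para_b)
        else:
            english.append(para_b)
            french.append(para_a)

    # Final step: reset paragraph ids to be strictly sequential.
    return list(enumerate(english, start=1)), list(enumerate(french, start=1))
```

Counting hint words is only one of many possible ways to guess a
paragraph's language; short paragraphs such as the name lists at the
end of each file would defeat it and would need separate handling.
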