CONTENT AND PROPERTIES OF "SET_B" DATA (the Bellcore archive, part 1)
---------------------------------------------------------------------

This data set was received by the LDC in a "run-off" format, originally
intended as input to a common typesetting process (i.e. something like
"nroff" or "troff").  Each individual file stores material in one
language from one day of parliamentary proceedings.  The file name
reflects the date and language of the content (with the date rendered
as "YYMMDD", e.g. "880901_f" for the French version of the proceedings
of September 1, 1988).  The text files are partitioned into
sub-directories according to the two-digit year ("86", "87", "88").

There are two additional directories, "tokn_map" and "para_map", which
provide mappings for parallel content in the text data; these are
explained below.

The conversion to SGML format retained information about paragraph
boundaries and about the source language of various portions of text.
In addition, each paragraph within a file was assigned a sequential
index.  The resulting SGML form differs from that of the "set_a" data
in two regards:

1. Some lines in each file contain only an SGML tag and no text data;
   these lines are all of the form:

       <language English>

   or:

       <language French>

   These tags indicate the reported source language of the subsequent
   material.

2. The lines containing text all begin with the pattern:

       <paragraph #>
   where "#" represents the one- to three-digit sequence number
   assigned to the paragraph; the first paragraph of every file is
   identified by sequence number "1" -- that is, the paragraph index
   values are NOT unique across files.  Also, the paragraph id numbers
   are not strictly sequential: there are occasional gaps in the
   numbering of the paragraphs.

With regard to the "<language ...>" information, this is presented in
identical form in both the English and French files; if a given passage
of parallel text is preceded by "<language English>" in the English
version, it will also be preceded by that same tag in the French
version, and the French version is therefore understood to be a
translation from the English.

After converting the original typesetting format to SGML form, it was
discovered that the extent of parallelism in file content, while
generally quite good, was disrupted by minor discrepancies in paragraph
divisions and by the inclusion versus exclusion of short
(single-sentence) paragraphs.  As a result, corresponding files were
found to contain different numbers of paragraphs, and among the indexed
paragraphs in a given English file there was a small subset for which
the corresponding French file did not contain a matching paragraph, and
vice versa.

Given this situation, some additional processing of the "set_b" data
was performed by Dan Melamed at the University of Pennsylvania, in
order to establish a finer-grained parallelism among the file contents.
His application of a "token mapping" algorithm yielded two forms of
parallel indexing: paragraph alignment and token correspondence.

-- Paragraph alignment of the "set_b" data

The "para_map" directory contains one file for each English/French text
file pair (e.g. "para_map/880901_p.map" for "88/880901_e" and
"88/880901_f").
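The naming conventions above are regular enough to compute.  As an
illustration only, a small helper (hypothetical; not part of the
distribution, and the function name is my own) could build the expected
relative paths for one day's proceedings:

```python
# Hypothetical helper illustrating the set_b naming scheme described
# above; it is not part of the LDC distribution.
from datetime import date

def hansard_paths(day: date, language: str) -> dict:
    """Build the expected relative paths for one day's proceedings.

    language: "e" for English or "f" for French (the suffixes used
    by the archive's file names).
    """
    stamp = day.strftime("%y%m%d")      # e.g. "880901"
    year = stamp[:2]                    # two-digit year sub-directory
    return {
        "text": f"{year}/{stamp}_{language}",     # e.g. "88/880901_f"
        "para_map": f"para_map/{stamp}_p.map",    # paragraph alignment
        "tokn_map": f"tokn_map/{stamp}_t.map",    # token alignment
    }

paths = hansard_paths(date(1988, 9, 1), "f")
print(paths["text"])        # 88/880901_f
```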
Each file contains two columns separated by a tab character; the first
column contains one or more paragraph index numbers from the English
file, and the second contains one or more index numbers from the French
file.  If there are two or more numbers within a single column, they
are separated by commas (e.g. "81	81,82").  The two columns on a given
line represent paragraphs with presumed parallel content.  In some
cases, two or more paragraph indexes are given in one column to
indicate that two or more paragraphs in one language were rendered as a
single paragraph in the other.  It is also possible that a sequence of
one or more paragraphs in one or both files may be skipped in the
indexing.  The paragraph alignments were actually produced as a
derivation from the token alignment process described below.

-- Token alignment of the "set_b" data

The general strategy and approach of token mapping are described in
greater detail in papers by Dan Melamed (cf. the PostScript files
included in the "doc" directory of the CD-ROM).

The "tokn_map" directory contains one file for each English/French text
file pair (e.g. "tokn_map/880901_t.map" for "88/880901_e" and
"88/880901_f").  Each file contains two columns per line, separated by
a tab character.  Each column consists of a string containing a
paragraph id and the byte offset to a word token within that paragraph;
these two values are separated by a colon.  The first column refers to
the English file, and the second to the French.  For example, the
following line from a token map file:

    35:299.0	34:362.5

refers to an English word that is centered at 299 bytes into the line
that begins with the tag "<paragraph 35>", and to a French word that is
centered between bytes 362 and 363 in the line that begins with the tag
"<paragraph 34>"; these two words establish a correspondence in the
translation.  The byte offset into the paragraph is based on the first
character of the line (the open-angle bracket "<") being at offset 0.

In order to make it easier to use the mapping information, all the map
files and text data for set_b have been published without compression.

Obviously, this mapping does not cover all tokens in either language,
but it does establish a large number of reference points for lexical
correspondences.  It should be pointed out that Melamed set parameters
in his token-mapping algorithm to trade off some accuracy for greater
execution speed when treating this data set.  A more careful
application of the method (especially with a cleaner version of the
source texts) would likely yield a better set of correspondences.

The token mapping and paragraph alignment processes revealed some
apparent corruptions in a subset of the text files.  The origin of the
corruption is not known (it appeared in the materials received by the
LDC), but the symptom appeared as a "mis-filing" of (portions of) some
proceedings.  The "mis-filing" showed signs of being due to some
software malfunction, whereby a final portion of one text, starting at
some arbitrary position, was appended to the end of some other text;
sometimes this would result in the same text content appearing in two
files, and sometimes the appended material was from the other language.
Often, the appended material began in mid-sentence.  Surprisingly,
there were many cases where both the English and French files for a
given session were found to contain appended material that was likewise
parallel in content, though the starting points of the appended
material were not well aligned.  We have tried to locate these
corruptions in the SGML text files, and to eliminate material that was
fragmented, duplicated elsewhere, or clearly unrelated to material in
the corresponding file in the other language.
Presumably, some instances of these problems may remain. Because the token correspondences were computed before a number of corruptions in the text files were discovered and corrected, it is possible that some mapping files will contain ranges of false correspondences. We have tried to identify and fix or remove faulty mappings, but some residual errors are likely to have escaped notice.
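The two map-file formats described above (tab-separated columns, with
comma-separated paragraph indexes in "para_map" files and
"paragraph-id:byte-offset" pairs in "tokn_map" files) are simple to
read.  The sketch below is illustrative only; the function names and
returned structures are my own, but the line formats parsed are as
documented:

```python
# Minimal readers for the set_b map-file line formats.  Each line has
# two tab-separated columns: English on the left, French on the right.

def parse_para_map_line(line: str):
    """Parse one para_map line, e.g. "81\t81,82" -> ([81], [81, 82]).

    Each column holds one or more comma-separated paragraph index
    numbers (English file indexes first, then French).
    """
    eng, fre = line.rstrip("\n").split("\t")
    return ([int(i) for i in eng.split(",")],
            [int(i) for i in fre.split(",")])

def parse_tokn_map_line(line: str):
    """Parse one tokn_map line, e.g. "35:299.0\t34:362.5"
    -> ((35, 299.0), (34, 362.5)).

    The offset is the byte position of the token's center within the
    paragraph line, counting the opening "<" of the tag as offset 0;
    a ".5" fraction means the center falls between two bytes.
    """
    pairs = []
    for col in line.rstrip("\n").split("\t"):
        para, offset = col.split(":")
        pairs.append((int(para), float(offset)))
    return tuple(pairs)

print(parse_para_map_line("81\t81,82"))           # ([81], [81, 82])
print(parse_tokn_map_line("35:299.0\t34:362.5"))  # ((35, 299.0), (34, 362.5))
```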