CONTENT AND PROPERTIES OF "SET_A" DATA (the IBM archive)
--------------------------------------------------------

This data set is the result of a process carried out at IBM that
sought to establish parallelism at the sentence level.  It is
therefore organized as a series of sentences, identified by sequential
numeric indices, with corresponding files in English and French
containing the corresponding sets of sentences.  The initial sentence
index (in files a_000_e and a_000_f) is "1", and the last sentence
index (in files a_286_e and a_286_f) is "2869040".

The sentence numbering establishes the parallelism -- two sentences
having the same number are purported to be parallel in content.  This
parallelism of sentence numbering does not hold true for the Bellcore
sets, set_b and set_c; in those files, the tagging and numbering
applies only to paragraphs, not to sentences (though many paragraphs
contain only one sentence), and paragraphs with the same number are
_NOT_ purported to have parallel content (though they may happen to be
parallel in many cases).

Each set_a file contains up to 10,000 sentences (a handful of
sentences have been dropped from the overall sequence, so some files
may contain one or two sentences fewer than the rest).  The three
digits in the file name represent the first three digits of the
sentence indices contained in the file.  (Leading zeros are applied
for the first 100 file names, though these leading zeros are not used
in the sentence indexing within the files).  The full set of files has
also been partitioned into three sub-directories, named according to
the first three characters of each file name ("a_0", "a_1", "a_2").

The process that established parallelism at the sentence level had the
side effect of eliminating other information, such as paragraph
boundaries, dates of sessions, and source language of the original
material (i.e. which language was used in Parliament for a given
sentence, and which language represents the translation).  These forms
of information have been retained in the Bellcore sets.

Each line of each file begins with an SGML tag of the form:

where "#" represents a one- to seven-digit sentence index number.
Following the closing angle bracket is a space character and then the
sentence itself, which consists of tokens separated by single space
characters.  Tokens may be words, abbreviations, numerics, or ellipses
(...), along with various forms of punctuation, quotation and
parentheses, as would be found in common written usage.  Each sentence
is terminated by a single line-feed character.
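
The naming and line conventions described above can be sketched in a
few lines of Python.  This is only an illustration under stated
assumptions, not a tool distributed with the archive: the function
names (set_a_path, parse_line, pair_by_index) are hypothetical, the
file lookup assumes that "first three digits" refers to the index
written as seven digits with leading zeros, and the parser makes no
assumption about the exact tag text beyond what is stated here (it
opens with "<", its last run of digits is the sentence index, and it
closes with ">" followed by one space).

```python
import re

# From the text: each set_a file holds up to 10,000 sentences.
SENTENCES_PER_FILE = 10_000


def set_a_path(index, lang):
    """Return the relative path of the set_a file expected to hold `index`.

    `lang` is "e" (English) or "f" (French).  Assumption: the three digits
    in the file name are the index divided by 10,000, zero-padded.
    """
    if index < 1:
        raise ValueError("sentence indices start at 1")
    if lang not in ("e", "f"):
        raise ValueError("lang must be 'e' or 'f'")
    name = "a_%03d_%s" % (index // SENTENCES_PER_FILE, lang)
    # Sub-directories are named after the first three characters of the name.
    return name[:3] + "/" + name


def parse_line(line):
    """Split one line into (sentence_index, tokens).

    The tag is taken to be everything before the first "> "; the last run
    of digits inside it is read as the sentence index (an assumption).
    """
    tag, sep, sentence = line.rstrip("\n").partition("> ")
    digits = re.findall(r"\d+", tag)
    if not sep or not tag.startswith("<") or not digits:
        raise ValueError("line does not start with an indexed tag")
    # Tokens are separated by single space characters.
    return int(digits[-1]), sentence.split(" ")


def pair_by_index(english_lines, french_lines):
    """Yield (index, english_tokens, french_tokens) for shared indices."""
    french = dict(parse_line(line) for line in french_lines)
    for line in english_lines:
        index, tokens = parse_line(line)
        if index in french:
            yield index, tokens, french[index]
```

For example, set_a_path(2869040, "f") gives "a_2/a_286_f", which matches
the last file named above.  When trying out parse_line, any stand-in tag
such as "<x 42>" will do; that tag is a placeholder for testing and is
not the tag actually used in the files.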