|LDC Catalog No.:||LDC93T1|
|Data Source(s):||journal articles, dictionaries, newswire|
|Application(s):||natural language processing, language modeling, information retrieval|
|Online Documentation:||LDC93T1 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||ACL/DCI LDC93T1. Web Download. Philadelphia: Linguistic Data Consortium, 1993.|
The original texts came in many different formats; all have been mapped, to one extent or another, into a markup language consistent with the SGML standard (ISO 8879).
The format of the material from the Wall Street Journal uses a labelled bracketing, expressed in the style of SGML, although no formal SGML DTD is provided. The tag set has been modified by turning the Dow Jones header categories into tags and by creating ad hoc tags such as "". The original datelines are presented as separate text units; the text is divided and tagged into paragraphs and sentences with each sentence presented on a single line. Nothing has been done to modify the typographical methods used to subdivide headlines and stories into sections, nor are any of the text features within sentences (quotes, ellipsis, etc.) normalized.
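Because each sentence sits on its own line inside SGML-style tags, plain sentences can be recovered with a simple line-oriented pass. The following is a minimal Python sketch under stated assumptions: the tag names `<p>` and `<s>` in the sample are hypothetical placeholders (the actual tag set is not listed here), and the function name `sentences_from_lines` is illustrative only. It strips anything that looks like a start or end tag and leaves the text inside sentences untouched, in keeping with the note that quotes and ellipses are not normalized.

```python
import re

# Matches any SGML-style start or end tag, such as <p> or </p>.
# This is a generic pattern, not a parser for a specific DTD (none is provided).
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def sentences_from_lines(lines):
    """Yield one plain-text sentence per non-empty line, with tags removed."""
    for line in lines:
        text = TAG_RE.sub("", line).strip()
        if text:  # skip lines that held only tags
            yield text


# Hypothetical sample: tag names are assumptions made for illustration.
sample = [
    "<p>",
    "<s> Stocks rose sharply on Monday . </s>",
    "<s> Traders cited lower rates ... </s>",
    "</p>",
]
result = list(sentences_from_lines(sample))
```

A regular expression is adequate here only because the markup is a flat labelled bracketing with one sentence per line; material with nested or attribute-heavy markup would call for a real SGML parser.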
The Collins English Dictionary is present in two forms. One form was approximately parsed into fielded records as an exercise in learning a language called "FIT", by a student working under the direction of Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990. The original digital image of the typographer's tape from which the database version was prepared had serious flaws that were not detected and corrected until later; the corrected version, a clean typographer's tape, is presented in a separate directory. A properly analyzed database version will be provided in the future. The documentation includes notes developed during the new attempt to analyze the tape from scratch.
The Department of Energy abstracts reside in files of approximately one megabyte each. The original 950 separators have been replaced with newlines, and the space padding between articles has been removed. An acronym dictionary, extracted from the database as an indication of the material's topic areas, is included in a separate directory.
Provisional material from the Penn Treebank project is divided into two subdirectories on this disk. The subdirectory "postext" contains text with part-of-speech annotations; "parstext" contains text with syntactic bracketing.
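The two annotation styles can be read with very little code. The Python sketch below rests on assumptions not stated here: that the part-of-speech text uses `word/TAG` tokens separated by whitespace, and that the syntactic bracketing uses nested parentheses with a label as the first item. The function names `split_tagged` and `parse_brackets` and the sample strings are hypothetical; check the files in "postext" and "parstext" before relying on either format.

```python
def split_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    Splits on the last slash so that words containing '/' keep their text.
    Tokens without a slash are skipped.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
    return pairs


def parse_brackets(text):
    """Parse a parenthesized bracketing into nested lists of strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[0] collects the top-level constituents
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


# Hypothetical samples, not taken from the corpus.
tagged = split_tagged("The/DT market/NN rose/VBD ./.")
tree = parse_brackets("(S (NP The market) (VP rose))")
```

The bracket parser uses an explicit stack rather than recursion so that deeply nested sentences do not hit Python's recursion limit.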