ACL/DCI

Item Name:	ACL/DCI
Author(s):	Linguistic Data Consortium
LDC Catalog No.:	LDC93T1
ISBN:	1-58563-000-4
ISLRN:	663-248-563-590-7
DOI:	https://doi.org/10.35111/vdfv-av77
Member Year(s):	1993
DCMI Type(s):	Text
Data Source(s):	dictionaries, journal articles, newswire
Project(s):	GALE, TIDES
Application(s):	information retrieval, language modeling, natural language processing
Language(s):	English
Language ID(s):	eng
License(s):	ACL/DCI Agreement
Online Documentation:	LDC93T1 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Linguistic Data Consortium. ACL/DCI LDC93T1. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
Related Works:	hasAnnotation LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1; hasContinuation LDC95T7 Treebank-2, LDC99T42 Treebank-3

Introduction

The ACL Data Collection Initiative (ACL/DCI) corpus contains text from the Wall Street Journal, the Collins English Dictionary, scientific abstracts provided by the U.S. Department of Energy, and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes.

Data

The many formats of the original texts have been mapped into a markup language consistent with the SGML standard (ISO 8879).

The format of the material from the Wall Street Journal uses a labelled bracketing, expressed in the style of SGML, although no formal SGML DTD is provided. The tag set has been modified by turning the Dow Jones header categories into tags and by creating ad hoc tags such as "". The original datelines are presented as separate text units; the text is divided and tagged into paragraphs and sentences with each sentence presented on a single line. Nothing has been done to modify the typographical methods used to subdivide headlines and stories into sections, nor are any of the text features within sentences (quotes, ellipsis, etc.) normalized.
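Because no DTD is supplied, the Wall Street Journal files cannot be validated by an SGML parser, but the one-sentence-per-line layout can still be read with simple pattern matching. The following is a minimal sketch in Python under stated assumptions: the tag names `p` and `s` and the sample text are hypothetical placeholders, not the corpus's actual tag set, so consult the LDC93T1 documentation for the real names.

```python
import re

# Matches an SGML-style element whose start and end tags share a name.
# The tag names used in this sketch are hypothetical, not taken from the corpus.
ELEMENT = re.compile(
    r"<(?P<tag>[A-Za-z][\w.-]*)>(?P<body>.*?)</(?P=tag)>", re.DOTALL
)


def iter_elements(text):
    """Yield (tag, body) pairs for each outermost tagged element in text."""
    for match in ELEMENT.finditer(text):
        yield match.group("tag"), match.group("body").strip()


def sentences(text, sentence_tag="s"):
    """Return sentence strings, assuming one tagged sentence per line."""
    found = []
    for line in text.splitlines():
        for tag, body in iter_elements(line):
            if tag == sentence_tag:
                found.append(body)
    return found


# Hypothetical sample; in-sentence features (quotes, ellipsis) are left as-is.
sample = '<p>\n<s>First sentence.</s>\n<s>Second "quoted" sentence...</s>\n</p>\n'

if __name__ == "__main__":
    for sentence in sentences(sample):
        print(sentence)
```

The sketch deliberately leaves quotes and ellipses untouched, since the corpus itself does not normalize text features within sentences.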

The Collins English Dictionary is present in two forms. One form was approximately parsed into fielded records as an exercise in learning a language called "FIT", by a student working under the direction of Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990. The original digital image of the typographer's tape from which this database version was prepared had serious flaws that were not detected and corrected until later; the corrected version, a clean typographer's tape, is presented in a separate directory. A properly analyzed database version will be provided in the future. The documentation includes notes developed during the new attempt to analyze the tape from scratch.

The Department of Energy abstracts reside in files of approximately one megabyte each. The original 950 separators have been replaced with newlines, and the space padding between articles has been removed. An acronym dictionary, extracted from the database as an indication of the material's topic areas, is included in a separate directory.

Provisional material from the Penn Treebank project is divided into two subdirectories on this disk. The subdirectory "postext" contains text with part-of-speech annotations; "parstext" contains text with syntactic bracketing.
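As an illustration of working with syntactic bracketing, the sketch below parses a parenthesized labelled bracketing into nested lists and collects the tagged words. It assumes the parenthesized style used in later Treebank releases; the provisional "parstext" files on this disk may use a different notation, and the sample sentence and its labels are hypothetical.

```python
def parse_brackets(text):
    """Parse a parenthesized bracketing into nested lists of strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[0] collects the top-level constituents
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced closing bracket")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced opening bracket")
    return stack[0]


def leaves(node):
    """Collect (label, word) pairs from nodes shaped like ['NN', 'dog']."""
    if not isinstance(node, list):
        return []
    if len(node) == 2 and all(isinstance(item, str) for item in node):
        return [(node[0], node[1])]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


# Hypothetical example; the labels are illustrative only.
sample = "(S (NP (DT the) (NN dog)) (VP (VBD barked)))"

if __name__ == "__main__":
    print(leaves(parse_brackets(sample)))
```

A stack-based parser is used here instead of recursion so that deeply nested sentences do not hit Python's recursion limit during parsing.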
