ASSOCIATION FOR COMPUTATIONAL LINGUISTICS
DATA COLLECTION INITIATIVE

September, 1991

The ACL Data Collection Initiative, as announced in The FINITE STRING, Volume 15, Number 1, March 1989, was founded "to oversee the acquisition and preparation of a large text corpus to be made available for scientific research at cost and without royalties."

Towards this goal, the ACL/DCI has acquired several hundred million words of text, has modified much of it so as to make it more accessible for research purposes, and has distributed tapes containing portions of this data to more than 40 research sites.

Initially run with volunteer labor, using borrowed disk space and computer time, the ACL/DCI has recently received grants from the General Electric Company and from the National Science Foundation (IRI-9113530). These grants have permitted us to acquire some of our own equipment, and will permit us to hire a student research assistant to expedite the cleanup of the data we have acquired and to service requests for data more promptly.

As our main method of distribution, we will produce a series of CD-ROMs, of which this is the first. The costs of mastering, pressing and packaging this first CD-ROM were paid by Dragon Systems Inc.

The many formats in which the originals of these texts came have all, to one extent or another, been mapped into a markup language consistent with the SGML standard (ISO 8879). SGML provides a labelled bracketing of the text, with labels permitted to have associated feature-value pairs. Eventually, the ACL/DCI texts will be furnished with tags conforming to the Text Encoding Initiative standards. Because of time constraints, the files in this initial release do not yet conform, and are therefore likely to be re-released in a conformant state. The ACL/DCI welcomes help in establishing "proper" SGML coding for all of its collection.
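
As a rough illustration of what "labelled bracketing with feature-value pairs" means in practice, the following Python sketch reads a small, fully bracketed fragment into a nested structure. It is not the ACL/DCI's own tooling. The sample text, the element names (doc, hl, p) and the feature name (id) are hypothetical and are not taken from the files on this CD-ROM. The sketch also assumes that every label is explicitly closed and that every feature value is quoted, which full SGML does not require.

```python
import re

# Hypothetical sample: labels and the "id" feature are illustrative only.
SAMPLE = '<doc id="x1"><hl>Prices rise</hl><p>Stocks fell.</p></doc>'

# A start or end label, optionally followed by name="value" feature pairs.
TAG = re.compile(r'<(/?)([A-Za-z][\w.-]*)((?:\s+[\w.-]+\s*=\s*"[^"]*")*)\s*>')
ATTR = re.compile(r'([\w.-]+)\s*=\s*"([^"]*)"')


def parse(text):
    """Return a list of (label, features, children) nodes and text strings.

    Assumes fully bracketed input: every start label has a matching end label.
    """
    root = ("#root", {}, [])
    stack = [root]
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Plain text between two labels belongs to the innermost open label.
            stack[-1][2].append(text[pos:m.start()])
        closing, label, attrs = m.group(1), m.group(2), m.group(3)
        if closing:
            if len(stack) == 1 or stack[-1][0] != label:
                raise ValueError("unexpected end label: " + label)
            stack.pop()
        else:
            node = (label, dict(ATTR.findall(attrs)), [])
            stack[-1][2].append(node)
            stack.append(node)
        pos = m.end()
    if pos < len(text):
        stack[-1][2].append(text[pos:])
    if len(stack) != 1:
        raise ValueError("unclosed label: " + stack[-1][0])
    return root[2]


tree = parse(SAMPLE)
```

Here `tree` holds one node labelled doc, whose features are {"id": "x1"} and whose children are the hl and p nodes with their enclosed text. Files that rely on SGML tag minimization or unquoted feature values would need a real SGML parser rather than this sketch.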

The value of the ACL/DCI will be enhanced if the results of research using these materials are fed back into the collection. If you have "enhanced" any of the material in ways that you believe would be valuable for other research workers, please contact Mark Liberman (DCI) at the address below before forwarding the data.

Because of restrictions imposed by some of the providers of text, researchers who acquire our data must sign a statement reading as follows:

ACL/DCI USER AGREEMENT

This statement describes the terms of an agreement between the person whose signature is affixed below (hereafter called "the user") and the Association for Computational Linguistics (ACL), in which the user will receive material, as specified below, from the ACL's Data Collection Initiative (ACL/DCI).

The ACL/DCI is an activity which collects machine-readable text for the purpose of scientific and humanistic research, and distributes it at cost and without royalties.

Under this agreement, the user will receive a machine-readable copy of the material specified below. The user agrees that the material received under this agreement will be used only for research purposes. If the user is part of a research group, he or she further agrees to inform everyone in that research group that access to the material requires that the person abide by the terms of this agreement. The user further agrees not to re-distribute the material to others outside of the user's research group.

The user acknowledges that some of the material, as specified below, is subject to copyright restrictions, and that violations of such restrictions may result in legal liability. The user agrees to refrain from violating the copyright restrictions, and to notify all associates who access the material of the copyright restrictions.

In directory WSJ:
Wall Street Journal Materials, Copyright 1987, 1988, 1989 Dow Jones Inc.

In directory CED1:
Collins English Dictionary, Copyright 1979 William Collins Sons & Co. Ltd.

In directory DOE:
Scientific abstracts provided by the U.S. Department of Energy

In directory TREEBANK:
A variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania, Copyright 1990, 1991 University of Pennsylvania

Copyright for format modifications to any of the materials on this CD-ROM is assigned to the Association for Computational Linguistics.

We interpret the aim of the ACL/DCI User Agreement, and of our efforts in providing this data, as follows:

The aim of the Data Collection Initiative of the Association for Computational Linguistics is to oversee the acquisition and preparation of a large text corpus, to be made available for scientific research without royalties. All copyrighted materials submitted for inclusion in the collection remain the exclusive property of the copyright holders for all other purposes. You should not redistribute the data that you get from us, nor should you sell it, or charge for access to it, or otherwise put it to any direct commercial use. However, commercial application of "analytical materials" derived from the text, such as statistical tables or grammar rules, is explicitly permitted, as long as copyright law is observed.

The copyright holders have been very generous with their donations. It is not our intent to deprive them of any revenues that they should have received in the ordinary course of their business. Thus it would be a violation of trust, as well as a violation of copyright law, for you to republish a dictionary or other work distributed under this agreement, whether in print or electronic form.

If you have an idea for a new product that would infringe on the property rights of one of our benefactors, please communicate directly with them so as to work out a mutually rewarding arrangement for putting your idea into commercial practice.

ACL/DCI MEMBERSHIP

The current members of the ACL/DCI are Robert Amsler (Bellcore), Ken Church (AT&T Bell Laboratories), Ed Fox (Virginia Polytechnic Institute & State University), Carole Hafner (Northeastern University), Judy Klavans (IBM TJ Watson Research Center), Mark Liberman (University of Pennsylvania), Mitch Marcus (University of Pennsylvania), Bob Mercer (IBM TJ Watson Research Center), Jan Pedersen (Xerox PARC), Paul Roossin (IBM TJ Watson Research Center), Don Walker (Bellcore), Susan Warwick (ISSCO), and Antonio Zampolli (University of Pisa). Liberman is chairing the committee.

Since we want the ACL/DCI Collection to grow, people who have text that they can contribute, or ideas about sources for such contributions, should contact the Linguistic Data Consortium (address given below).

Linguistic Data Consortium
441 Williams Hall
University of Pennsylvania
Philadelphia, PA 19104-6305
(+1 215) 898-0464
ldc@unagi.cis.upenn.edu
FAX: (+1 215) 573-2175