This file contains documentation on the ANC Second Release, Linguistic Data Consortium (LDC) catalog number LDC2005T35 and ISBN 1-58563-369-0.
The American National Corpus (ANC) project fosters the development of a corpus of American English comparable to the British National Corpus (BNC). Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English because of the numerous differences in how the language is used.
The availability of a corpus of American English will significantly contribute to language and linguistic research, the development of language understanding computer applications (e.g., language translation and search and retrieval software), and the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.
ANC Second Release contains over 20 million words: more than 10 million words new to the Second Release, plus a corrected and validated version of the 11-million-word ANC First Release. The Second Release also contains software for searching and retrieving multiple stand-off annotations.
ANC Second Release contains texts from the following sources (* denotes new source in the Second Release):
- Transcribed telephone speech (LDC and Project MORE)
- The New York Times
- Berlitz Travel Guides (Langenscheidt Publishers)
- Slate Magazine (Microsoft)
- ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural Communication)*
- The Michigan Corpus of Academic Spoken English (MICASE) (University of Michigan, English Language Institute)*
- Various non-fiction
- Various fiction (Orin Hargraves, Ferd Eggan)*
- Various medical research articles (BioMed Central, Public Library of Science)*
- Anonymized posts to the Phoenix Board/Buffistas.org*
ANC Second Release contains data governed under two types of licenses: an open license and a restricted license. Both the Open License Agreement and the Restricted License Agreement must be signed in order to receive ANC Second Release, and the data must be used in accordance with the agreement by which it is governed.
The ANC will ultimately contain a core corpus of at least 100 million words, including both written and spoken (transcripts) data comparable across genres to the BNC. The genres in the ANC will be expanded to include new types of language data that have become available in recent years, such as web blogs and web pages, chats, email, and rap music lyrics. In addition to the core 100 million words, the ANC will include an additional component of potentially several hundreds of millions of words, chosen to provide both the broadest and largest selection of data possible.
The American National Corpus is being developed with the help of a consortium of publishers of American English dictionaries and companies with interests in language processing, formed in 1999. Consortium members are providing materials for inclusion in the corpus and provided initial financial support for the project.
Additional documentation and information are available on the ANC web site at http://americannationalcorpus.org/2ndrelease.html.
For examples of the various types of data in this corpus, please review the files listed below.
The publication of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-98009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.
Portions © 2002 New York Times, © 2003 Langenscheidt Publishers, © 1996-2000 Microsoft, Inc., © 1999, 2001, 2003 Oxford University Press, © 2003 Word, Inc., © 1998-2005 Orin Hargraves, © 2004 Ferd Eggan, © 2003 Indiana Center for Intercultural Communication, © 1999-2002, English Language Institute, the University of Michigan, © 2003, 2005 American National Corpus Project, © 1993, 1997, 2003, 2005 Trustees of the University of Pennsylvania