American National Corpus (ANC) Second Release

Item Name:	American National Corpus (ANC) Second Release
Author(s):	Randi Reppen, Nancy Ide, Keith Suderman
LDC Catalog No.:	LDC2005T35
ISBN:	1-58563-369-0
ISLRN:	797-978-576-065-6
DOI:	https://doi.org/10.35111/251h-g440
Release Date:	December 15, 2005
Member Year(s):	2005
DCMI Type(s):	Text
Data Source(s):	journal articles, news magazine, newswire, telephone speech, varied, web collection
Project(s):	American National Corpus (ANC), Talkbank
Application(s):	natural language processing
Language(s):	English
Language ID(s):	eng
License(s):	American National Corpus 2nd Release - Open; American National Corpus 2nd Release - Restricted
Online Documentation:	LDC2005T35 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Reppen, Randi, Nancy Ide, and Keith Suderman. American National Corpus (ANC) Second Release LDC2005T35. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:	hasAnnotation: LDC2010T22 Manually Annotated Sub-Corpus First Release; LDC2013T12 Manually Annotated Sub-Corpus Third Release

Introduction

American National Corpus (ANC) Second Release was developed by various contributors and contains approximately 22 million words of American English text from multiple genres, with various annotations such as part-of-speech (POS) tagging.

The American National Corpus (ANC) project fosters the development of a corpus of American English comparable to the British National Corpus (BNC). Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English because of the numerous differences in how the language is used. The ANC is being developed with help from a consortium, formed in 1999, of American English dictionary publishers and companies interested in language processing. Consortium members are providing materials for inclusion in the corpus and provided initial financial support for the project.

The availability of a corpus of American English will significantly contribute to language and linguistic research, the development of language understanding computer applications (e.g., language translation and search and retrieval software), and the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.

Data

In addition to the more than 10 million words added in the Second Release, this corpus contains a new, corrected and validated version of the 11-million-word ANC First Release and software for searching and retrieving multiple stand-off annotations.

ANC Second Release contains texts from the following sources (* denotes new source in the Second Release):

Transcribed telephone speech
The New York Times
Berlitz Travel Guides
Slate Magazine
ICIC Corpus of Fundraising Texts *
The Michigan Corpus of Academic Spoken English (MICASE) *
Various non-fiction
Various fiction *
Various medical research articles *
Anonymized posts to the Phoenix Board/Buffistas.org *

The corpus includes the data as a UTF-16 encoded file plus annotations of the documents such as automatic POS tagging with two different types of tagsets, automatic noun and verb phrase identification, and structural information at the paragraph and sentence level. The goal of the ANC is to ultimately contain a core corpus of at least 100 million words, including both written and spoken data (transcripts) comparable across genres to the BNC.
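
As a rough illustration of how UTF-16 primary text and stand-off annotations fit together, the following Python sketch reads a UTF-16 file and resolves character-offset spans against it. It is only a sketch under stated assumptions: the (start, end, label) tuple layout, the function names, and the sample sentence are hypothetical and are not taken from the ANC distribution, whose annotation files are XML and should be read according to the ANC documentation.

```python
import os
import tempfile


def read_utf16_text(path):
    """Read a UTF-16 encoded text file; a leading BOM selects the byte order.

    newline="" keeps line endings untouched so character offsets stay aligned.
    """
    with open(path, "r", encoding="utf-16", newline="") as handle:
        return handle.read()


def resolve_annotations(text, annotations):
    """Pair each (start, end, label) span with the substring it points to."""
    return [(text[start:end], label) for start, end, label in annotations]


# Build a small UTF-16 file so the sketch runs without the corpus itself.
sample_text = "The corpus grows."
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text.encode("utf-16"))  # writes a BOM followed by the text
    sample_path = tmp.name

primary = read_utf16_text(sample_path)
os.remove(sample_path)

# Hypothetical stand-off annotations: character offsets plus a Penn-style tag.
sample_annotations = [(0, 3, "DT"), (4, 10, "NN"), (11, 16, "VBZ"), (16, 17, ".")]
tagged = resolve_annotations(primary, sample_annotations)
print(tagged)
```

Because the annotations refer to the text only by offset, several annotation layers (for example, two POS tagsets) can point at the same unmodified primary file.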

ANC Second Release contains data governed under two types of licenses, an open license and a restricted license. Both the Open License Agreement and the Restricted License Agreement need to be signed in order to receive ANC Second Release, and the data must be used in accordance with the agreement by which it is governed.

Additional documentation and information are available at the ANC web site.

Samples

For examples of the data in this corpus, please review this plain text sample (TXT) and its POS annotation with Penn tagset (XML).

Updates

None at this time.

Sponsorship

The publication of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-98009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.

Copyright

Portions © 2002 New York Times, © 2003 Langenscheidt Publishers, © 1996-2000 Microsoft, Inc., © 1999, 2001, 2003 Oxford University Press, © 2003 Word, Inc., © 1998-2005 Orin Hargraves, © 2004 Ferd Eggan, © 2003 Indiana Center for Intercultural Communication, © 1999-2002, English Language Institute, the University of Michigan, © 2003, 2005 American National Corpus Project, © 1993, 1997, 2003, 2005 Trustees of the University of Pennsylvania