Manually Annotated Sub-Corpus Third Release


Item Name: Manually Annotated Sub-Corpus Third Release
Authors: Nancy Ide, Keith Suderman, Collin Baker, Rebecca Passonneau, Christiane Fellbaum
LDC Catalog No.: LDC2013T12
ISBN: 1-58563-647-9
Release Date: Jul 17, 2013
Data Type: text
Data Source(s): email, fiction, government documents, journal entries, newswire, telephone speech, transcribed speech, web collection, weblogs
Project(s): American National Corpus (ANC)
Application(s): natural language processing
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2013 members
Non-member Fee: US $0.00
Reduced-License Fee: N/A
Extra-Copy Fee: US $
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Nancy Ide, et al. 2013. Manually Annotated Sub-Corpus Third Release. Linguistic Data Consortium, Philadelphia.

Introduction

Manually Annotated Sub-Corpus (MASC) Third Release was developed as part of The American National Corpus project and consists of approximately 500,000 words of contemporary American English written and spoken data annotated for a wide variety of linguistic phenomena.

The MASC project was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. The project provides appropriate data and annotations to serve as the base for a community-wide annotation effort, together with an infrastructure that enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or transduced to any of a variety of other formats. The aim is to offset some of the high costs of producing high-quality linguistic annotations by distributing the effort, and to solve some of the usability problems for annotations produced at different sites by harmonizing their representation formats. MASC also provides data from a much wider variety of genres than are often present in existing multiply-annotated corpora of English, and all of the data in the corpus are drawn from current American English so as to be most useful for natural language processing applications used in the web-based environment. Further information about the project is available at the MASC website.

The source texts were drawn from two collections: the open portion of the American National Corpus Second Release, which includes written texts and spoken transcripts of American English from a broad range of genres produced since 1990, and the Language Understanding Annotation Corpus, a collection of various genres including broadcast, newswire, email, and telephone speech annotated for committed belief, event and entity coreference, dialog acts, and temporal relations.

MASC Third Release includes the contents of MASC First Release (LDC2010T22) (82,000 words), which is also available from LDC. There is no second release.

Data

All data in this release was annotated for logical structure (paragraphs, headings, etc.), token and sentence boundaries, part of speech and lemma, shallow parse (noun and verb chunks), and named entities (person, organization, location, and date). Portions of the corpus were also annotated for FrameNet frames (40k full text), Penn Treebank syntax (82k), and opinion (50k). All annotations were either manually produced or hand-validated and are represented in ISO-GrAF standoff format. The texts were derived from original electronic versions in a wide variety of formats, including but not limited to QuarkXPress, XML, Microsoft Word, Portable Document Format (PDF), HTML, and plain text. Transduction procedures varied depending on the original format.

As little correction or other editorial modification as possible was applied to the text. Corrections were either recorded in standoff documents containing the corrected version or reflected in the values of the segmentation, token, sentence, or other segmental-unit annotations and/or the part-of-speech annotation.

The data are segmented into minimal regions spanning the primary data. A minimal region is the smallest unit referenced by any of the tokenizations applied to the data. Token annotations reference these regions as appropriate. Sentences reference regions in the primary data.
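
The idea of minimal regions can be illustrated with a small Python sketch. It is not the ISO-GrAF format and not the code used to build the corpus. The sample text, the two tokenizations, and the function names (minimal_regions, token_to_regions) are all hypothetical, chosen only to show how tokens from different tokenizations can point at a shared set of smallest spans over unmodified primary data.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def minimal_regions(tokenizations: List[List[Span]]) -> List[Span]:
    """Return the smallest spans needed by any of the given tokenizations."""
    # Collect every boundary offset used by any token in any tokenization.
    cuts = sorted({off for toks in tokenizations for span in toks for off in span})
    regions = []
    # Adjacent boundaries delimit candidate regions; keep only those that
    # lie inside at least one token (this drops inter-token whitespace).
    for start, end in zip(cuts, cuts[1:]):
        if any(s <= start and end <= e for toks in tokenizations for s, e in toks):
            regions.append((start, end))
    return regions


def token_to_regions(token: Span, regions: List[Span]) -> List[int]:
    """Return the indices of the regions that a token covers."""
    s, e = token
    return [i for i, (rs, re_) in enumerate(regions) if s <= rs and re_ <= e]


# Primary data is never modified; annotations only refer to offsets in it.
primary = "can't stop"

# Hypothetical tokenization A keeps "can't" whole; B splits it as "ca" + "n't".
tok_a = [(0, 5), (6, 10)]
tok_b = [(0, 2), (2, 5), (6, 10)]

regions = minimal_regions([tok_a, tok_b])
region_text = [primary[s:e] for s, e in regions]
a_refs = [token_to_regions(t, regions) for t in tok_a]
```

Here the token "can't" in tokenization A references two regions, while each token in tokenization B references exactly one, so both tokenizations can coexist over the same primary data.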

Samples

Please consult this email sample and telephone sample.

Updates

None at this time.

Content Copyright

Portions © 2003, 2005, 2013 American National Corpus Project, © 2000 The Associated Press, © 1987-1989 Dow Jones & Company, Inc., © 1999-2002 English Language Institute, the University of Michigan, © 2004 Ferd Eggan, © 2003 Indiana Center for Intercultural Communication, © 2003 Langenscheidt Publishers, © 1996-2000 Microsoft, Inc., © 2000, 2002 New York Times, © 1999, 2001, 2003 Oxford University Press, © 2003 Word, Inc., © 1998-2005 Orin Hargraves, © 1993, 1997-2003, 2005, 2010, 2013 Trustees of the University of Pennsylvania