Manually Annotated Sub-Corpus First Release

Item Name: Manually Annotated Sub-Corpus First Release
Author(s): Nancy Ide, Keith Suderman, Collin Baker, Rebecca Passonneau, Christiane Fellbaum
LDC Catalog No.: LDC2010T22
ISBN: 1-58563-569-3
ISLRN: 461-028-050-892-8
Release Date: December 20, 2010
Member Year(s): 2010
DCMI Type(s): Text
Data Source(s): web collection, varied, transcribed speech, telephone speech, newswire, email
Project(s): American National Corpus (ANC)
Application(s): natural language processing
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2010T22 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Ide, Nancy, et al. Manually Annotated Sub-Corpus First Release LDC2010T22. Web Download. Philadelphia: Linguistic Data Consortium, 2010.

Introduction

The Manually Annotated Sub-Corpus First Release (MASC I), Linguistic Data Consortium (LDC) catalog number LDC2010T22 and ISBN 1-58563-569-3, is the first of three releases of 500,000 words of MASC data developed as part of the American National Corpus (ANC) project. MASC I consists of approximately 80,000 words of contemporary spoken and written American English annotated for a variety of linguistic phenomena. The MASC project is sponsored by the National Science Foundation and was established to address, to the extent possible, many of the obstacles to creating large-scale, robust, multiply-annotated corpora of English that cover a wide range of genres of written and spoken language data. Researchers from Vassar College, Columbia University and the International Computer Science Institute, University of California at Berkeley are the principal participants; the WordNet project provides consulting.

The source texts in MASC I are drawn from two corpora. The first is the open portion of the American National Corpus (ANC) Second Release LDC2005T35, which includes written texts and spoken transcripts of American English from a broad range of genres produced since 1990. The second is the Language Understanding Annotation Corpus LDC2009T09 (LU Corpus), a collection of various genres including broadcast, newswire, email and telephone speech annotated for committed belief, event and entity coreference, dialog acts and temporal relations. All of the data in MASC I has validated annotations for tokens, parts of speech, sentence boundaries, noun chunks, verb chunks, named entities and Penn Treebank syntax. Full-text FrameNet annotations are available for seventeen texts, and WordNet word sense annotations are available for 1,000 occurrences of each of fifty-three words. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects. Software and services available from the ANC project website enable transduction of MASC into a wide variety of physical formats.

Data

The MASC directory contains two folders: masc-1.0.3 and masc_wordsense. masc-1.0.3 contains the actual MASC corpus and consists of two folders, spoken and written. The spoken folder contains data and annotations for spoken material, and the written folder contains the same for written texts. The files in each of the respective folders have naming conventions that describe the contents of the file.

masc_wordsense contains the MASC sentence samples with word sense annotations using WordNet sense numbers as the annotation values.
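
The folder layout described above can be explored with a short script. The following is only a minimal sketch in Python: the function name count_masc_files and the location of the MASC directory are hypothetical, and the only names taken from this description are the folders masc-1.0.3, spoken and written. It makes no assumptions about the file naming conventions, which are covered in the corpus documentation.

```python
from collections import Counter
from pathlib import Path


def count_masc_files(masc_root):
    """Count regular files under the spoken and written folders of masc-1.0.3.

    masc_root is the (hypothetical) path to the unpacked MASC directory,
    i.e. the folder that holds masc-1.0.3 and masc_wordsense.
    """
    corpus_dir = Path(masc_root) / "masc-1.0.3"
    counts = Counter()
    for modality in ("spoken", "written"):
        folder = corpus_dir / modality
        if not folder.is_dir():
            # Folder not present in this copy; leave it out of the result.
            continue
        # Walk the folder recursively and count files, not directories.
        counts[modality] = sum(1 for p in folder.rglob("*") if p.is_file())
    return dict(counts)
```

Calling count_masc_files on the unpacked MASC directory returns a dictionary with one file count per folder that is present. This gives a quick check that both the spoken and written material unpacked correctly.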

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T22.

Contact: ldc@ldc.upenn.edu
© 2010 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.