Manually Annotated Sub-Corpus First Release


Item Name: Manually Annotated Sub-Corpus First Release
Authors: Nancy Ide, Keith Suderman, Collin Baker, Rebecca Passonneau, Christiane Fellbaum
LDC Catalog No.: LDC2010T22
ISBN: 1-58563-569-3
Release Date: Dec 20, 2010
Data Type: text
Data Source(s): email, newswire, telephone speech, transcribed speech, varied, web collection
Project(s): American National Corpus (ANC)
Application(s): natural language processing
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2010 members
Non-member Fee: US $0.00
Reduced-License Fee: US $0.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Nancy Ide, et al.
2010
Manually Annotated Sub-Corpus First Release
Linguistic Data Consortium, Philadelphia

Introduction

The Manually Annotated Sub-Corpus First Release (MASC I), Linguistic Data Consortium (LDC) catalog number LDC2010T22 and ISBN 1-58563-569-3, is the first of three releases of 500,000 words of MASC data developed as part of the American National Corpus (ANC) project. MASC I consists of approximately 80,000 words of contemporary spoken and written American English annotated for a variety of linguistic phenomena. The MASC project is sponsored by the National Science Foundation and was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. Researchers from Vassar College, Columbia University and the International Computer Science Institute, University of California at Berkeley are the principal participants, and the WordNet project provides consulting.

The source texts in MASC I are drawn from two sources: the open portion of the American National Corpus (ANC) Second Release (LDC2005T35), which includes written texts and spoken transcripts of American English from a broad range of genres produced since 1990, and the Language Understanding Annotation Corpus (LU Corpus, LDC2009T09), a collection of various genres including broadcast, newswire, email and telephone speech annotated for committed belief, event and entity coreference, dialog acts and temporal relations. All of the data in MASC I has validated annotations for tokens, parts of speech, sentence boundaries, noun chunks, verb chunks, named entities and Penn Treebank syntax. Full-text FrameNet annotations are available for seventeen texts, and WordNet word sense annotations are available for 1000 occurrences of each of fifty-three words. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects. Software and services available from the ANC project website enable transduction of MASC into a wide variety of physical formats.

Data

The MASC directory contains two folders: masc-1.0.3 and masc_wordsense. masc-1.0.3 contains the actual MASC corpus and consists of two folders, spoken and written. The spoken folder contains data and annotations for spoken material, and the written folder contains the same for written texts. The files in each of the respective folders have naming conventions that describe the contents of the file.

masc_wordsense contains the MASC sentence samples with word sense annotations using WordNet sense numbers as the annotation values.
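As a rough orientation to the layout described above, the following Python sketch counts the files found under each of the named folders. Only the folder names (masc-1.0.3, spoken, written, masc_wordsense) come from this description; the function name summarize_masc and the root path argument are hypothetical, and no assumption is made about file names or extensions, which are not specified here.

```python
from collections import Counter
from pathlib import Path


def summarize_masc(root):
    """Count files under the folders named in the Data section.

    `root` is assumed to be the unpacked MASC directory; the function
    name and this argument are illustrative, not part of the release.
    """
    root = Path(root)
    counts = Counter()

    # masc-1.0.3 holds the corpus itself, split into spoken and written.
    for modality in ("spoken", "written"):
        folder = root / "masc-1.0.3" / modality
        if folder.is_dir():
            counts[modality] = sum(1 for p in folder.rglob("*") if p.is_file())

    # masc_wordsense holds the sentence samples with WordNet sense annotations.
    wordsense = root / "masc_wordsense"
    if wordsense.is_dir():
        counts["masc_wordsense"] = sum(
            1 for p in wordsense.rglob("*") if p.is_file()
        )

    # Folders that are missing are simply left out of the result.
    return dict(counts)
```

Calling summarize_masc with the path of the unpacked download returns a dictionary keyed by folder name, which gives a quick check that both the corpus and the word sense samples were extracted.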

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T22.

Content Copyright

Portions © 2000 The Associated Press, © 1987-1989 Dow Jones & Company, Inc., © 2000 New York Times, © 1997-2002, 2010 Trustees of the University of Pennsylvania

Contact: ldc@ldc.upenn.edu © 2010 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.