2001 Topic Annotated Enron Email Data Set


Item Name: 2001 Topic Annotated Enron Email Data Set
Authors: Dr. Michael W. Berry, Murray Browne and Ben Signer
LDC Catalog No.: LDC2007T22
ISBN: 1-58563-441-7
Release Date: Jun 20, 2007
Data Type: text
Data Source(s): email
Application(s): information retrieval, metadata extraction, topic detection and tracking
Language(s): English
Language ID(s): eng
Distribution: 1 DVD, Web Download
Member Fee: $0 for 2007 members
Non-member Fee: US $1000.00
Reduced-License Fee: US $500.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Dr. Michael W. Berry, Murray Browne and Ben Signer. 2007. 2001 Topic Annotated Enron Email Data Set. Linguistic Data Consortium, Philadelphia.

Introduction

The 2001 Topic Annotated Enron Email Data Set contains 4,936 emails (approximately 5,000) from Enron Corporation (Enron), manually indexed into 32 topics. It is a subset of the original Enron Email Data Set of 1.5 million emails, which was posted on the Federal Energy Regulatory Commission website as a matter of public record during the investigation of Enron. The original set suffered from document integrity problems; attempts were made to improve the quality of the data and to remove some sensitive and private information. Dr. William Cohen of Carnegie Mellon University took the lead in distributing the improved corpus, which consists of 517,431 Enron employee emails covering the period 1999-2002.

This corpus is a subset of the Carnegie Mellon data set and covers the period from January 2001 to December 2001. The email topics reflect the business activities and interests of Enron employees in that year: California energy problems and the subsequent state and Federal investigations, Enron's downfall (newsfeeds and interoffice communications), Enron's venture with the Dabhol India Power Company, EnronOnline (Enron's trading infrastructure), competitors (Dynegy, El Paso Pipeline) and even fantasy football and college football. Eliminated from this data set are duplicates, emails that are too small, and emails that represent a type of message rather than a topic (personnel memos and personal quips). The manual indexing was performed in the summer of 2006 by two people who worked closely together: a research associate familiar with the Enron saga and a junior in economics at the University of Tennessee.

The original Enron Email Data Set is the first large email set made available to researchers, but until now there has been no way to assess the performance of topic detection and tracking algorithms on it. An annotated subset such as this one should give text mining researchers a way to evaluate the accuracy of new algorithms for clustering and classification. This data set can also provide communication context for researchers using the Enron Email Data Set in social network analysis. Previous annotations, such as the one developed at UC Berkeley, have been based primarily on email type rather than the specific topic(s) of discussion. This annotation can be used to qualify the discussion topics between individuals and groups comprising a social network of Enron employees.
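
As one illustration of how manual topic labels can be used to score a clustering algorithm, the sketch below computes cluster purity: the fraction of emails whose cluster's most common topic matches their own annotated topic. It is only a minimal example and not a method prescribed by this corpus. The function name, the topic strings, and the assumption that each email carries a single topic label are all hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, topic_labels):
    """Return the purity of a clustering against reference topic labels.

    cluster_ids[i] is the cluster assigned to email i by some algorithm;
    topic_labels[i] is the manually annotated topic of email i
    (assumed here to be a single label per email).
    """
    if len(cluster_ids) != len(topic_labels):
        raise ValueError("cluster_ids and topic_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one email is required")

    # Count how often each topic occurs inside each cluster.
    topics_by_cluster = defaultdict(Counter)
    for cluster, topic in zip(cluster_ids, topic_labels):
        topics_by_cluster[cluster][topic] += 1

    # Each cluster contributes the size of its most common topic.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in topics_by_cluster.values()
    )
    return majority_total / len(topic_labels)


# Hypothetical toy data: six emails, two clusters, made-up topic names.
example_clusters = [0, 0, 0, 1, 1, 1]
example_topics = [
    "california", "california", "dabhol",
    "football", "football", "california",
]
example_purity = cluster_purity(example_clusters, example_topics)
```

Purity rewards clusters dominated by a single topic but also rises as the number of clusters grows, so it is usually reported alongside other measures.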

Due to the complexity of this corpus's directory structure, it will be distributed as a compressed tar file on a CD. Most compression utilities can uncompress the package.
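
For users who prefer to unpack the archive programmatically, the following is a minimal sketch using Python's standard tarfile module. The archive name and destination directory in the usage note are placeholders, since the actual file name on the distribution media is not stated here.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir):
    """Extract a (possibly compressed) tar archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    dest_root = dest.resolve()

    # Mode "r:*" lets tarfile detect gzip, bzip2, or xz compression itself.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        # Refuse members that would be written outside the destination.
        for name in names:
            target = (dest / name).resolve()
            if target != dest_root and dest_root not in target.parents:
                raise ValueError(f"unsafe path in archive: {name}")
        archive.extractall(dest)
    return names
```

A call such as `extract_corpus("corpus.tar.gz", "enron_2001")` (both names hypothetical) would unpack the directory tree under `enron_2001`.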

Updates

An update is available via web download. This update corrects a small error in the subject annotation file. Members and licensees who received this publication prior to Aug 13, 2007 should download this correction. All copies issued since that date have been corrected.

Content Copyright

Portions © 2006, 2007 Dr. Michael W. Berry, © 2007 Trustees of the University of Pennsylvania