2001 Topic Annotated Enron Email Data Set

Item Name: 2001 Topic Annotated Enron Email Data Set
Author(s): Dr. Michael W. Berry, Murray Browne, Ben Signer
LDC Catalog No.: LDC2007T22
ISBN: 1-58563-441-7
ISLRN: 171-422-435-824-5
Release Date: June 20, 2007
Member Year(s): 2007
DCMI Type(s): Text
Data Source(s): email
Application(s): topic detection and tracking, metadata extraction, information retrieval
Language(s): English
Language ID(s): eng
License(s): 2001 Topic Annotated Enron Email Data Set Agreement
Online Documentation: LDC2007T22 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Dr. Michael W. Berry, Murray Browne, and Ben Signer. 2001 Topic Annotated Enron Email Data Set LDC2007T22. Web Download. Philadelphia: Linguistic Data Consortium, 2007.

Introduction

The 2001 Topic Annotated Enron Email Data Set contains 4,936 emails from Enron Corporation (Enron) manually indexed into 32 topics. It is a subset of the original Enron Email Data Set of 1.5 million emails that was posted on the Federal Energy Regulatory Commission website as a matter of public record during the investigation of Enron. The original set suffered from document integrity problems; attempts were made to improve the quality of the data and to remove some sensitive and private information. Dr. William Cohen of Carnegie Mellon University took the lead in distributing the improved corpus, consisting of 517,431 Enron employee emails that covered the period 1999-2002.

This corpus is a subset of the Carnegie Mellon data set and covers the period from January 2001 to December 2001. The email topics reflect the business activities and interests of Enron employees in that year: California energy problems and the subsequent state and Federal investigations, Enron's downfall (newsfeeds and interoffice communications), Enron's venture with the Dabhol India Power Company, EnronOnline (Enron's trading infrastructure), competitors (Dynegy, El Paso Pipeline) and even fantasy football and college football. Three kinds of email were eliminated from this data set: duplicates, emails that are too short, and emails that represent a message type rather than a topic (personnel memos and personal quips). The manual indexing was performed in the summer of 2006 by two people who worked closely together: a research associate familiar with the Enron saga and a junior in economics at the University of Tennessee.

The original Enron Email Data Set is the first large email set made available to researchers, but until now there has been no ability to assess the performance of topic detection and tracking algorithms with the email set. Having an annotated subset such as this one should provide text mining researchers with a way to evaluate the accuracy of new algorithms for clustering and classification. This data set can also be used to provide communication context for researchers using the Enron Email Data Set in social network analysis. Previous annotations such as the one developed at UC Berkeley have been primarily based on email type rather than the specific topic(s) of discussion. This annotation can be used to qualify the discussion topics between individuals and groups comprising a social network of Enron employees.
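As one illustration of how topic annotations can be used to score a clustering algorithm, the sketch below computes cluster purity, a common measure that is not prescribed by this corpus. It assumes each email has been reduced to a single topic label; the function name, the message identifiers, and the dictionary layout are hypothetical and do not reflect the format of the annotation files.

```python
from collections import Counter, defaultdict


def cluster_purity(predicted, gold):
    """Return the purity of a clustering against gold topic labels.

    predicted: dict mapping a message id to a cluster id (hypothetical layout).
    gold:      dict mapping a message id to one annotated topic (assumption:
               a single topic per message).
    """
    # Score only the messages that have both a cluster and a gold topic.
    shared = predicted.keys() & gold.keys()
    if not shared:
        raise ValueError("no messages in common between predicted and gold")

    # Count how often each gold topic appears inside each cluster.
    topics_by_cluster = defaultdict(Counter)
    for message_id in shared:
        topics_by_cluster[predicted[message_id]][gold[message_id]] += 1

    # Each cluster contributes the size of its most frequent gold topic.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in topics_by_cluster.values()
    )
    return majority_total / len(shared)
```

A purity of 1.0 means every cluster contains emails from a single annotated topic. Purity rewards many small clusters, so it is usually reported alongside other measures.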

Due to the complexity of this corpus's directory structure, it will be distributed as a compressed tar file on a CD. Most compression utilities will uncompress the package.
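For users who prefer to unpack the archive from a script, the following is a minimal sketch using Python's standard tarfile module. The function name and any paths passed to it are hypothetical; the actual archive file name and compression format are not stated here, so the sketch lets tarfile detect the compression.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a (possibly compressed) tar archive and return its member names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)

    # Mode "r:*" lets tarfile detect gzip, bzip2, or xz compression on its own.
    with tarfile.open(archive_path, mode="r:*") as archive:
        members = archive.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse any member whose path would land outside dest_dir.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)

    return [member.name for member in members]
```

Calling the function with the path to the downloaded archive and an empty destination directory recreates the corpus directory tree under that directory.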

Updates

As of August 13, 2007, an update corrects a small error in the subject annotation file. Members and licensees who received this publication prior to August 13, 2007 should re-download the corpus. All copies issued since this date have been corrected.
