Avocado Research Email Collection
Version 1.0.3

Douglas W. Oard      <oard@umd.edu>
William Webber       <wew@umiacs.umd.edu>
David A. Kirsch      <dkirsch@umd.edu>
Sergey Golitsynskiy  <sergeig@umd.edu>

README

Date:  Mon Nov 24 10:18:44 EST 2014

1. INTRODUCTION
===============

The Avocado Research Email Collection ("the Avocado collection") is a
corpus of emails and attachments, distributed for use in research and
development in e-discovery, social network analysis, and related
fields.  Please carefully read and adhere to the usage agreements
before beginning work on the collection.  All users must personally
sign the Avocado Collection End User Agreement before working with the
collection, and the collection must be stored on a password protected
computer in a way that prevents access to the collection by anyone who
has not personally signed the Avocado Collection End User Agreement.

The Avocado collection consists of records taken from the PST
files of accounts of a now-defunct IT company.  We refer to this
company using the pseudonym "AvocadoIT"; references to the
company name in the collection are also replaced with this
pseudonym.  A PST, or "Personal Storage Table", file is used by
MS Outlook to store emails, calendar entries, contact details,
and related information.  We will refer to the processed contents
of these PST files as the "personal folders" of these accounts.

The source data for the Avocado collection consisted of the PST files
for 282 accounts.  Most of these accounts are those of employees of
AvocadoIT; the remainder represent shared accounts such as "Leads", or
system accounts such as "Conference Room Upper Canada".  Data was
extracted from these PST files using libpst version 0.6.54.  Three PST
files produced no output: one was corrupt and two were empty.  The
Avocado collection consists of the processed personal folders of the
remaining 279 accounts.  We follow e-discovery practice by referring
to each of these accounts as a "custodian", although some of them do
not correspond to individual humans.

The collection is divided into metadata and text.  The metadata
is represented in XML, with a single top-level XML file listing
the custodians, and then one XML file per custodian listing all
items extracted from that custodian's PST files.  The full XML
tree can be read by loading the top-level file with an XML parser
that handles <xi:include> directives, but the resulting in-memory
DOM tree is very large (over 32GB on our machines), and it may be
more practical to process the per-custodian XML files one at a
time.  We describe the top-level organization of the custodian
XML metadata files in Section 2, and the detailed metadata for
each item in the personal folder in Section 3.
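
As a rough illustration of processing one custodian file at a
time, the Python sketch below streams <item> elements with the
standard library's iterparse and clears each one after use.  The
inline sample stands in for a real per-custodian file, and the
function name is our own; this is a sketch under those
assumptions, not a tested loader for the full collection.

```python
import io
import xml.etree.ElementTree as ET

def iter_items(source):
    """Yield (id, type) for each <item> in a custodian XML stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.get("id"), elem.get("type")
            # Drop the item's children and attributes once consumed.
            # The emptied element itself stays attached to its parent.
            elem.clear()

# Tiny stand-in for a per-custodian file; real files are far larger.
SAMPLE = b"""<custodian id="001" items="2">
  <items>
    <item id="001-000001-EM" type="email"><metadata/></item>
    <item id="001-000001_1-AT" type="attachment"></item>
  </items>
</custodian>"""

items = list(iter_items(io.BytesIO(SAMPLE)))
print(items)
```

In practice the argument would be an open per-custodian XML file
rather than an in-memory buffer.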

The text contains the extracted text of the items in the
custodians' folders, with the extracted text for each item being
held in a separate file.  The text files are then zipped up into
a zip file per custodian.  It may be more efficient to process
the text files directly in their zip folders, rather than
unzipping them first, as the latter may lead to them being
scattered across the physical disk (depending on the behaviour of
your file system).  The contents of the text files are described
in Section 4.  Section 5 describes the redaction performed upon
the collection, and Section 6 provides collection statistics.
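
Reading a text file straight out of its zip file might look like
the following Python sketch.  It assumes that each member is
named "<item-id>.txt" (possibly under a directory prefix) and
that the text is UTF-8; neither is confirmed here, so check both
against your copy.  The in-memory zip and the function name are
illustrative only.

```python
import io
import zipfile

def read_item_text(zip_source, item_id):
    """Return the extracted text for one item, or None if absent."""
    wanted = item_id + ".txt"
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            # Assumption: match on the final path component, since a
            # directory prefix may or may not be present.
            if name.rsplit("/", 1)[-1] == wanted:
                # Assumption: text files decode as UTF-8.
                return zf.read(name).decode("utf-8", errors="replace")
    return None

# Build a small in-memory zip to stand in for a custodian zip file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("001/001-000001-EM.txt", "Subject: hello\n\nbody text\n")

text = read_item_text(buf, "001-000001-EM")
missing = read_item_text(buf, "001-999999-EM")
```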

1A. A note on processing
------------------------

Metadata stored in the PST files is reproduced with minimal
processing (aside from redaction) in the:

   <item>
      <metadata>
         ...
      </metadata>
   </item>

tag for the corresponding item.  For emails, this metadata
was either created natively by Exchange (for EX transport types)
or parsed by Exchange from SMTP headers (for SMTP types).  We
have made no effort to canonicalize email addresses, standardize
headers, align header and metadata fields, regularize metadata
names, or the like.  Metadata fields are reproduced as found
in the PST files, while SMTP headers are given as found in the
email text itself.  

The only modification we have made is to reconstruct minimal
email headers for EX-transport emails (Section 4A.I), for the
convenience of those wishing to index the extracted text with
minimal reference to the XML metadata files.  Even in this case,
however, the user may need to refer back to the XML files for 
other metadata, such as the calculated duplicates (Section
3B.III).

2. CUSTODIAN METADATA
=====================

There is an XML metadata file describing the contents of the
personal folder for each custodian.  The XML data for a custodian
has the following form:

   <custodian id="XXX" items="NNNN">
      <source_file />
      <pst-folders />
      <items />
   </custodian>

The "id" attribute gives the identifier for the custodian, which
is a number assigned in alphabetical order of the name of the
custodian's PST file, counting up from 001.  The "items"
attribute gives a count of the total number of items in the
custodian's PST file.  The child elements are described in the
following subsections.

All XML metadata files are encoded in UTF-8.

2A. Source file
---------------

The source file tag gives basic information about the PST file
that the custodian's personal folder has been extracted from.  It
has the form:

   <source_file size="XXX.XMB">FOO.PST</source_file>

where "FOO.PST" is the name of the PST file, and the "size"
attribute gives the size in MB of the PST file.

2B. PST folder structure
------------------------

PST files have an internal, tree-like, folder organization,
with each PST item being contained within a folder.  The
folder structure is represented as follows:

   <pst-folders>
      <pst-folder id="XXXX" items="N" subfolders="N">
         <name>FOLDER_NAME</name>
         <pst-folder ... >
            ...
         </pst-folder>
         <pst-folder ... >
            ...
         </pst-folder>
      </pst-folder>
   </pst-folders>

The <pst-folder/> elements may be nested to arbitrary depth.
The "items" attribute counts the number of immediately contained
items (not including those in subfolders), and the "subfolders"
attribute counts the number of immediately contained subfolders
(not including sub-subfolders).  The folder "id" is unique across
the collection.  The <name> tag contains the PST folder name.
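
Because folders nest, a full folder path has to be rebuilt by
walking the tree.  The Python sketch below maps each folder id to
a slash-joined path of folder names.  The folder names in the
sample are made up, and the function name is our own.

```python
import xml.etree.ElementTree as ET

def folder_paths(pst_folders):
    """Map each folder id to its slash-joined path of folder names."""
    paths = {}

    def walk(folder, prefix):
        path = prefix + [folder.findtext("name", default="")]
        paths[folder.get("id")] = "/".join(path)
        # findall() with a bare tag matches direct children only.
        for child in folder.findall("pst-folder"):
            walk(child, path)

    for top in pst_folders.findall("pst-folder"):
        walk(top, [])
    return paths

# Made-up sample in the documented shape.
SAMPLE = """<pst-folders>
  <pst-folder id="0001" items="0" subfolders="2">
    <name>Top of Personal Folders</name>
    <pst-folder id="0002" items="5" subfolders="0">
      <name>Inbox</name>
    </pst-folder>
    <pst-folder id="0003" items="2" subfolders="0">
      <name>Sent Items</name>
    </pst-folder>
  </pst-folder>
</pst-folders>"""

paths = folder_paths(ET.fromstring(SAMPLE))
print(paths["0002"])
```

An item's "pst-folder-id" attribute can then be looked up in the
resulting dictionary.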


2C. Items
---------

The content of the personal folder is stored in items of
different types.  The items are listed inside the <items> tag, as
follows:

   <items>
      <item ...>
      </item>
      <item ...>
      </item>
   </items>

Items are not nested; their folder structure is represented
by back references to the <pst-folders> section.  The different
item types, and their common and type-specific metadata, are
described in Section 3.


3. ITEM METADATA
================

All extractable items are extracted from the custodian's personal
folder, though some item types are redacted (see Section 5B).
Retained item types are:

   emails
   attachments
   extracted files
   calendar-like items (appointments, schedules, and tasks)
   contacts
   reports

We describe fields and formats common to all items first, and
then each item type individually.

3A. Common item attributes
--------------------------

3A.I: Item tag
...............

The XML record for each item is an <item> element, of form:

   <item id="XXX-XXXX[_N]-TG" 
         type="(appointment|attachment|contact|email|extracted file|report|schedule|task|journal|other|stickynote)" 
         [duplicates="N"] [replies="N"] [redacted="true"]
         [attachments="N"] [pst-folder-id="XX"]>
      <files />
      <relationships />
      <file-data />
      <container-data />
      <metadata />
   </item>

The attribute "id" gives the item id (Section 3A.II), and "type"
gives the type of the item.  The last three types are redacted
(Section 5B).  The item tag may also have the attribute
"duplicates" for the count of duplicates (see Section 3B.III for
email duplicates, 3C.III for attachment duplicates, and 3D.III
for extracted file duplicates); "replies" for the count of
replies, if the item is an email (see Section 3B.II); "redacted"
if the item is fully redacted (see Section 5); "attachments" for
the number of attachments (only if not itself an attachment or
extracted file) (Section 3C); and "pst-folder-id" for the PST
folder that this item was found in (not for extracted files)
(Section 2).

3A.II: Item id
.............

Each item in the collection is assigned an item id.  Excepting
attachments (Section 3C) and extracted files (Section 3D), these
ids have the form:

    <cust-id>-<item-num>-<type-tag>

Attachments extend this by sub-numbering:

    <cust-id>-<item-num>_<att-num>-<type-tag>

and extracted files sub-number the attachments (possibly with
nesting):

    <cust-id>-<item-num>_<att-num>_<file-num>-<type-tag>

The custodian IDs count up sequentially from 001 (3 digits),
while the item numbers count up sequentially from 000001
(6 digits), for each custodian.
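
The id forms above can be split with a single regular
expression, as in this Python sketch.  It assumes that type tags
are runs of upper-case letters (as in the "EM", "AT", and "EX"
tags used in this README), and it accepts any number of digits
rather than enforcing the field widths.  The function name is
our own.

```python
import re

# <cust-id>-<item-num>[_<n>...]-<type-tag>
ITEM_ID = re.compile(
    r"^(?P<cust>\d+)-(?P<item>\d+)(?P<subs>(?:_\d+)*)-(?P<tag>[A-Z]+)$"
)

def parse_item_id(item_id):
    """Split an item id into (cust-id, item-num, sub-numbers, type-tag)."""
    match = ITEM_ID.match(item_id)
    if match is None:
        raise ValueError("unrecognised item id: %r" % item_id)
    # "_1_2" -> [1, 2]; an empty string yields an empty list.
    subs = [int(part) for part in match.group("subs").split("_") if part]
    return match.group("cust"), int(match.group("item")), subs, match.group("tag")

print(parse_item_id("0001-000001-EM"))
print(parse_item_id("0052-000871_1_1-EX"))
```

The custodian id is kept as a string so that its leading zeros
survive.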

3A.III: Files tag
.................

The <files> tag has the form:

   <files>
       <file type="text" path="text/<cust-id>/<item-id>.txt"/>
   </files>

The type is always "text" for this rendering of the collection.
The path gives the path from the top-level directory of the
hierarchy to the text file, assuming the text ZIP files are
unzipped.  The text file is contained under the "text" directory
in a folder named with the custodian numeric id, <cust-id> (or,
in the zipped version of the collection, a file named
<cust-id>.zip).  The name of the file within this directory (or
zip file) is the id of the item, with extension ".txt".

The <files> tag is only present for emails, calendar-like items,
and for attachments or extracted files where at least 1 character
of text has been extracted, as only these files have text
extracted from them.


3A.IV: Relationships tag
........................

The <relationships> tag has the form:

   <relationships>
     <attached_to id="XXX-XXXX-TG" />
     <duplicate_of id="XXX-XXXX-TG" />
     <reply_to id="XXX-XXXX-TG" />
     <extracted_from id="XXX-XXXX-TG" />
   </relationships>

For the <attached_to> relationship, see Section 3C.  For the
<duplicate_of> relationship, see Section 3B.III for email
de-duplication, Section 3C.III for attachment de-duplication, and
Section 3D.III for extracted file de-duplication.  For the
<reply_to> relationship, see Section 3B.II.  For the
<extracted_from> relationship, see Section 3D.


3A.V: File-data tag
...................

The <file-data> tag has the form:

   <file-data size="N" extracted-chars="N">
      <name>CCC</name>
      <extension>.XXX</extension>
      <mime-type>CCC</mime-type>
      <mime-subtype>XXX</mime-subtype>
   </file-data>

It is present only for attachment and "extracted file" types;
see Section 3C and Section 3D below.


3A.VI: Container-data tag
.........................

The <container-data> tag has the form:

   <container-data type="(zip|gzip|tar|pst)" is_extracted="true"
       extracted="N">

It is only found in attachments or extracted files that are
containers.  See Section 3D for details.


3A.VII: Metadata tag
....................

The <metadata> tag has the form:

   <metadata>
      <field name="XXX">CCC</field>
      ...
   </metadata>

See the respective item types below for more comments on
metadata.  Extracted files and attachments have no metadata tags.
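
A minimal Python sketch for turning an item's <metadata> block
into a dictionary follows.  Values are collected into lists in
case a field name occurs more than once; whether that actually
happens in the collection is not something we assert here.  The
sample item and the function name are illustrative only.

```python
import xml.etree.ElementTree as ET

def metadata_fields(item):
    """Map each metadata field name to the list of its values."""
    fields = {}
    for field in item.findall("metadata/field"):
        # An empty <field/> has text None; store it as "".
        fields.setdefault(field.get("name"), []).append(field.text or "")
    return fields

# Made-up sample in the documented shape.
SAMPLE = """<item id="001-000001-EM" type="email">
  <metadata>
    <field name="subject">Status</field>
    <field name="sender_access">SMTP</field>
    <field name="in_reply_to"/>
  </metadata>
</item>"""

fields = metadata_fields(ET.fromstring(SAMPLE))
print(fields)
```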


3A.VIII: Common metadata fields
...............................

The following metadata fields appear in multiple different
item types.  The meanings of the fields are as inferred by us
from the data itself, and from online references for Outlook,
PST, and email data.

   - create_date : the date and time the item was created, in UTC
   - modify_date : the date and time the item was last modified,
       in UTC
   - file_as     : a human-readable reference for the item
       (possibly automatically generated by Outlook or Exchange).
       For emails, this may be the sent-to address, or the name of a
       file attached to the email.  For contacts, this is the name
       of the contact.  For appointments, this appears to be the
       name of the person making the appointment, which is not
       otherwise captured.  For tasks, it is sometimes the name
       of the task, sometimes the name of the file attached to
       the task, and sometimes seems to be an identifier of the
       alias to whom the task is assigned (such as "@HelpDesk").
   - outlook_version : the version of Outlook used to create the
       item (mostly 9.0, but with other versions varying from
       8.0 up to 10.0).
   - response_requested : whether the creator of the item would
       like a response.  Found in emails, calendar items,
       reports, and contacts, though semantics in reports and
       contacts is unclear.
   - subject : the subject of the email, of the meeting or task
       for a calendar item, or a summary of the report type
       for reports.
   - flags : a bitmap in which the bits have the following
       semantics (taken from libpst.h):

         PST_FLAG_READ           0x01
         PST_FLAG_UNMODIFIED     0x02
         PST_FLAG_SUBMIT         0x04
         PST_FLAG_UNSENT         0x08
         PST_FLAG_HAS_ATTACHMENT 0x10
         PST_FLAG_FROM_ME        0x20
         PST_FLAG_ASSOCIATED     0x40
         PST_FLAG_RESEND         0x80
         PST_FLAG_RN_PENDING     0x100
         PST_FLAG_NRN_PENDING    0x200
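
The bitmap can be decoded by testing each bit in turn, as in
this Python sketch.  It takes the flags value as an integer; how
the value is written in the XML (for example, decimal or
hexadecimal) is not specified here, so the conversion from the
field text is left to the caller.

```python
# Bit values as listed above (taken from libpst.h).
PST_FLAGS = {
    0x01: "PST_FLAG_READ",
    0x02: "PST_FLAG_UNMODIFIED",
    0x04: "PST_FLAG_SUBMIT",
    0x08: "PST_FLAG_UNSENT",
    0x10: "PST_FLAG_HAS_ATTACHMENT",
    0x20: "PST_FLAG_FROM_ME",
    0x40: "PST_FLAG_ASSOCIATED",
    0x80: "PST_FLAG_RESEND",
    0x100: "PST_FLAG_RN_PENDING",
    0x200: "PST_FLAG_NRN_PENDING",
}

def decode_flags(value):
    """Return the names of all flags set in an integer bitmap."""
    return [name for bit, name in PST_FLAGS.items() if value & bit]

print(decode_flags(0x11))
```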

3B. Email
---------

Email items have two transport types: Exchange (EX) or SMTP
(SMTP).  The type of transport affects the address metadata
fields: Exchange messages have names or LDAP identifiers for
addresses; SMTP messages have RFC822 email addresses.  The
transport type is specified in the "sender_access" metadata
field.

3B.I: Email metadata
......................

Email items commonly have the following metadata fields:

   - arrival_date : the date and time at which the email
       arrived, in UTC.
   - autoforward : whether the email was automatically 
       forwarded to another email address.  The forwarded-to
       address does not appear to be captured in the email
       metadata.
   - bcc_address : a list of addresses the email was BCC'ed
       to.  The addresses are expressed either as names
       (for Outlook messages) or as RFC822 email addresses
       (for SMTP messages).   This metadata field is only
       found in the personal folder of the email's sender.
   - cc_address : a list of addresses the email was CC'ed to.
       Format as for "bcc_address".
   - delete_after_submit : inverse of whether a copy of the
       email should be saved at the sender's end.  Semantics
       are unclear: some emails marked "delete_after_submit"
       are still found in the Outbox of the sender (particularly
       notifications about viruses).
   - delivery_report : does the sender request an automatic
       report of the message having been successfully delivered?
       This will sometimes (though not always) match up with
       a "report" item that has subject "Delivered: <original
       subject>".
   - importance : the sender-assigned "importance" of an email,
       displayed to the recipient by Outlook.  Either "normal"
       (96% of cases), "high" (3.6% of cases), or "low" (a
       handful of cases).
   - in_reply_to : the message id that this email is in reply
       to.  If that email is also contained in the collection,
       we record a <reply_to> relationship (Section 3B.II).
   - message_cc_me : is the custodian in the list of CC
       addresses?
   - message_recip_me : is the custodian in either the list of
       CC addresses or the list of TO addresses?
   - message_to_me : is the custodian in the list of TO
       addresses?
   - messageid : the unique identifier for this email.
   - original_cc : only occurs a handful of times, when it
       gives a longer list of CCs than the cc_address field.
       The meaning of the difference is unclear.
   - original_to : only occurs a handful of times, when it
       gives an alternative recipient list to the sentto_address
       field.  The meaning of the difference is unclear.
   - original_sensitivity : differs from "sensitivity" (see
       below) in 144 instances.  In all but one of these,
       "sensitivity" is set to a higher level of sensitivity
       than "original_sensitivity".  The meaning of the
       difference is unclear (perhaps an amendment of the
       recipient?).
   - outlook_recipient_name : the name of the recipient by which
       this email made it into the custodian's personal folder;
       that is, the name of the custodian as Outlook understands
       it, though sometimes it is an alias or a role (such as
       "@HelpDesk"), and in a few cases the name appears to be
       erroneous (or possibly indicates that a custodian took 
       over another user's personal folder).
   - outlook_sender_name : the name of the sender of the email,
       as a natural name (without email address).
   - priority : an indicator of the priority of the email.  One 
       of "normal" (76%), "nonurgent" (21%), or "urgent" (3%).
       We believe this is explicitly set by the sender, and is
       intended for the attention of the recipient.
   - processed_subject : the original subject of the thread to
       which this email belongs.  Frequently, this is the "subject"
       field with "RE:" and/or "FW:" stripped from the front.
   - read_receipt : whether the sender requests a report when
       the recipient reads the email.  Sometimes (though not
       always) matches up with a "report" item that has
       subject "Read: <original subject>".
   - recip_access : the transport mechanism by which the
       recipient received the email.  One of EX (most common)
       or SMTP (about 1% of cases).
   - recip_address : the address of the recipient.  For messages
       delivered via SMTP, an RFC822 address; for messages
       delivered via EX, an LDAP identifier.
   - reply_requested : whether a reply is requested from the
       recipient.  Holds the same value as "response_requested"
       for all but a handful of emails.
   - reply_to : the address to which replies should be sent,
       when this is not the same as the sender.  For SMTP
       messages, this is an RFC822 email address; for EX messages,
       this is a natural name.
   - return_path_address : addresses for bounces to go to.  Only
       for SMTP messages.
   - sender_access : the transport method which the email came
       from.  Most common values are "EX" and "SMTP", but with
       a couple of hundred "FAX" (an automated fax-to-email
       bridge, with the fax attached as a TIFF), a hundred-odd
       NONE, and a handful of HANDMAIL and SYSTEM.
   - sender_address : the address of the sender.  For SMTP
       messages, this is an RFC822 email address; for EX
       messages, this is an LDAP identifier.
   - sender2_access : the transport method corresponding to
       the "sender2_address".  This may differ from
       "sender_access" if "sender2_address" (see next) differs
       from "sender_address".
   - sender2_address : for six-hundred-odd emails, this field
       differs from the "sender_address" field.  Common cases
       are where the former gives an RFC822 address and the 
       latter an Exchange LDAP identifier; or the former gives 
       a mailing list address, and the latter a personal
       address.
   - sensitivity : the sensitivity of the email, as (we believe)
       explicitly set by the sender.  One of "none" (almost
       all emails), "company confidential" (235 emails),
       "personal" (10), or "private" (67).  See also
       "original_sensitivity".  All emails with "sensitivity"
       of "private" or "personal" have been redacted (Section 5B).
   - sent_date : the date and time the email was sent, in UTC.
   - sentto_address : the address or addresses to which the
       email was addressed.  Addresses are either natural names
       or RFC822 email addresses.  This will not be the same
       as "outlook_recipient_name" if:
          + the message is outgoing
          + the custodian was CC'ed
          + the custodian was part of a list of TO addresses
          + the email was sent to a group list (e.g. "All
              Employees") to which the custodian belonged
   
Various other metadata fields occur less frequently in emails,
such as "X-" header fields pulled in from SMTP messages.

Certain other metadata fields are present in the PST version
of the email, but have been removed from the processed collection
as redundant or uninformative:

   + htmlbody : an HTML rendering of the body
   + Internet Charset Body: a rendering of the body in an 
       "internet-friendly" character set

   + conversion_prohibited : value is always "0"
   + ndr_diag_code:   value is always "0"
   + ndr_reason_code: value is always "0"
   + ndr_status_code: value is always "0"
   + rtf_body_tag:    information about RTF rendering of message,
       which is not retained.
   + rtf_body_char_count:   (ditto)   
   + rtf_body_crc:          (ditto)
   + rtf_body_in_sync:      (ditto)
   + rtf_in_sync:           (ditto)
   + rtf_ws_prefix_count:   (ditto)

   + recip2_access     : always the same as recip_access
   + recip2_address    : always the same as recip_address

   + outlook_sender    : can be reconstructed as 
       sender_access + ":" + sender_address
   + outlook_sender2   : can be reconstructed as
       sender2_access + ":" + sender2_address
   + outlook_recipient : can be reconstructed as
       recip_access + ":" + recip_address
   + outlook_recipient2: can be reconstructed as
       recip2_access + ":" + recip2_address
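
The reconstructions listed above amount to simple string
concatenation.  The Python sketch below applies them to a flat
dictionary of field values.  Since "recip2_access" and
"recip2_address" were themselves removed as always equal to the
"recip_" fields, the sketch substitutes those.  The sample
addresses and the function name are made up.

```python
# Removed field -> (access field, address field) it is rebuilt from.
RECONSTRUCTIONS = {
    "outlook_sender": ("sender_access", "sender_address"),
    "outlook_sender2": ("sender2_access", "sender2_address"),
    "outlook_recipient": ("recip_access", "recip_address"),
    # recip2_* always equalled recip_*, so reuse the retained fields.
    "outlook_recipient2": ("recip_access", "recip_address"),
}

def reconstruct(meta):
    """Rebuild the removed outlook_* fields that can be derived."""
    rebuilt = {}
    for name, (access, address) in RECONSTRUCTIONS.items():
        if access in meta and address in meta:
            rebuilt[name] = meta[access] + ":" + meta[address]
    return rebuilt

# Made-up field values for illustration.
meta = {
    "sender_access": "SMTP",
    "sender_address": "alice@example.com",
    "recip_access": "EX",
    "recip_address": "/O=EXAMPLE/CN=BOB",
}
rebuilt = reconstruct(meta)
print(rebuilt)
```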


3B.II: Email replies
....................

PST metadata includes an "in_reply_to" field for certain emails,
containing the message ID of the email that this email is replying
to.  We resolve and mark these reply-to relationships.  In the
replying email, the relationship is marked by the tag:

   <relationships>
       <reply_to id="XXX"/>
   </relationships>

where the ID refers to the collection ID (_not_ the message id)
of the replied-to email.  In the replied-to email, the number
of replies (but not the reply ids) is recorded as the "replies"
attribute of the <item> tag.
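
Since the replied-to email records only a count, the ids of its
replies have to be recovered by inverting the <reply_to> links.
A Python sketch of that inversion follows; it only sees links
among the items it is given, and the sample ids are made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def replies_by_parent(root):
    """Map each replied-to item id to the ids of its replies."""
    replies = defaultdict(list)
    for item in root.iter("item"):
        link = item.find("relationships/reply_to")
        if link is not None:
            replies[link.get("id")].append(item.get("id"))
    return dict(replies)

# Made-up sample in the documented shape.
SAMPLE = """<items>
  <item id="001-000001-EM" type="email" replies="2"/>
  <item id="001-000002-EM" type="email">
    <relationships><reply_to id="001-000001-EM"/></relationships>
  </item>
  <item id="001-000003-EM" type="email">
    <relationships><reply_to id="001-000001-EM"/></relationships>
  </item>
</items>"""

root = ET.fromstring(SAMPLE)
replies = replies_by_parent(root)
print(replies)
```

The length of each list can be compared against the "replies"
attribute of the replied-to item as a consistency check.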


3B.III: Email de-duplication
..........................

Two emails are considered duplicates if and only if they
have the same message ID and the same subject line.  

Note that some messages with the same message ID have different
subjects and contents.  For instance, the notification that a
message could not be delivered or contained a virus has the same
message ID as the original message, but different contents.
Surprisingly, message IDs are sometimes identical even for quite
different messages.  It also happens that two emails are
seemingly the same (same message ID, sender, and date), but
have different subjects.  The deduplication carried out here
should be considered a reasonable approximation, not perfect.
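
The duplicate rule stated at the start of this section can be
sketched in Python as a grouping on the pair (message ID,
subject).  The dictionaries used to represent emails, and the
choice to skip emails with no message ID, are assumptions made
for this example rather than a description of the processing
actually used.

```python
from collections import defaultdict

def duplicate_sets(emails):
    """Group email ids that share both message ID and subject."""
    groups = defaultdict(list)
    for email in emails:
        message_id = email.get("messageid")
        if not message_id:
            continue  # assumption: no message ID means no duplicates
        groups[(message_id, email.get("subject", ""))].append(email["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Made-up records for illustration.
emails = [
    {"id": "001-000001-EM", "messageid": "<a@x>", "subject": "Plan"},
    {"id": "002-000009-EM", "messageid": "<a@x>", "subject": "Plan"},
    {"id": "003-000004-EM", "messageid": "<a@x>", "subject": "Undeliverable: Plan"},
    {"id": "003-000005-EM", "subject": "Plan"},
]
sets = duplicate_sets(emails)
print(sets)
```

The third record shares a message ID with the first two but has
a different subject, so it is not placed in their duplicate set.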

We attempt to identify the sender's version of an email and make
it the canonical version in a duplicate set.  The sender version
is detected first as the version for which "outlook_sender_name"
is the name of the custodian who owns the mailbox; this is
determined by looking at the "outlook_recipient_name" values of
messages in the mailbox where the "message_to_me" metadata field
is "1".  If no emails in the duplicate set have the custodian's
name as the sender name, or if more than one does, then the email
having a "bcc_address" field is selected.  If more than one email
ties, then an arbitrary email is chosen.  Note that the sender may
have more than one copy of the email in their mailbox; for
instance, one may be in their "sent" folder, while they may have
explicitly CC'ed another to themselves.
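
The selection steps above might be sketched in Python as
follows.  This is one reading of the description, not the code
that produced the collection: taking the most frequent
"outlook_recipient_name" as the custodian's name, and taking
the first candidate as the "arbitrary" choice, are assumptions.
The dictionary layout and function names are hypothetical.

```python
from collections import Counter

def infer_custodian_name(mailbox):
    """Guess a custodian's name from emails addressed to them."""
    names = Counter(
        email["outlook_recipient_name"]
        for email in mailbox
        if email.get("message_to_me") == "1"
        and "outlook_recipient_name" in email
    )
    # Assumption: take the most frequent name.
    return names.most_common(1)[0][0] if names else None

def choose_canonical(duplicates, custodian_names):
    """Return the id of the canonical email in a duplicate set."""
    def is_sender_copy(email):
        owner = custodian_names.get(email["custodian"])
        return owner is not None and email.get("outlook_sender_name") == owner

    sender_copies = [e for e in duplicates if is_sender_copy(e)]
    if len(sender_copies) == 1:
        return sender_copies[0]["id"]
    # Zero or several sender copies: fall back to a copy with BCC data.
    with_bcc = [e for e in duplicates if "bcc_address" in e]
    # Assumption: "arbitrary" is taken to mean the first candidate.
    return (with_bcc or duplicates)[0]["id"]

# Made-up records for illustration.
mailbox_001 = [
    {"message_to_me": "1", "outlook_recipient_name": "Alice Example"},
    {"message_to_me": "0", "outlook_recipient_name": "@HelpDesk"},
]
names = {"001": infer_custodian_name(mailbox_001), "002": "Bob Example"}
dups = [
    {"id": "002-000010-EM", "custodian": "002",
     "outlook_sender_name": "Alice Example"},
    {"id": "001-000005-EM", "custodian": "001",
     "outlook_sender_name": "Alice Example"},
]
canonical = choose_canonical(dups, names)
print(canonical)
```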

The duplicate versions of an email point to the canonical
version with the tag:

   <relationships>
       <duplicate_of id="XXX"/>
   </relationships>

while the canonical version of the email will count the number
of duplicates (not including itself) in the "duplicates"
attribute of the <item> tag.  An email that is not in a
duplicate set has neither the "duplicates" attribute nor the
<duplicate_of> tag.

Both canonical and duplicate versions of an email are listed in
full in metadata and as text.  De-duplication only records the
duplicate links between these items.

Duplicate emails may have differing attachments.  Sometimes files
attached by the sender are stripped off at the receiver (VCARDs,
for instance, or other attachment types).  Occasionally, the
receiver has attachments that the sender did not send; these
appear to be automatically-generated attachments, such as
notifications by virus checkers.

3C. Attachments
---------------

An attachment is a file attached to another PST item type.  Note
that not only emails may have attachments, but also calendar-like
items, contacts, and even reports.  Attachment ids are counted
sequentially from the item they are attached to.  So, for
instance, the first attachment of item:

   0001-000001-EM

will be:

   0001-000001_1-AT

and the second will be

   0001-000001_2-AT

Attachments have no metadata tags.

3C.I: Attachment links
......................

The item that an attachment is attached to is identified by
the:

   <relationships>
      <attached_to id="XXX"/>
   </relationships>

tag.  Items that are capable of having attachments (all except
for attachments and extracted files) have the number of their
attachments counted in the "attachments" attribute of the <item>
tag.


3C.II: Attached file
....................

Information about the attached file is held in the <file-data>
tag, as follows:

   <file-data size="N" extracted-chars="N">
      <name>CCC</name>
      <extension>.XXX</extension>
      <mime-type>CCC</mime-type>
      <mime-subtype>XXX</mime-subtype>
   </file-data>

The "size" attribute is the size in bytes of the native format 
of the attachment.  The "extracted-chars" gives the number of
characters extracted from the file as text.  Text extraction is
performed using Tika version 1.1.  If any text is successfully
extracted (and the item is not redacted), then the extracted text
can be found in the location given by the <files><file></files>
element.

The <name> and <extension> are taken from PST metadata.  Some
attachments are anonymous, and have no name or extension.  The
<mime-type> and <mime-subtype> are as detected by the UNIX "file"
utility.


3C.III: Attachment de-duplication
.................................

Attachments are de-duplicated based upon the md5sum of the
native format of the file.  The canonical version of a set
of duplicate attachments is the first one encountered in the 
collection.  Duplicates and canonical versions are marked
using the <relationships><duplicate_of></relationships> tag 
and "duplicates" attribute, as with emails (see Section 3B.III).
Attachment and email de-duplication is calculated independently:
two emails can have duplicate attachments without themselves
being duplicates.  Some items (for example, vcards) have a very
large number of duplicates in the collection.  As with emails,
duplicate attachments are included verbatim in the collection.
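
The first-encountered rule can be illustrated with the Python
sketch below.  It hashes byte strings that stand in for native
attachment files; this shows the procedure only, and the sample
ids, contents, and function name are made up.

```python
import hashlib

def dedupe_by_md5(files):
    """Link later copies of identical content to the first copy seen.

    `files` is an iterable of (item_id, content_bytes) pairs in
    collection order.  Returns (duplicate_of, duplicate_counts).
    """
    canonical = {}     # md5 digest -> id of first item seen
    duplicate_of = {}  # duplicate id -> canonical id
    counts = {}        # canonical id -> number of duplicates
    for item_id, content in files:
        digest = hashlib.md5(content).hexdigest()
        if digest in canonical:
            first = canonical[digest]
            duplicate_of[item_id] = first
            counts[first] = counts.get(first, 0) + 1
        else:
            canonical[digest] = item_id
    return duplicate_of, counts

# Made-up contents for illustration.
files = [
    ("001-000001_1-AT", b"BEGIN:VCARD"),
    ("001-000002_1-AT", b"quarterly report"),
    ("002-000007_1-AT", b"BEGIN:VCARD"),
]
duplicate_of, counts = dedupe_by_md5(files)
print(duplicate_of, counts)
```

Here "duplicate_of" corresponds to the <duplicate_of> links and
"counts" to the "duplicates" attribute on the canonical item.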


3D. Extracted files
-------------------

An attached file could be a container which itself holds 
other files.  These files are recursively extracted, with
the extracted files having type "extracted file".  Container
files of type ZIP, TAR, GZIP, and PST are handled.  GZIP
containers will only have a single extracted file.  Note that
PST attachments are handled, and their contents fully extracted.

The ID of an extracted file is numbered sequentially from the
attachment or extracted file it is extracted from.  So, for
instance, if:

    0052-000871-AT

is a container, then the first item in the container is:

    0052-000871_1-EX

and if that item itself is a container, then the first
item inside that container is:

    0052-000871_1_1-EX

and so forth.

Extracted files have no metadata tags.


3D.I: Extracted file links
..........................

A file that is a container file has a <container-data> tag
with the following form:

   <container-data type="(zip|gzip|tar|pst)" is_extracted="true"
       extracted="N">

The "extracted" attribute gives the number of items extracted
from the container.  An extracted file points back to the
container it was extracted from using the:

   <relationships>
     <extracted_from id="XXX-XXXX-TG" />
   </relationships>

field.


3D.II: Extracted file
.....................

Information about the extracted file is held in the <file-data>
tag.  This has the same form and semantics as for attachments
(Section 3C.II).


3D.III: Extracted file de-duplication
.....................................

Extracted files are de-duplicated based on the md5sum values of
their native versions.  De-duplication is performed and noted in
the same way as for attachments (Section 3C.III).  Extracted
files and (top-level) attachments can be duplicates of each
other.


3E. Calendar-like items
-----------------------

There are three item types that we class as calendar-like items:
appointments; schedules; and tasks.   Appointments are full
calendar items, with start and end times, alarms, reminders
and so forth (Section 3E.I).  Schedules and tasks have only the
common metadata fields described in Section 3A.VIII, most
particularly "create_date" and "subject", as well as a "body" that
goes into the text rendering of the item (Section 4C).
Appointments may have the following metadata items:

   - alarm : whether to raise an alarm to the user shortly
       before the appointment.
   - alarm_minutes : the number of minutes before the appointment
       to raise the alarm.
   - all_day : is this an all-day appointment?
   - end : the end time for the appointment (in UTC).
   - is_recurring : is this a recurring appointment?
   - label : always "None" in this collection.
   - location : the location of the appointment, as a natural
       name (XXX's office, YYY conference room); or a note saying 
       that the location is to be determined; or the name of the
       person the meeting is with; or the phone number to call
       for a teleconference.
   - recurrence_description : a textual description of the
       meeting recurrence (e.g. "every Friday from 11:00AM to
       12:30PM").
   - recurrence_end : the date and time on which the appointment
       stops recurring (in UTC).
   - recurrence_start : the date and time on which the 
       appointment starts recurring (in UTC).
   - recurrence_type : the frequency of recurrence of the item
       ("daily", "weekly", "monthly", or "yearly")
   - reminder : the date and time at which to send a reminder
       (in UTC).
   - showas : how the time is shown in the user's calendar
       ("busy", "free", "out of office", or "tentative").
   - start: start time for the appointment (in UTC).
   - timezonestring : a string giving the timezone in which
       the meeting is being scheduled.


3F. Contacts
------------

Contact items hold metadata information about the contacts held
in a custodian's PST file.  

Contact items are _not_ rendered into the text version of the
collection; they are _only_ stored in the XML data.  Vcards
that are attached to emails, however, are stored in Vcard format
in the text representation, and are not parsed into metadata in
the XML metadata representation.

The metadata fields occurring in more than a handful of contacts 
(in addition to the common metadata fields described in Section
3A.VIII) are:

   - account_name : typically, name of user's system login
       account.
   - address1 : the person's "address", normally either as an RFC822
       email address or as an LDAP identifier, but sometimes
       simply their name.
   - address1_desc : where this differs from "address1", it
       typically gives the natural name of the person identified
       by address1.
   - address1_transport : the transport mechanism for sending 
       messages to "address1".  One of "SMTP", "EX", "FAX", or
       "MAILTO".
   - address2 : a secondary address for the contact (such as
       an alternative email address).
   - address2_desc : as for "address1_desc".
   - address2_transport : as for "address1_transport".
   - address3 : a tertiary address for the contact.
   - address3_desc : as for "address1_desc".
   - address3_transport : as for "address1_transport".
   - assistant_name : name of assistant to contact (only for
       44 contacts).
   - assistant_phone : phone number of assistant to contact (only
       for 40 contacts).
   - business_address : normally the full business mailing address 
       of the contact, but sometimes just a city name.
   - business_city : the city of the contact's business address
       (though occasionally a full address is placed here).
   - business_country : the country of the contact's business
       address, but often in fact used for the zip or postal
       code.  Country names include variant or mis-spellings.
   - business_fax : business fax number of the contact.
   - business_homepage : homepage of the business the contact
       works for (not generally of the contact's own business
       web page).
   - business_phone : contact's business phone number.
   - business_phone2 : alternative business phone number for
       contact, but often in fact holds a country name.
   - business_postal_code : business postal or zip code for
       contact.
   - business_state : state of contact's business address
       (but sometimes city or country names are found instead).
   - business_street : street number and name of contact's
       business address.
   - car_phone : number of contact's car phone (incredibly, there
       are 65 of these).
   - company_main_phone : the main or central phone number of the 
       company.
   - company_name : the name of the company the contact works
       for.
   - def_postal_address : default postal address for the contact.
       The same as business_address in all but under 70 cases.
   - department : the company department the contact works for
       (only present in 700-odd cases).
   - display_name_prefix : prefix for contact's display name
       (most common is "Herr" [sic.], with several "Mr.", 
       "Frau" [sic.], and "CEO", and a couple of "Chief
       Smarty Pants").
   - EXT : phone extension of contact (mostly for employees of
       AvocadoIT).
   - first_name : first name of contact.
   - followup : present in around 300 cases, all but a handful
       of which have the value "Follow up".
   - full_name : full name of contact.
   - gender : gender of contact ("unspecified" in all save a
       handful of cases).
   - initials : the initials of the contact.
   - isdn_phone : an alternative location for the contact's
       phone extension.
   - job_title : contact's job title.
   - mail_permission : always has value "0"; meaning unclear.
   - manager_name : name of contact's manager.  Occurs in
       91 cases.
   - middle_name : middle name of contact.
   - office_loc : contact's office number, or city, or
       simply "offsite".
   - personal_homepage : found in 39 contacts.  In most
       cases, is actually a company home page.
   - primary_fax : main fax number of contact (only 18 instances).
   - primary_phone : main phone number of contact (a few
       hundred instances).
   - rich_text : possibly, whether the contact prefers to receive
       emails in RTF format.  "0" for all save two cases that are
       "1".
   - suffix : suffix to contact's name.  Sometimes academic
       or professional qualifications; sometimes country or 
       company name; sometimes personal name modifier such
       as "II" or "Jr".
   - surname : contact's surname.
   - work_address_city : city part of contact's work address.
       Identical to "business_city" for all except a few dozen
       entries.
   - work_address_country : country part of contact's work address.
       Less frequently contains a zip or postal code than 
       "business_country".
   - work_address_postalcode : postal code of business address.
       Less frequently contains a country name than 
       "business_postal_code".
   - work_address_state : state part of contact's work address.
       Identical to "business_state" for all except a dozen
       entries (and some of them are variant representations).
   - work_address_street : street part of contact's work address.
       Identical to "business_street" for all except a few
       dozen entries.

Certain additional metadata fields containing personal
information have been redacted; see Section 5C.

3G. Reports
-----------

Report items contain metadata reporting the successful completion
of actions in the Exchange system.  For instance, if a sender
sends an email with subject "Foo Bar Baz" and "read_receipt" set
to "1", then when the recipient reads the email, a report with
"subject" of "Read: Foo Bar Baz" will be created in the
recipient's PST folder.  These report items are _not_ rendered
into the text version of the collection; they are _only_ stored
in the XML metadata.

Reports contain only the common metadata fields described in
Section 3A.VIII.


4. TEXT FILES
=============

The text rendering of the collection is stored under the "text/"
subdirectory.  The items for a custodian are stored in a ZIP file
with that custodian's ID as the name, e.g. "text/003.zip".  Each
item (email, attachment or extracted file, or calendar-like item)
is stored in its own file within the zip file.  The name of the
file is the item's ID followed by ".txt".  Contacts and reports
do _not_ have a text rendering.
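
As an illustration of this layout, the following minimal Python
sketch reads every item of one custodian from its ZIP file.  The
function name "item_texts" is hypothetical, and the sketch assumes
only what is stated above: one ".txt" file per item, named by item
ID, in utf8 encoding.

```python
import zipfile


def item_texts(zip_path):
    """Yield (item_id, text) for each ".txt" member of a custodian ZIP."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".txt"):
                continue
            # The member name is the item's ID followed by ".txt";
            # drop any directory prefix that the archive may contain.
            item_id = name.rsplit("/", 1)[-1][: -len(".txt")]
            yield item_id, zf.read(name).decode("utf-8")
```

For example, "dict(item_texts('text/003.zip'))" would map each item
ID of custodian 003 to its text.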

All text files in the collection are in utf8 encoding.  The
encoding of source files in the input collection was rarely
stated.  Where files could not be interpreted as in utf8 encoding
(around 50,000), conversion was attempted from cp1252 encoding.
Where that too was impossible (103 files), invalid byte
sequences were replaced by Unicode character U+FFFD (a question
mark on a black diamond).  Several files (around 2,000) were
originally in a Japanese-specific encoding (mostly iso-2022-jp),
but have become corrupted, either in the PST files themselves or
through the libpst utility used to extract them.  These files
contain '?' (ASCII character code 0x3F) instead of the original
Japanese characters.
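
The fallback order described above can be sketched in Python as
follows.  This is an illustration only, not the code used to build
the collection; the function name "decode_text" is hypothetical,
and Python's "cp1252" codec may differ in detail from the converter
that was actually used.

```python
def decode_text(raw: bytes) -> str:
    """Decode bytes as utf8, then cp1252, then utf8 with U+FFFD."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Neither encoding fits: keep what can be read and substitute
    # U+FFFD for each invalid byte sequence.
    return raw.decode("utf-8", errors="replace")
```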

Two XML files (064.xml and 283.xml) do not have associated text files.
For 064.xml, the source file name is DS8178.PST, and all the extracted 
items (all audio files) are redacted. For 283.xml, the source file name
is work.pst, and it only contains contacts. Since contacts are not 
rendered into the text version, the corresponding zip file is not
included.

4A. Email text renderings
-------------------------

Emails may come from within MS Exchange or from a non-Exchange,
SMTP client.  SMTP emails come with full SMTP header information,
and this is retained verbatim in the text version of the email.  
Exchange emails (marked with access type EX in the metadata),
however, do not have instantiated headers stored in the PST file.

4A.I: Reconstructed Headers
...........................

For the convenience of users wishing to index the extracted text
files with no or minimal reference to the metadata, we have
reconstructed email headers in RFC822-like format for Exchange
emails.  The reconstructed header fields are:

4A.I(i): From

- name taken from the "outlook_sender_name" metadata field, with
    email taken from the "sender_address" metadata field if that
    contains an RFC822 address, or email address guessed from
    name otherwise (see Section 4A.II below)

4A.I(ii): To

- name(s) taken from "sentto_address", address(es) guessed from
    name(s) (see Section 4A.II below)

4A.I(iii): Cc

- name(s) taken from "cc_address", address(es) guessed from
    name(s) (see Section 4A.II below)

4A.I(iv): Bcc

- name(s) taken from "bcc_address", address(es) guessed from
    name(s) (see Section 4A.II below)

4A.I(v): Subject

- taken from "subject" metadata field.

4A.I(vi): Date

- taken from "sent_date" metadata field, reformatted in 
    RFC822 format.

4A.I(vii): Message-ID

- taken from "messageid" metadata field.

4A.I(viii): In-Reply-To

- taken from "in_reply_to" metadata field.
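
The field mapping above can be sketched in Python as below.  This
is a hedged illustration, not the code used to build the
collection.  It assumes (hypothetically) that the metadata of one
email has been loaded into a dict "meta" keyed by the field names
above, that the recipient fields hold lists of names, that
"sent_date" is a timezone-aware datetime, and that "guess_address"
is a caller-supplied function mapping a name to an address.

```python
from datetime import datetime, timezone
from email.utils import formataddr, format_datetime


def reconstruct_header(meta, guess_address):
    """Build an RFC822-like header block from a metadata dict."""
    lines = []
    name = meta.get("outlook_sender_name", "")
    sender = meta.get("sender_address", "")
    if "@" not in sender:  # crude test for an RFC822 address
        sender = guess_address(name)
    lines.append("From: " + formataddr((name, sender)))
    for header, field in (("To", "sentto_address"),
                          ("Cc", "cc_address"),
                          ("Bcc", "bcc_address")):
        names = meta.get(field) or []
        if names:
            pairs = [formataddr((n, guess_address(n))) for n in names]
            lines.append(header + ": " + ", ".join(pairs))
    lines.append("Subject: " + meta.get("subject", ""))
    lines.append("Date: " + format_datetime(meta["sent_date"]))
    for header, field in (("Message-ID", "messageid"),
                          ("In-Reply-To", "in_reply_to")):
        if meta.get(field):
            lines.append(header + ": " + meta[field])
    return "\n".join(lines)
```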

For SMTP emails, the header is provided in the PST file, and this
is prepended to the body, separated by a blank line, in the text
rendering.  In some cases, the header appears to be corrupted
(whether due to issues with Outlook, or with libpst, or perhaps
with the original mailer); for instance, email addresses are
malformed, or the "To: " header is empty and has "Subject: "
appearing directly after on the same line.  These corruptions are
left as-is in the text rendering; some care must be employed when
using standard email parsers to parse these emails (for instance,
version 1.4 of the Java Mail API raises parse exceptions on
around 650 of the emails in the collection).
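
One tolerant way to read such files, sketched here in Python, is
to parse only the header block and leave the body as plain text.
The function name "parse_rendered_email" is hypothetical.  The
sketch relies on the standard library's "HeaderParser", which
records malformed input as defects on the message rather than
raising exceptions; header values are returned as raw strings and
still need validating before use.

```python
from email.parser import HeaderParser


def parse_rendered_email(text):
    """Return (list of (name, value) headers, body) for one email."""
    # HeaderParser stops at the first blank line and keeps the
    # remainder of the text as an unparsed string payload.
    msg = HeaderParser().parsestr(text)
    return msg.items(), msg.get_payload()
```

With the corruption described above, a line such as
"To: Subject: hello" yields a "To" header whose value is
"Subject: hello", and no "Subject" header at all.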

4A.II: Reconstructed email addresses
....................................

When reconstructing email headers for Exchange messages, we
attempted to assign an RFC822 email address to every name, as this
information is not directly provided in the PST files.  To
find these name--address mappings, we looked in three sources:

   a.) Contact files
   b.) The "outlook_sender_name" and "sender_address" fields of
         SMTP messages.
   c.) The "outlook_recipient_name" and "sentto_address" fields
         of SMTP messages.

The order of precedence for selecting addresses (from
highest to lowest) was:

   1. addresses with "avocadoit" in them from contact files.
   2. addresses with "avocadoit" in them from sources b.) and
         c.)
   3. addresses without "avocadoit" from all sources

Within a precedence category, the address with the highest
total appearance count was selected.  Three addresses were
manually assigned:

      "sriram" <sriram@avocadoit.com>
      "David Hosch" <david.hosch@avocadoit.com>
      "DA" <doug.adkins@everypath.com>

These manual assignments were necessary because of confusion
caused by aliased mail addresses such as "support@avocadoit.com"
that were redirected to a custodian.

Where no email address could be found for a name, we generate
a fake email address of the form "PERSON.NAME@no-address.com".
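
The selection rule can be sketched in Python as follows.  This is
an illustration under stated assumptions, not the code used to
build the collection: the candidate addresses for a name are
assumed to be given as (address, source, count) tuples with source
"contact", "sender", or "recipient"; counts are summed within a
precedence category; and the fake address is assumed to be formed
by joining the parts of the name with dots.  All function names
are hypothetical.

```python
from collections import Counter


def precedence(address, source):
    """Return 0, 1 or 2; lower values take precedence."""
    internal = "avocadoit" in address.lower()
    if internal and source == "contact":
        return 0
    if internal:
        return 1
    return 2


def choose_address(name, observations):
    """Pick one address for a name from (address, source, count)."""
    totals = Counter()
    for address, source, count in observations:
        totals[(precedence(address, source), address)] += count
    if not totals:
        # No address known: fall back to a fake address.
        return ".".join(name.split()) + "@no-address.com"
    best_rank = min(rank for rank, _ in totals)
    in_rank = {addr: n for (rank, addr), n in totals.items()
               if rank == best_rank}
    # Within the best category, take the most frequent address.
    return max(in_rank, key=in_rank.get)
```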


4B. Attachment and extracted file text rendering
------------------------------------------------

Text is extracted from attachments and extracted files using
Apache Tika 1.1.  If any text is successfully extracted (and the
file is not redacted; Section 5B), then the extracted text is
placed in the text rendering of the collection.  If no text is
successfully extracted (for instance, the file is an image), then
no text file is created.  Container items do not have text
directly extracted from them, though the extracted files they
contain may have extracted text.


4C.  Calendar-like item text rendering
--------------------------------------

For the convenience of users wishing to index the extracted text
files with no or minimal reference to the metadata, we create a
text file presentation of calendar items in the text section of
the collection.  The calendar item is in pseudo-RFC822 format,
with a header containing <key>:<value> pairs per line, and a
body, with a blank line separating the two.

The header fields are "Subject" and (for appointment items only)
"Location", "Start", "End", and "Recurrence".  The body is as
extracted from the PST file.  Note that the body is _not_
contained in the XML metadata file; it is only contained in the
text rendering of the calendar item.
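
A file in this format can be split into header fields and body
with a few lines of Python.  The sketch below is illustrative;
the function name "parse_calendar_item" is hypothetical, and it
assumes "\n" line endings and one header field per line.

```python
def parse_calendar_item(text):
    """Return (dict of header fields, body) for a calendar item."""
    # The first blank line separates the header from the body.
    header_text, _, body = text.partition("\n\n")
    headers = {}
    for line in header_text.splitlines():
        # Split on the first colon only, so that values containing
        # colons (such as times) are kept whole.
        key, sep, value = line.partition(":")
        if sep:
            headers[key.strip()] = value.strip()
    return headers, body
```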


5. REDACTION
============

Three forms of redaction have been performed on the collection:
content redaction; item level redaction; and metadata redaction.

5A. Content redaction
---------------------

We redact credit card numbers, social security numbers, and the
name of the company.  Credit card numbers and social security
numbers are redacted by overwriting them with randomly generated,
syntactically valid substitutes; a given identifier is replaced
by the same substitute everywhere it occurs.  The name of the
company is replaced with "AvocadoIT".
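
The idea of consistent replacement can be sketched in Python as
below, for social security numbers only.  This is an illustration,
not the redaction code that was applied to the collection: the
pattern, the generator, and the name "make_redactor" are all
assumptions, and "syntactically valid" is taken here to mean only
the ddd-dd-dddd shape.

```python
import random
import re

# Assumed pattern: three, two and four digits joined by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def make_redactor(seed=0):
    """Return a function that replaces SSN-like strings consistently."""
    rng = random.Random(seed)
    mapping = {}  # original identifier -> substitute

    def replacement(match):
        original = match.group(0)
        if original not in mapping:
            fake = original
            while fake == original:  # never reuse the original
                fake = "%03d-%02d-%04d" % (rng.randrange(1000),
                                           rng.randrange(100),
                                           rng.randrange(10000))
            mapping[original] = fake
        return mapping[original]

    return lambda text: SSN_PATTERN.sub(replacement, text)
```

Because the mapping is kept between calls, the same redactor
applied to several files replaces a given identifier identically
in all of them.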

5B. Item redaction
------------------

We fully redact all attachments and extracted files of the following 
mime types:

   application/x-msaccess
   application/vnd.ms-excel
   application/octet-stream
   application/pgp-keys

We also fully redact top-level items of PST types:

   stickynote
   journal
   other

We fully redact all emails with a "sensitivity" of "private"
or "personal" (67 of the former, 10 of the latter).

Redacted items are listed in the metadata file, with their
mimetypes, but no other metadata given, and no extracted text is
written into the text part of the collection.

5C. Metadata redaction
----------------------

We redact the following metadata fields from each contact item:

  - birthday
  - home_address
  - home_city
  - home_country
  - home_fax
  - home_phone
  - home_phone2
  - home_postal_code
  - home_state
  - home_street
  - mobile_phone
  - other_address
  - other_city
  - other_country
  - other_phone
  - other_postal_code
  - other_state
  - other_street
  - pager_phone
  - spouse_name
  - wedding_anniversary


6. COLLECTION STATISTICS
========================

There are a total of 2,033,740 items in the collection, broken up
into the following categories:

               Type    Total  Redact     Dup   Final No Text    Text
    --------------- -------- ------- ------- ------- ------- -------
              Email   938035       0  323574  614461       0  614461
         Attachment   325506  110023  126411   89072   23731   65341
     Extracted file   556286  298022  172870   85394   14775   70619
            Contact    59455       0       0   59455   59455       0
        Appointment    76902       0       0   76902       0   76902
               Task    15474       0       0   15474       0   15474
           Schedule    26980       0       0   26980       0   26980
             Report     7206       0       0    7206    7206       0
            Journal     4204    4204       0       0       0       0
         Stickynote     2232    2232       0       0       0       0
              Other    21460   21460       0       0       0       0
    =============== ======== ======= ======= ======= ======= =======
              Total  2033740  435941  622855  974944  105167  869777

where the column "Final" is the count of unredacted items in
the metadata (not including container items), and "Text" is the
number of non-duplicate items contained in the text rendering
of the collection.
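
The figures in the table are consistent with two relationships
that are inferred here from the numbers rather than stated in the
text: "Final" equals "Total" minus "Redact" minus "Dup", and
"No Text" plus "Text" equals "Final".  The following Python
sketch checks both, together with the column totals; the name
"inconsistencies" is hypothetical.

```python
# Columns: Total, Redact, Dup, Final, No Text, Text
ROWS = {
    "Email":          (938035,      0, 323574, 614461,     0, 614461),
    "Attachment":     (325506, 110023, 126411,  89072, 23731,  65341),
    "Extracted file": (556286, 298022, 172870,  85394, 14775,  70619),
    "Contact":        ( 59455,      0,      0,  59455, 59455,      0),
    "Appointment":    ( 76902,      0,      0,  76902,     0,  76902),
    "Task":           ( 15474,      0,      0,  15474,     0,  15474),
    "Schedule":       ( 26980,      0,      0,  26980,     0,  26980),
    "Report":         (  7206,      0,      0,   7206,  7206,      0),
    "Journal":        (  4204,   4204,      0,      0,     0,      0),
    "Stickynote":     (  2232,   2232,      0,      0,     0,      0),
    "Other":          ( 21460,  21460,      0,      0,     0,      0),
}
TOTAL = (2033740, 435941, 622855, 974944, 105167, 869777)


def inconsistencies(rows, total):
    """Return a list of row names that break the relationships."""
    problems = []
    for name, (tot, red, dup, fin, notext, text) in rows.items():
        if tot - red - dup != fin or notext + text != fin:
            problems.append(name)
    # Each column of the "Total" row should be the column sum.
    if tuple(sum(col) for col in zip(*rows.values())) != total:
        problems.append("Total")
    return problems
```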

The number of attachments and extracted files redacted by mime
type is as follows:

            Type  Attachments  Extracted
    ------------ ------------ ----------
    octet-stream        93678     297292
    vnd.ms-excel        16321        706
      x-msaccess           22         24
        pgp-keys            2          0
    ============ ============ ==========
           Total       110023     298022