Avocado Research Email Collection
Version 1.0.3

Douglas W. Oard
William Webber
David A. Kirsch
Sergey Golitsynskiy

README

Date: Mon Nov 24 10:18:44 EST 2014

1. INTRODUCTION
===============

The Avocado Research Email Collection ("the Avocado collection") is a corpus of emails and attachments, distributed for use in research and development in e-discovery, social network analysis, and related fields.

Please carefully read and adhere to the usage agreements before beginning work on the collection. All users must personally sign the Avocado Collection End User Agreement before working with the collection, and the collection must be stored on a password-protected computer in a way that prevents access to the collection by anyone who has not personally signed the Avocado Collection End User Agreement.

The Avocado collection consists of records taken from the PST files of accounts of a now-defunct IT company. We refer to this company using the pseudonym "AvocadoIT"; references to the company name in the collection are also replaced with this pseudonym. A PST, or "Personal Storage Table", file is used by MS Outlook to store emails, calendar entries, contact details, and related information. We will refer to the processed contents of these PST files as the "personal folders" of these accounts.

The source data for the Avocado collection consisted of the PST files for 282 accounts. Most of these accounts are those of employees of AvocadoIT; the remainder represent shared accounts such as "Leads", or system accounts such as "Conference Room Upper Canada". Data was extracted from these PST files using libpst version 0.6.54. Three PST files produced no output: one was corrupt and two were empty. The Avocado collection consists of the processed personal folders of the remaining 279 accounts. We follow e-discovery practice by referring to each of these accounts as a "custodian", although some of them do not correspond to individual humans.
The collection is divided into metadata and text. The metadata is represented in XML, with a single top-level XML file listing the custodians, and then one XML file per custodian listing all items extracted from that custodian's PST files. The full XML tree can be read by loading the top-level file with an XML parser that handles directives, but the resulting in-memory DOM tree is very large (over 32GB on our machines), and it may be more practical to process the per-custodian XML files one at a time. We describe the top-level organization of the custodian XML metadata files in Section 2, and the detailed metadata for each item in the personal folder in Section 3. The text contains the extracted text of the items in the custodians' folders, with the extracted text for each item being held in a separate file. The text files are then zipped up into a zip file per custodian. It may be more efficient to process the text files directly in their zip folders, rather than unzipping them first, as the latter may lead to them being scattered across the physical disk (depending on the behaviour of your file system). The contents of the text files are described in Section 4. Section 5 describes the redaction performed upon the collection, and Section 6 provides collection statistics. 1A. A note on processing ------------------------ Metadata stored in the PST files is reproduced with minimal processing (aside from redaction) in the: ... tag for the corresponding item. For emails, this metadata was either created natively by Exchange (for EX transport types) or parsed by Exchange from SMTP headers (for SMTP types). We have made no effort to canonicalize email addresses, standardize headers, align header and metadata fields, regularize metadata names, or the like. Metadata fields are reproduced as found in the PST files, while SMTP headers are given as found in the email text itself. 
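As noted in the introduction, the text files can be processed directly inside the per-custodian zip files rather than unzipping them first. The following is a minimal sketch of one way to do that in Python. The function name is hypothetical, and because the exact directory prefix of members inside each zip file is not stated here, the sketch matches on the base name of each member only. It also assumes the text is UTF-8 (see Section 4) and substitutes a replacement character for any bytes that fail to decode.

```python
import posixpath
import zipfile


def iter_item_texts(zip_path):
    """Yield (item_id, text) for every ".txt" member of one custodian zip.

    Hypothetical helper: the item id is taken to be the member's base
    name without the ".txt" extension, whatever directory prefix the
    member has inside the zip file.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            base = posixpath.basename(name)
            if not base.endswith(".txt"):
                continue  # skip directory entries and anything else
            item_id = base[: -len(".txt")]
            # Assumption: text files are UTF-8; undecodable bytes are replaced.
            text = zf.read(name).decode("utf-8", errors="replace")
            yield item_id, text


# Example use (the path is illustrative):
# for item_id, text in iter_item_texts("text/003.zip"):
#     print(item_id, len(text))
```

Reading members one at a time keeps memory use bounded by the largest single text file, which may matter for custodians with many items.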
The only modification we have made is to reconstruct minimal email headers for EX-transport emails (Section 4A.I), for the convenience of those wishing to index the extracted text with minimal reference to the XML metadata files. Even in this case, however, the user may need to refer back to the XML files for other metadata, such as the calculated duplicates (Section 3B.III).

2. CUSTODIAN METADATA
=====================

There is an XML metadata file describing the contents of the personal folder for each custodian. The XML data for a custodian has the following form:

The "id" attribute gives the identifier for the custodian, which is a number assigned in alphabetical order of the name of the custodian's PST file, counting up from 001. The "items" attribute gives a count of the total number of items in the custodian's PST file. The child elements are described in the following subsections.

All XML metadata files are encoded in utf-8.

2A. Source file
---------------

The source file tag gives basic information about the PST file that the custodian's personal folder has been extracted from. It has the form:

FOO.PST

where "FOO.PST" is the name of the PST file, and the "size" attribute gives the size in MB of the PST file.

2B. PST folder structure
------------------------

PST files have an internal, tree-like, folder organization, with each PST item being contained within a folder. The folder structure is represented as follows:

FOLDER_NAME ... ...

The elements may be nested to arbitrary depth. The "items" attribute counts the number of immediately contained items (not including those in subfolders), and the "subfolders" attribute counts the number of immediately contained subfolders (not including sub-subfolders). The folder "id" is unique across the collection. The tag contains the PST folder name.

2C. Items
---------

The content of the personal folder is stored in items of different types.
The items are listed inside the , as follows: Items are not nested; their folder structure is represented by back references to the section. The different item types, and their common and type-specific metadata, is described in Section 3. 3. ITEM METADATA ================ All extractable items are extracted from the custodian's personal folder, though some item types are redacted (see Section 5B). Retained item types are: emails attachments extracted files calendar-like items (appointments, schedules, and tasks) contacts reports We describe fields and formats common to all items first, and then each item type individually. 3A. Common item attributes -------------------------- 3A.I: Item tag ............... The XML record for each item is an element, of form: The attribute "id" gives the item id (Section 3A.II), and "type" gives the type of the item. The last three types are redacted (Section 5B). The item tag may also have the attribute "duplicates" for the count of duplicates (see Section 3B.III for email duplicates, 3C.III for attachment duplicates, and 3D.III for extracted file duplicates); "replies" for the count of replies, if the item is an email (see Section 3B.II); "redacted" if the item is fully redacted (see Section 5); "attachments" for the number of attachments (only if not itself an attachment or extracted file) (Section 3C); and "pst-folder-id" for the PST folder that this item was found in (not for extracted files) (Section 2). 3A.II: Item id ............. Each item in the collection is assigned an item id. Excepting attachments (Section 3C) and extracted files (Section 3D), these ids have the form: -- Attachments files extend this by sub-numbering: -_- and extracted files sub-number the attachments (possibly with nesting): -__- The custodian IDs count up sequentially from 001 (3 digits), while the item numbers count up sequentially from 000001 (6 digits), for each custodian. 3A.III: Files tag ................. 
The tag has the form: The type is always "text" for this rendering of the collection. The path gives the path from the top-level directory of the hierarchy to the text file, assuming the text ZIP files are unzipped. The text file is contained under the "text" directory in a folder named with the custodian numeric id, (or, in the zipped version of the collection, a filename named .zip). The name of the file within this directory (or zip file) is the id of the item, with extension ".txt". The tag is only present for emails, calendar-like items, and for attachments or extracted files where at least 1 character of text has been extracted, as only these files have text extracted from them. 3A.IV: Relationships tag ........................ The tag has the form: For the relationship, see Section 3C. For the relationship, see Section 3B.III for email de-duplication, Section 3C.III for attachment de-duplication, and Section 3D.III for extracted file de-duplication. For the relationship, see Section 3B.II. For the relationship, see Section 3D. 3A.V: File-data tag ................... The tag has the form: CCC .XXX CCC .XXX It is present only for attachment and "extracted file" types; see Section 3C and Section 3D below. 3A.VI: Container-data tag ......................... The tag has the form: It is only found in attachments or extracted files that are containers. See Section 3D for details. 3A.VII: Metadata tag .................... The tag has the form: CCC ... See the respective item types below for more comments on metadata. Extracted files and attachments have no metadata tags. 3A.VIII: Common metadata fields ............................... The following metadata fields appear in multiple different item types. The meanings of the fields are as inferred by us from the data itself, and from online references for Outlook, PST, and email data. 
- create_date : the date and time the item was created, in UTC - modify_date : the date and time the item was last modified, in UTC - file_as : a human-readable reference for the item (possibly automatically generated by Outlook or Exchange). For emails, this may be the sent-to address, or the name of a file attached to the email. For contacts, this is the name of the contact. For appointments, this appears to be the name of the person making the appointment, which is not otherwise captured. For tasks, it is sometimes the name of the task, sometimes the name of the file attached to the task, and sometimes seems to be an identifier of the alias to whom the task is assigned (such as "@HelpDesk"). - outlook_version : the version of Outlook used to create the item (mostly 9.0, but with other versions varying from 8.0 up to 10.0). - response_requested : whether the creator of the item would like a response. Found in emails, calendar items, reports, and contacts, though semantics in reports and contacts is unclear. - subject : the subject of the email, of the meeting or task for a calendar item, or a summary of the report type for reports. - flags : a bitmap in which the bits have the following semantics (taken from libpst.h): PST_FLAG_READ 0x01 PST_FLAG_UNMODIFIED 0x02 PST_FLAG_SUBMIT 0x04 PST_FLAG_UNSENT 0x08 PST_FLAG_HAS_ATTACHMENT 0x10 PST_FLAG_FROM_ME 0x20 PST_FLAG_ASSOCIATED 0x40 PST_FLAG_RESEND 0x80 PST_FLAG_RN_PENDING 0x100 PST_FLAG_NRN_PENDING 0x200 3B. Email --------- Email items have two transport types: Exchange (EX) or SMTP (SMTP). The type of transport affects the address metadata fields: Exchange messages have names or LDAP identifiers for addresses; SMTP messages have RFC822 email addresses. The transport type is specified in the "sender_access" metadata field. 3B.I: Email metadata ...................... Email items commonly have the following metadata fields: - arrival_date : the date and time at which the email arrived, in UTC. 
- autoforward : whether the email was automatically forwarded to another email address. The forwarded-to address does not appear to be captured in the email metadata. - bcc_address : a list of addresses the email was BCC'ed to. The addresses are expressed either as names (for Outlook messages) or as RFC822 email addresses (for SMTP messages). This metadata field is only found in the personal folder of the email's sender. - cc_address : a list of addresses the email was CC'ed to. Format as for "bcc_address". - delete_after_submit : inverse of whether a copy of the email should be saved at the sender's end. Semantics are unclear: some emails marked "deleted_after_submit" are still found in the Outbox of the sender (particularly notifications about viruses). - delivery_report : does the sender request an automatic report of the message having been successfully delivered? This will sometimes (though not always) match up with a "report" item that has subject "Delivered: ". - importance : the sender-assigned "importance" of an email, displayed to the recipient by Outlook. Either "normal" (96% of cases), "high" (3.6% of cases), or "low" (a handful of cases). - in_reply_to : the message id that this email is in reply to. If that email is also contained in the collection, we record a relationship (Section 3B.II). - message_cc_me : is the custodian in the list of CC addresses? - message_recip_me : is the custodian in either the list of CC addresses or the list of TO addresses? - message_to_me : is the custodian in the list of TO addresses? - messageid : the unique identifier for this email. - original_cc : only occurs a handful of times, when it gives a longer list of CCs than the cc_address field. The meaning of the difference is unclear. - original_to : only occurs a handful of times, when it gives an alternative recipient list to the sentto_address field. The meaning of the difference is unclear. 
- original_sensitivity : differs from "sensitivity" (see below) in 144 instances. In all but one of these, "sensitivity" is set to a higher level of sensitivity than "original_sensitivity". The meaning of the difference is unclear (perhaps an amendment by the recipient?).
- outlook_recipient_name : the name of the recipient by which this email made it into the custodian's personal folder; that is, the name of the custodian as Outlook understands it, though sometimes it is an alias or a role (such as "@HelpDesk"), and in a few cases the name appears to be erroneous (or possibly indicates that a custodian took over another user's personal folder).
- outlook_sender_name : the name of the sender of the email, as a natural name (without email address).
- priority : an indicator of the priority of the email. One of "normal" (76%), "nonurgent" (21%), or "urgent" (3%). I believe this is explicitly set by the sender, and is intended for the attention of the recipient.
- processed_subject : the original subject of the thread to which this email belongs. Frequently, the "subject" field with "RE:" and/or "FW:" stripped from the front.
- read_receipt : whether the sender requests a report when the recipient reads the email. Sometimes (though not always) matches up with a "report" item that has subject "Read: ".
- recip_access : the transport mechanism by which the recipient received the email. One of EX (most common) or SMTP (about 1% of cases).
- recip_address : the address of the recipient. For messages delivered via SMTP, an RFC822 address; for messages delivered via EX, an LDAP identifier.
- reply_requested : whether a reply is requested from the recipient. Holds the same value as "response_requested" for all but a handful of emails.
- reply_to : the address to which replies should be sent, when this is not the same as the sender. For SMTP messages, this is an RFC822 email address; for EX messages, this is a natural name.
- return_path_address : addresses for bounces to go to.
  Only for SMTP messages.
- sender_access : the transport method which the email came from. Most common values are "EX" and "SMTP", but with a couple of hundred "FAX" (an automated fax-to-email bridge, with the fax attached as a TIFF), a hundred-odd NONE, and a handful of HANDMAIL and SYSTEM.
- sender_address : the address of the sender. For SMTP messages, this is an RFC822 email address; for EX messages, this is an LDAP identifier.
- sender2_access : the transport method corresponding to the "sender2_address". This may differ from "sender_access" if "sender2_address" (see next) differs from "sender_address".
- sender2_address : for six-hundred-odd emails, this field differs from the "sender_address" field. Common cases are where the former gives an RFC822 address and the latter an Exchange LDAP identifier; or the former gives a mailing list address, and the latter a personal address.
- sensitivity : the sensitivity of the email, as (I believe) explicitly set by the sender. One of "none" (almost all emails), "company confidential" (235 emails), "personal" (10), or "private" (67). See also "original_sensitivity". All emails with "sensitivity" of "private" or "personal" have been redacted (Section 5B).
- sent_date : the date and time the email was sent, in UTC.
- sentto_address : the address or addresses to which the email was addressed. Addresses are either natural names or RFC822 email addresses. This will not be the same as "outlook_recipient_name" if:
    + the message is outgoing
    + the custodian was CC'ed
    + the custodian was part of a list of TO addresses
    + the email was sent to a group list (e.g. "All Employees") to which the custodian belonged

Various other metadata fields occur less frequently in emails, such as "X-" header fields pulled in from SMTP messages.
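The "flags" bitmap described in Section 3A.VIII can be decoded with a few lines of Python. This is a minimal sketch: the bit values are those listed in Section 3A.VIII (from libpst.h), but the function name is hypothetical, and how the value is serialized in the XML metadata is not stated here, so the sketch simply assumes it arrives as an integer or a decimal string.

```python
# Bit values as listed in Section 3A.VIII (taken from libpst.h).
PST_FLAGS = {
    0x01: "PST_FLAG_READ",
    0x02: "PST_FLAG_UNMODIFIED",
    0x04: "PST_FLAG_SUBMIT",
    0x08: "PST_FLAG_UNSENT",
    0x10: "PST_FLAG_HAS_ATTACHMENT",
    0x20: "PST_FLAG_FROM_ME",
    0x40: "PST_FLAG_ASSOCIATED",
    0x80: "PST_FLAG_RESEND",
    0x100: "PST_FLAG_RN_PENDING",
    0x200: "PST_FLAG_NRN_PENDING",
}


def decode_flags(value):
    """Return the names of all bits set in a "flags" value.

    Hypothetical helper. Assumption: `value` is an int or a decimal
    string; adjust the conversion if the metadata stores it otherwise.
    """
    bits = int(value)
    return [name for bit, name in PST_FLAGS.items() if bits & bit]


print(decode_flags(0x11))  # -> ['PST_FLAG_READ', 'PST_FLAG_HAS_ATTACHMENT']
```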
Certain other metadata fields are present in the PST version of the email, but have been removed from the processed collection as redundant or uninformative: + htmlbody : an HTML rendering of the body + Internet Charset Body: a rendering of the body in an "internet-friendly" character set + conversion_prohibited : value is always "0" + ndr_diag_code: value is always "0" + ndr_reason_code: value is always "0" + ndr_status_code: value is always "0" + rtf_body_tag: information about RTF rendering of message, which is not retained. + rtf_body_char_count: (ditto) + rtf_body_crc: (ditto) + rtf_body_in_sync: (ditto) + rtf_in_sync: (ditto) + rtf_ws_prefix_count: (ditto) + recip2_access : always the same as recip_access + recip2_address : always the same as recip_address + outlook_sender : can be reconstructed as sender_access + ":" + sender_address + outlook_sender2 : can be reconstructed as sender2_access + ":" + sender2_address + outlook_recipient : can be reconstructed as recip_access + ":" + recip_address + outlook_recipient2: can be reconstructed as recip2_access + ":" + recip2_address 3B.II: Email replies .................... PST metadata includes an "in_reply_to" field for certain emails, containing the message ID of the email that this email is replying to. We resolve and mark these reply-to relationships. In the replying email, the relationship is marked by the tag: where the ID refers to the collection ID (_not_ the message id) of the replied-to email. In the replied-to email, the number of replies (but not the reply ids) is recorded as the "replies" attribute of the tag. 3B.III: Email de-duplication .......................... Two emails are considered duplicates if and only if they have the same message ID and the same subject line. Note that some messages with the same message ID have different subjects and contents. 
For instance, the notification that a message could not be delivered or contained a virus has the same message ID as the original message, but different contents. Surprisingly, message IDs are sometimes identical even for quite different messages. It also happens that two emails are seemingly the same (same message ID, sender, and date), but have different subjects. The deduplication carried out here should be considered a reasonable approximation, not perfect. We attempt to identify the sender's version of an email and make it the canonical version in a duplicate set. The sender version is detected first as the version for which "outlook_sender_name" is the name of the custodian who owns the mailbox; this is determined by looking at the "outlook_recipient_name" values of messages in the mailbox where the "message_to_me" metadata field is "1". If no emails in the duplicate set have the custodian's name as the sender name, or if more than one does, then the email having a "bcc_address" field is selected. If more than one email tie, then an arbitrary email is chosen. Note that the sender may have more than one copy of the email in their mailbox; for instance, one may be in their "sent" folder, while they may have explicitly CC'ed another to themselves. The duplicate versions of an email point to the canonical version with the tag: while the canonical version of the email will count the number of duplicates (not including itself) in the "duplicates" attribute of the tag. An email that is not in a duplicate set has neither the "duplicates" attribute nor the tag. Both canonical and duplicate versions of an email are listed in full in metadata and as text. De-duplication only records the duplicate links between these items. Duplicate emails may have differing attachments. Sometimes files attached by the sender are stripped off at the receiver (VCARDs, for instance, or other attachment types). 
Occasionally, the receiver has attachments that the sender did not send; these appear to be automatically-generated attachments, such as notifications by virus checkers. 3C. Attachments --------------- An attachment is a file attached to another PST item type. Note that not only emails may have attachments, but also calendar-like items, contacts, and even reports. Attachment ids are counted sequentially from the item they are attached to. So, for instance, the first attachment of item: 0001-000001-EM will be: 0001-000001_1-AT and the second will be 0001-000001_2-AT Attachments have no metadata tags. 3C.I: Attachment links ...................... The item that an attachment is attached to is identified by the: tag. Items that are capable of having attachments (all except for attachments and extracted files) have the number of their attachments counted in the "attachments" attribute of the tag. 3C.II: Attached file .................... Information about the attached file is held in the tag, as follows: CCC .XXX CCC XXX The "size" attribute is the size in bytes of the native format of the attachment. The "extracted-chars" gives the number of characters extracted from the file as text. Text extraction is performed using Tika version 1.1. If any text is successfully extracted (and the item is not redacted), then the extracted text can be found in the location given by the element. The and are taken from PST metadata. Some attachments are anonymous, and have no name or extension. The and is as detected by the UNIX "file" utility. 3C.III: Attachment de-duplication ................................. Attachments are de-duplicated based upon the md5sum of the native format of the file. The canonical version of a set of duplicate attachments is the first one encountered in the collection. Duplicates and canonical versions are marked using the tag and "duplicates" attribute, as with emails (see Section 3B.III). 
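The attachment de-duplication rule above (identical md5sum of the native file, with the first one encountered treated as canonical) can be sketched as follows. This is an illustration only, not the code used to build the collection: the function name and its input, a sequence of (item id, native bytes) pairs in collection order, are hypothetical, and it assumes the native bytes of each file are available.

```python
import hashlib


def find_duplicates(files):
    """Group files by MD5 digest; the first file seen is canonical.

    `files` is an iterable of (item_id, data) pairs, where `data` is
    the native content as bytes (hypothetical input format).

    Returns (duplicate_of, duplicate_counts):
      duplicate_of     maps each duplicate id to its canonical id
      duplicate_counts maps each canonical id to its number of
                       duplicates, not counting itself
    """
    first_seen = {}        # digest -> canonical item id
    duplicate_of = {}
    duplicate_counts = {}
    for item_id, data in files:
        digest = hashlib.md5(data).hexdigest()
        if digest in first_seen:
            canonical = first_seen[digest]
            duplicate_of[item_id] = canonical
            duplicate_counts[canonical] = duplicate_counts.get(canonical, 0) + 1
        else:
            first_seen[digest] = item_id
    return duplicate_of, duplicate_counts


# Example with illustrative ids and contents:
# find_duplicates([("A", b"x"), ("B", b"y"), ("C", b"x")])
# returns ({"C": "A"}, {"A": 1})
```

The two returned mappings correspond to the two ways duplicates are recorded in the metadata: a link from each duplicate to its canonical version, and a count on the canonical version.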
Attachment and email de-duplication is calculated independently: two emails can have duplicate attachments without themselves being duplicates. Some items (for example, vcards) have a very large number of duplicates in the collection. As with emails, duplicate attachments are included verbatim in the collection.

3D. Extracted files
-------------------

An attached file could be a container which itself holds other files. These files are recursively extracted, with the extracted files having type "extracted file". Container files of type ZIP, TAR, GZIP, and PST are handled. GZIP containers will only have a single extracted file. Note that PST attachments are handled, and their contents fully extracted.

The ID of an extracted file is numbered sequentially from the attachment or extracted file it is extracted from. So, for instance, if:

0052-000871-AT

is a container, then the first item in the container is:

0052-000871_1-EX

and if that item itself is a container, then the first item inside that container is:

0052-000871_1_1-EX

and so forth.

Extracted files have no metadata tags.

3D.I: Extracted file links
..........................

A file that is a container file has a tag with the following form:

The "extracted" attribute gives the number of items extracted from the container. An extracted file points back to the container it was extracted from using the: field.

3D.II: Extracted file
.....................

Information about the extracted file is held in the tag. This has the same form and semantics as for attachments (Section 3C.II).

3D.III: Extracted file de-duplication
.....................................

Extracted files are de-duplicated based on the md5sum values of their native versions. De-duplication is performed and noted in the same way as for attachments (Section 3C.III). Extracted files and (top-level) attachments can be duplicates of each other.

3E. Calendar-like items
-----------------------

There are three item types that we class as calendar-like items: appointments; schedules; and tasks. Appointments are full calendar items, with start and end times, alarms, reminders and so forth (Section 3E.I). Schedules and tasks have only the common metadata fields described in Section 3A.VIII, most particularly "create_date" and "subject", as well as a "body" that goes into the text rendering of the item (Section 4C).

Appointments may have the following metadata items:

- alarm : whether to raise an alarm to the user shortly before the appointment.
- alarm_minutes : the number of minutes before the appointment to raise the alarm.
- all_day : is this an all-day appointment?
- end : the end time for the appointment (in UTC).
- is_recurring : is this a recurring appointment?
- label : always "None" in this collection.
- location : the location of the appointment, as a natural name (XXX's office, YYY conference room); or a note saying that the location is to be determined; or the name of the person the meeting is with; or the phone number to call for a teleconference.
- recurrence_description : a textual description of the meeting recurrence (e.g. "every Friday from 11:00AM to 12:30PM").
- recurrence_end : the date and time on which the appointment stops recurring (in UTC).
- recurrence_start : the date and time on which the appointment starts recurring (in UTC).
- recurrence_type : the frequency of recurrence of the item ("daily", "weekly", "monthly", or "yearly").
- reminder : the date and time at which to send a reminder (in UTC).
- showas : how the time is shown in the user's calendar ("busy", "free", "out of office", or "tentative").
- start : start time for the appointment (in UTC).
- timezonestring : a string giving the timezone in which the meeting is being scheduled.

3F. Contacts
------------

Contact items hold metadata information about the contacts held in a custodian's PST file.
Contact items are _not_ rendered into the text version of the collection; they are _only_ stored in the XML data. Vcards that are attached to emails, however, are stored in Vcard format in the text representation, and are not parsed into metadata in the XML metadata representation.

The metadata fields occurring in more than a handful of contacts (in addition to the common metadata fields described in Section 3A.VIII) are:

- account_name : typically, name of user's system login account.
- address1 : the person's "address", normally either as an RFC822 email address or as an LDAP identifier, but sometimes simply their name.
- address1_desc : where this differs from "address1", it typically gives the natural name of the person identified by address1.
- address1_transport : the transport mechanism for sending messages to "address1". One of "SMTP", "EX", "FAX", or "MAILTO".
- address2 : a secondary address for the contact (such as an alternative email address).
- address2_desc : as for "address1_desc".
- address2_transport : as for "address1_transport".
- address3 : a tertiary address for the contact.
- address3_desc : as for "address1_desc".
- address3_transport : as for "address1_transport".
- assistant_name : name of assistant to contact (only for 44 contacts).
- assistant_phone : phone number of assistant to contact (only for 40 contacts).
- business_address : normally the full business mailing address of the contact, but sometimes just a city name.
- business_city : the city of the contact's business address (though occasionally a full address is placed here).
- business_country : the country of the contact's business address, but often in fact used for the zip or postal code. Country names include variant or mis-spellings.
- business_fax : business fax number of the contact.
- business_homepage : homepage of the business the contact works for (not generally of the contact's own business web page).
- business_phone : contact's business phone number.
- business_phone2 : alternative business phone number for contact, but often in fact holds a country name.
- business_postal_code : business postal or zip code for contact.
- business_state : state of contact's business address (but sometimes city or country names are found instead).
- business_street : street number and name of contact's business address.
- car_phone : number of contact's car phone (incredibly, there are 65 of these).
- company_main_phone : the main or central phone number of the company.
- company_name : the name of the company the contact works for.
- def_postal_address : default postal address for the contact. The same as business_address in all but under 70 cases.
- department : the company department the contact works for (only present in 700-odd cases).
- display_name_prefix : prefix for contact's display name (most common is "Herr" [sic.], with several "Mr.", "Frau" [sic.], and "CEO", and a couple of "Chief Smarty Pants").
- EXT : phone extension of contact (mostly for employees of AvocadoIT).
- first_name : first name of contact.
- followup : present in around 300 cases, all but a handful of which have the value "Follow up".
- full_name : full name of contact.
- gender : gender of contact ("unspecified" in all save a handful of cases).
- initials : the initials of the contact.
- isdn_phone : an alternative location for the contact's phone extension.
- job_title : contact's job title.
- mail_permission : always has value "0"; meaning unclear.
- manager_name : name of contact's manager. Occurs in 91 cases.
- middle_name : middle name of contact.
- office_loc : contact's office number, or city, or simply "offsite".
- personal_homepage : found in 39 contacts. In most cases, is actually a company home page.
- primary_fax : main fax number of contact (only 18 instances).
- primary_phone : main phone number of contact (a few hundred instances).
- rich_text : possibly, whether the contact prefers to receive emails in RTF format.
"0" for all save two cases that are "1". - suffix : suffix to contact's name. Sometimes academic or professional qualifications; sometimes country or company name; sometimes personal name modifier such as "II" or "Jr" - surname : contact's surname. - work_address_city : city part of contact's work address. Identical to "business_city" for all except a few dozen entries. - work_address_country : country part of contact's work address. Less frequently contains a zip or postal code than "business_country". - work_address_postalcode : postal code of business address. Less frequently contains a country name than "business_postalcode". - work_address_state : state part of contact's work address. Identical to "business_state" for all except a dozen entries (and some of them are variant representations). - work_address_street : street part of contact's work address. Identical to "business_street" for all except a few dozen entries. Certain additional metadata fields containing personal information have been redacted; see Section 5C. 3G. Reports ----------- Report items contain metadata reporting the successful completion of actions in the Exchange system. For instance, if a sender sends an email with subject "Foo Bar Baz" and "read_receipt" set to "1", then when the recipient reads the email, a report with "subject" of "Read: Foo Bar Baz" will be created in the recipient's PST folder. These report items are _not_ rendered into the text version of the collection; they are _only_ stored in the XML metadata. Reports contain only the common metadata fields described in Section 3A.VIII. 4. TEXT FILES ============= The text rendering of the collection is stored under the "text/" subdirectory. The items for a custodian are stored in a ZIP file with that custodian's ID as the name, e.g. "text/003.zip". Each item (email, attachment or extracted file, or calendar-like item) is stored in its own file within the zip file. The name of the file is the item's ID followed by ".txt". 
Contacts and reports do _not_ have a text rendering.

All text files in the collection are in utf8 encoding. The encoding
of source files in the input collection was rarely stated. Where
files could not be interpreted as utf8 (around 50,000 files),
conversion was attempted from cp1252 encoding. Where that too was
impossible (103 files), invalid utf8 byte sequences were replaced by
Unicode character U+FFFD (a question mark on a black diamond).

Several files (around 2,000) were originally in a Japanese-specific
encoding (mostly iso-2022-jp), but have become corrupted, either in
the PST files themselves or through the libpst utility used to
extract them. These files contain '?' (ASCII character code 0x3F)
instead of the original Japanese characters.

Two XML files (064.xml and 283.xml) do not have associated text
files. For 064.xml, the source file name is DS8178.PST, and all the
extracted items (all audio files) are redacted. For 283.xml, the
source file name is work.pst, and it contains only contacts. Since
contacts are not rendered into the text version, the corresponding
zip file is not included.

4A. Email text renderings
-------------------------

Emails may come from within MS Exchange or from a non-Exchange SMTP
client. SMTP emails come with full SMTP header information, and this
is retained verbatim in the text version of the email. Exchange
emails (marked with access type EX in the metadata), however, do not
have instantiated headers stored in the PST file.

4A.I: Reconstructed Headers
...........................

For the convenience of users wishing to index the extracted text
files with no or minimal reference to the metadata, we have
reconstructed email headers in RFC-822-like format for Exchange
emails.
The reconstructed header fields are:

  4A.I(i): From - name taken from the "outlook_sender_name" metadata
    field, with email address taken from the "sender_address"
    metadata field if that contains an RFC822 address, or guessed
    from the name otherwise (see Section 4A.II below).
  4A.I(ii): To - name(s) taken from "sentto_address", address(es)
    guessed from name(s) (see Section 4A.II below).
  4A.I(iii): Cc - name(s) taken from "cc_address", address(es)
    guessed from name(s) (see Section 4A.II below).
  4A.I(iv): Bcc - name(s) taken from "bcc_address", address(es)
    guessed from name(s) (see Section 4A.II below).
  4A.I(v): Subject - taken from "subject" metadata field.
  4A.I(vi): Date - taken from "sent_date" metadata field, reformatted
    in RFC822 format.
  4A.I(vii): Message-ID - taken from "messageid" metadata field.
  4A.I(viii): In-Reply-To - taken from "in_reply_to" metadata field.

For SMTP emails, the header is provided in the PST file, and this is
prepended to the body, separated by a blank line, in the text
rendering. In some cases, the header appears to be corrupted (whether
due to issues with Outlook, or with libpst, or perhaps with the
original mailer); for instance, email addresses are malformed, or the
"To: " header is empty and has "Subject: " appearing directly after
it on the same line. These corruptions are left as-is in the text
rendering; some care must be employed when using standard email
parsers to parse these emails (for instance, version 1.4 of the Java
Mail API raises parse exceptions on around 650 of the emails in the
collection).

4A.II: Reconstructed email addresses
....................................

When reconstructing email headers for Exchange messages, we attempted
to assign an RFC822 email address to every name, as this information
is not directly provided in the PST files. To find these
name--address mappings, we looked in three sources:

  a.) Contact files.
  b.) The "outlook_sender_name" and "sender_address" fields of SMTP
      messages.
  c.) The "outlook_recipient_name" and "sentto_address" fields of
      SMTP messages.

The order of precedence for selecting addresses (from highest to
lowest) was:

  1. addresses with "avocadoit" in them from contact files.
  2. addresses with "avocadoit" in them from sources b.) and c.).
  3. addresses without "avocadoit" from all sources.

Within a precedence category, the address with the highest total
appearance count was selected.

Three addresses were manually assigned, for the following names:

  "sriram"
  "David Hosch"
  "DA"

These manual assignments were necessary because of confusion caused
by aliased mail addresses such as "support@avocadoit.com" that were
redirected to a custodian.

Where no email address could be found for a name, we generate a fake
email address of the form "PERSON.NAME@no-address.com".

4B. Attachment and extracted file text rendering
------------------------------------------------

Text is extracted from attachments and extracted files using Apache
Tika 1.1. If any text is successfully extracted (and the file is not
redacted; see Section 5B), then the extracted text is placed in the
text rendering of the collection. If no text is successfully
extracted (for instance, the file is an image), then no text file is
created. Container items do not have text directly extracted from
them, though the extracted files they contain may have extracted
text.

4C. Calendar-like item text rendering
-------------------------------------

For the convenience of users wishing to index the extracted text
files with no or minimal reference to the metadata, we create a text
file presentation of calendar items in the text section of the
collection. The calendar item is in pseudo-RFC822 format, with a
header containing one "Name: value" pair per line, and a body, with a
blank line separating the two. The header fields are "Subject" and
(for appointment items only) "Location", "Start", "End", and
"Recurrence". The body is as extracted from the PST file.
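
The header/body layout of a calendar-like item can be split apart
with a few lines of Python. This is a sketch only: the function name
is invented for illustration, and it assumes that each header line
holds a single name and value separated by the first colon and that
the first blank line ends the header. Real files should be checked
against these assumptions.

```python
def parse_calendar_text(text):
    """Split a calendar-like item rendering into (headers, body).

    Assumption: header lines look like "Subject: ..." and the first
    blank line separates the header from the body.
    """
    text = text.replace("\r\n", "\n")
    head, _, body = text.partition("\n\n")
    headers = {}
    for line in head.split("\n"):
        # Split on the first colon only, so values such as times
        # ("10:00") keep their own colons.
        name, colon, value = line.partition(":")
        if colon:
            headers[name.strip()] = value.strip()
    return headers, body
```

The returned dictionary is keyed by header name ("Subject",
"Location", and so on), and the body is returned unchanged.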
Note that the body is _not_ contained in the XML metadata file; it is
only contained in the text rendering of the calendar item.

5. REDACTION
============

Three forms of redaction have been performed on the collection:
content redaction; item level redaction; and metadata redaction.

5A. Content redaction
---------------------

We redact credit card numbers, social security numbers, and the name
of the company. Credit card numbers and social security numbers are
redacted by overwriting them with randomly-generated, syntactically
valid simulacrums, with each identifier being replaced by the same
random simulacrum everywhere it occurs. The name of the company is
replaced with "AvocadoIT".

5B. Item redaction
------------------

We fully redact all attachments and extracted files of the following
mime types:

  application/x-msaccess
  application/vnd.ms-excel
  application/octet-stream
  application/pgp-keys

We also fully redact top-level items of PST types:

  stickynote
  journal
  other

We fully redact all emails with a "sensitivity" of "private" or
"personal" (67 of the former, 10 of the latter).

Redacted items are listed in the metadata file with their mime types,
but with no other metadata given, and no extracted text is written
into the text part of the collection.

5C. Metadata redaction
----------------------

We redact the following metadata fields from each contact item:

- birthday
- home_address
- home_city
- home_country
- home_fax
- home_phone
- home_phone2
- home_postal_code
- home_state
- home_street
- mobile_phone
- other_address
- other_city
- other_country
- other_phone
- other_postal_code
- other_state
- other_street
- pager_phone
- spouse_name
- wedding_anniversary

6. COLLECTION STATISTICS
========================

There are a total of 2,033,740 items in the collection, broken up
into the following categories:

Type               Total   Redact      Dup    Final  No Text     Text
---------------  -------  -------  -------  -------  -------  -------
Email             938035        0   323574   614461        0   614461
Attachment        325506   110023   126411    89072    23731    65341
Extracted file    556286   298022   172870    85394    14775    70619
Contact            59455        0        0    59455    59455        0
Appointment        76902        0        0    76902        0    76902
Task               15474        0        0    15474        0    15474
Schedule           26980        0        0    26980        0    26980
Report              7206        0        0     7206     7206        0
Journal             4204     4204        0        0        0        0
Stickynote          2232     2232        0        0        0        0
Other              21460    21460        0        0        0        0
===============  =======  =======  =======  =======  =======  =======
Total            2033740   435941   622855   974944   105167   869777

where the column "Final" is the count of unredacted items in the
metadata (not including container items), and "Text" is the number of
non-duplicate items contained in the text rendering of the
collection.

The number of attachments and extracted files redacted by mime type
is as follows:

Type           Attachments   Extracted
------------  ------------  ----------
octet-stream         93678      297292
vnd.ms-excel         16321         706
x-msaccess              22          24
pgp-keys                 2           0
============  ============  ==========
Total               110023      298022
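
The item counts can be cross-checked with a short Python sketch. The
relations tested here (Final = Total - Redact - Dup, and Text = Final
- No Text) are our reading of the column names rather than something
the README states, and the names ROWS, TOTALS, and check_counts are
ours.

```python
# Rows copied from the item-count table, in column order:
# Total, Redact, Dup, Final, No Text, Text.
ROWS = {
    "Email": (938035, 0, 323574, 614461, 0, 614461),
    "Attachment": (325506, 110023, 126411, 89072, 23731, 65341),
    "Extracted file": (556286, 298022, 172870, 85394, 14775, 70619),
    "Contact": (59455, 0, 0, 59455, 59455, 0),
    "Appointment": (76902, 0, 0, 76902, 0, 76902),
    "Task": (15474, 0, 0, 15474, 0, 15474),
    "Schedule": (26980, 0, 0, 26980, 0, 26980),
    "Report": (7206, 0, 0, 7206, 7206, 0),
    "Journal": (4204, 4204, 0, 0, 0, 0),
    "Stickynote": (2232, 2232, 0, 0, 0, 0),
    "Other": (21460, 21460, 0, 0, 0, 0),
}
TOTALS = (2033740, 435941, 622855, 974944, 105167, 869777)


def check_counts(rows, totals):
    """Return a list of inconsistencies found (empty if there are none)."""
    problems = []
    for name, (total, redact, dup, final, no_text, text) in rows.items():
        # Assumed relation: items left after redaction and de-duplication.
        if total - redact - dup != final:
            problems.append(f"{name}: Total - Redact - Dup != Final")
        # Assumed relation: final items that have a text rendering.
        if final - no_text != text:
            problems.append(f"{name}: Final - No Text != Text")
    # Each column should add up to the figure in the Total row.
    sums = tuple(sum(column) for column in zip(*rows.values()))
    if sums != tuple(totals):
        problems.append("column sums do not match the Total row")
    return problems


print(check_counts(ROWS, TOTALS))
```

An empty list means the published figures are consistent under these
assumed relations.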