
Extract metadata from a Microsoft PST file with Python and pypff

Tags: python, pst, libpff

I am having trouble extracting metadata from a PST file.

As you can see in the code below, I am using pypff to read the PST file. I need to extract the following data from each email: sender, recipient, subject, date and time, and of course the email content.

I can get most of these, but I just can't find the recipient.

I'm asking for help here; maybe you know a better way to do this. I have already thought about "unpacking" all the .msg files from the PST into a folder and then iterating over them, but I wouldn't know how to do that either.

Thanks in advance for your answers and help.

# Retrieving e-mails from a PST file
# File opening

# First we load the libraries
import pypff
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Then we open the file: opening can nevertheless take quite long,
# depending on the size of the archive.
pst = pypff.file()
pst.open("PathTo.pst")

# Metadata extraction

#It is possible to navigate through the structure using the functions
#offered by the library, from the root:
root = pst.get_root_folder()

#To extract the data, a recursive function is necessary:
def parse_folder(base):
    messages = []
    for folder in base.sub_folders:
        if folder.number_of_sub_folders:
            messages += parse_folder(folder)
        print(folder.name)
        for message in folder.sub_messages:
            print(message.transport_headers)
            messages.append({
                "subject": message.subject,
                "sender": message.sender_name,
                "datetime": message.client_submit_time,
            })
    return messages

messages = parse_folder(root)
Legion asked Dec 22 '25 02:12



1 Answer

Finding the recipient is actually not that easy, because a PST file is usually exported from a single recipient's mailbox. I don't know if this will help you, but I'm dealing with a similar issue right now. In theory, you can extract Original-Recipient or Final-Recipient from the message object by parsing transport_headers, using something like this:

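import re

headers = {}
for hp in message.transport_headers.split('\n'):
    # Match "Key: value" lines; the trailing \r comes from CRLF line endings
    pts = re.findall(r'^([^:]+): (.+)\r$', hp)
    if pts:
        key = pts[0][0].capitalize()  # e.g. "Final-Recipient" -> "Final-recipient"
        headers[key] = pts[0][1]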
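
If the transport headers are there, the standard library's email package may be a more forgiving way to read them than a regex, since it also copes with header values folded over several lines. This is only a minimal sketch, assuming message.transport_headers holds an ordinary mail header block (it may be empty for some messages, and Bcc is usually absent from received mail). The name recipients_from_headers and the sample addresses are made up for illustration:

```python
from email.parser import HeaderParser
from email.utils import getaddresses

RECIPIENT_FIELDS = ("To", "Cc", "Bcc")


def recipients_from_headers(raw_headers):
    """Return a dict mapping To/Cc/Bcc to lists of e-mail addresses."""
    if not raw_headers:
        # No transport headers stored for this message.
        return {field: [] for field in RECIPIENT_FIELDS}
    if isinstance(raw_headers, bytes):
        # Assumption: undecodable bytes are replaced rather than raising.
        raw_headers = raw_headers.decode("utf-8", errors="replace")
    parsed = HeaderParser().parsestr(raw_headers)
    result = {}
    for field in RECIPIENT_FIELDS:
        values = parsed.get_all(field, [])
        # getaddresses yields (display name, address) pairs.
        result[field] = [addr for _name, addr in getaddresses(values) if addr]
    return result


# Hypothetical header block with a folded To line.
sample = (
    "From: Alice <alice@example.com>\r\n"
    "To: Bob <bob@example.com>,\r\n"
    "\tCarol <carol@example.com>\r\n"
    "Cc: dave@example.com\r\n"
    "Subject: Test\r\n"
    "\r\n"
)
print(recipients_from_headers(sample))
```

Inside the folder loop you would call it as recipients_from_headers(message.transport_headers) and store the result next to the subject and sender.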


Or maybe something like this:

# Dump every MAPI property stored on the message
for record_set in message.record_sets:
    for entry in record_set.entries:
        print(f"entry type {hex(entry.get_entry_type())} {entry.get_value_type()} {entry.data}")
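
To narrow that dump down to the recipient fields, one option is to filter on the MAPI property identifiers for the display-name properties: 0x0E04 (display To), 0x0E03 (display Cc) and 0x0E02 (display Bcc). The following is a rough sketch, not a tested recipe. It assumes each entry offers get_entry_type(), get_value_type() and data as used above, that data is raw bytes, and that 8-bit strings are cp1252 encoded (a guess that may not fit every mailbox). The name display_recipients is made up for illustration:

```python
# MAPI property identifiers for the display-name recipient properties.
DISPLAY_PROPERTIES = {0x0E04: "to", 0x0E03: "cc", 0x0E02: "bcc"}

UNICODE_STRING = 0x001F  # MAPI value type for UTF-16-LE strings


def display_recipients(record_sets):
    """Collect the To/Cc/Bcc display names found in a message's record sets."""
    found = {}
    for record_set in record_sets:
        for entry in record_set.entries:
            label = DISPLAY_PROPERTIES.get(entry.get_entry_type())
            data = entry.data
            if label is None or not data:
                continue
            # Assumption: anything that is not a Unicode string is cp1252.
            if entry.get_value_type() == UNICODE_STRING:
                encoding = "utf-16-le"
            else:
                encoding = "cp1252"
            text = data.decode(encoding, errors="replace").rstrip("\x00")
            # Multiple recipients are separated by semicolons.
            found[label] = [name.strip() for name in text.split(";") if name.strip()]
    return found
```

You would call it with the record sets of a message object, for example display_recipients(message.record_sets). Note that these properties hold display names, so you may get "Bob Smith" rather than an e-mail address.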
Raul Fernando Casallas Malaver answered Dec 24 '25 21:12



