Parsing mbox files in Python

Tags: python, email, mbox

Python newbie here. I want to walk through a large mbox file, parsing email messages. I can do that with:

import sys
import mailbox

def gen_summary(filename):
    mbox = mailbox.mbox(filename)
    for message in mbox:
        subj = message['subject']
        print(subj)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print('Usage: python genarchivesum.py mbox')
        sys.exit(1)

    gen_summary(sys.argv[1])

But I need more control. I need the byte position where a given email starts in the mbox file, and the number of bytes the message occupies on disk. Later, instead of iterating from the beginning of the mbox file, I need to seek to a given message and parse just that one (which is why I need the byte position). These are large mbox files, so efficiency is a concern.

The purpose of all this is so that I can generate a summary file, which contains some small bits about each email in the mbox, and then in the future efficiently look up individual emails within the mbox.

Mark Fletcher asked Apr 20 '12 18:04


1 Answer

I haven't tested this, but something like this might work for you. Just open the file (in binary mode so your byte counts are correct), and scan through it, finding messages.

def is_mail_start(line):
    # each mbox message begins with a "From " separator line;
    # the file is opened in binary mode, so compare against bytes
    return line.startswith(b"From ")

def build_index(fname):
    with open(fname, "rb") as f:
        start = None  # byte offset of the current message (None until the first one is found)
        pos = 0       # byte offset of the line being examined
        for line in f:
            if is_mail_start(line):
                if start is not None:
                    yield (start, pos - start)  # (offset, length) of previous message
                start = pos
            pos += len(line)
        if start is not None:
            yield (start, pos - start)  # (offset, length) of last message

# get index as a list
mbox_index = list(build_index(fname))

Once you have the index, you can use the .seek() method on a file object to seek there, and .read(length) on the file object to read just one message. I'm not sure how you will use the mailbox module with a string, though; I think it is meant to work on a mailbox in-place. Maybe there is some other mail-parsing module you can use.
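For the seek-and-read step, here is a minimal sketch, assuming Python 3 and an (offset, length) pair taken from an index like the one above. The helper name read_message is made up for illustration. It uses email.message_from_bytes from the standard library rather than the mailbox module, which is one possible choice and not necessarily the best one for every mbox variant.

```python
import email


def read_message(fname, offset, length):
    """Parse the single message stored at `offset` in an mbox file.

    Hypothetical helper: `offset` and `length` are assumed to come from
    a previously built index of (byte offset, byte length) pairs.
    """
    with open(fname, "rb") as f:
        f.seek(offset)        # jump straight to the start of the message
        raw = f.read(length)  # read only this message's bytes
    # Drop the "From " separator line; what remains is an ordinary message.
    _, _, body = raw.partition(b"\n")
    return email.message_from_bytes(body)
```

Something like read_message(fname, offset, length)['subject'] should then give you the subject of that one message without scanning the rest of the file.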

steveha answered Oct 23 '22 08:10