
How to efficiently parse emails without touching attachments using Python

I'm experimenting with Python's imaplib (Python 2.6) to fetch emails from Gmail. Every time I fetch an email with the fetch method (http://docs.python.org/library/imaplib.html#imaplib.IMAP4.fetch), I get the whole email. I need only the text part, plus the names of the attachments, without downloading the attachments themselves. How can this be done? I see that the emails returned by Gmail follow the same multipart format that browsers send to HTTP servers.

Viet asked Feb 20 '10 05:02


2 Answers

Take a look at this recipe: http://code.activestate.com/recipes/498189/

I adapted it slightly to print the From, Subject, and Date headers, the names of any attachments, and the message body (just plain text for now; it's trivial to add HTML messages).

I used the Gmail POP3 server in this case, but it should work for IMAP as well.

import poplib, email

mailserver = poplib.POP3_SSL('pop.gmail.com')
mailserver.user('recent:YOURUSERNAME') # use Gmail's 'recent mode'
mailserver.pass_('YOURPASSWORD') # consider not storing this in plaintext!

# Written for Python 2 (the question uses 2.6). On Python 3, retr() returns
# bytes, so join with b"\n" and parse with email.message_from_bytes() instead.
numMessages = len(mailserver.list()[1])
for i in reversed(range(numMessages)):
    lines = mailserver.retr(i+1)[1]
    mail = email.message_from_string("\n".join(lines))

    message = ""
    body = ""  # reset per message so an earlier message's text never leaks in
    message += "From: " + mail.get("From", "") + "\n"
    message += "Subject: " + mail.get("Subject", "") + "\n"
    message += "Date: " + mail.get("Date", "") + "\n"

    for part in mail.walk():
        if part.is_multipart():
            continue
        dtypes = part.get_params(None, 'Content-Disposition')
        if not dtypes:
            if part.get_content_type() == 'text/plain':
                # text/plain with no Content-Disposition: treat as the body
                body = part.get_payload(decode=True)
                continue
            # no Content-Disposition: fall back to the Content-Type 'name' parameter
            ctypes = part.get_params()
            if not ctypes:
                continue
            for key,val in ctypes:
                if key.lower() == 'name':
                    message += "Attachment:" + val + "\n"
                    break
        else:
            attachment,filename = False,None
            for key,val in dtypes:
                key = key.lower()
                if key == 'filename':
                    filename = val
                if key == 'attachment':
                    attachment = True
            if attachment:
                message += "Attachment:" + (filename or "(unnamed)") + "\n"
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)  # inline text part
    # append the body once, after all parts have been examined
    if body:
        message += "\n" + body + "\n"
    print(message)
    print("")

This should be enough to get you heading in the right direction.
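
On Python 3, the same walk can be written more compactly. The sketch below is not the recipe's code: it assumes Python 3.6 or later, swaps the hand-written parameter loops for part.get_filename(), and builds a small sample message locally (the addresses and file name are made up) so that it runs without a mail server. Like the POP3 version above, it still works on a fully downloaded message.

```python
import email
from email.message import EmailMessage


def summarize(raw_bytes):
    """Return (headers, plain-text body, attachment names) for one raw message."""
    mail = email.message_from_bytes(raw_bytes)
    headers = {name: mail.get(name, "") for name in ("From", "Subject", "Date")}
    body_parts, attachments = [], []
    for part in mail.walk():
        if part.is_multipart():
            continue
        # get_filename() checks the Content-Disposition 'filename' parameter,
        # then falls back to the Content-Type 'name' parameter.
        filename = part.get_filename()
        if filename:
            attachments.append(filename)
        elif part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    return headers, "\n".join(body_parts), attachments


# Hypothetical sample message, built locally for illustration only.
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["Subject"] = "Report"
sample["Date"] = "Sat, 20 Feb 2010 05:02:00 +0000"
sample.set_content("See attached.")
sample.add_attachment(b"%PDF-1.4", maintype="application",
                      subtype="pdf", filename="report.pdf")

headers, body, attachments = summarize(sample.as_bytes())
print(headers["Subject"], attachments)
```

With a real server, the bytes passed to summarize() would be the joined lines returned by retr(), or the payload of an IMAP fetch.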

swanson answered Oct 25 '22 20:10


You can get only the plain text of the email by doing something like:

connection.fetch(id, '(BODY[1])')

For the Gmail messages I've seen, section 1 holds the plain text, including some multipart junk. This may not be very robust, since the section numbering depends on how each message is structured.
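
As a rough sketch of how that fetch might be wrapped (Python 3, not tested against Gmail): the helper name fetch_section and the FakeConnection stub are made up for illustration, and the stub only returns a canned reply in the shape imaplib uses for literals. The one deliberate change from the call above is BODY.PEEK[...], which per RFC 3501 returns the same data as BODY[...] without setting the \Seen flag.

```python
def fetch_section(connection, msg_id, section="1"):
    """Fetch a single MIME section of one message and return its raw bytes."""
    # BODY.PEEK[...] is like BODY[...] but leaves the message unread.
    typ, data = connection.fetch(msg_id, "(BODY.PEEK[%s])" % section)
    if typ != "OK":
        raise RuntimeError("FETCH failed: %r" % (data,))
    # imaplib hands back literals as (envelope, payload) tuples;
    # the remaining items (such as the closing b")") are plain bytes.
    for item in data:
        if isinstance(item, tuple):
            return item[1]
    return b""


class FakeConnection:
    """Stand-in for an authenticated imaplib.IMAP4_SSL object (canned reply)."""

    def fetch(self, msg_id, parts):
        self.last_request = parts
        return "OK", [(b"1 (BODY[1] {5}", b"hello"), b")"]


conn = FakeConnection()
text = fetch_section(conn, "1")
print(conn.last_request, text)
```

When section 1 turns out to be a multipart/alternative container, asking for section "1.1" instead may return just the text/plain child, but that depends on the message's structure.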

I don't know how to get the name of an attachment without fetching the whole attachment. I haven't tried using partial fetches.
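
One avenue that might cover the attachment names is the IMAP BODYSTRUCTURE fetch item (RFC 3501), which describes each MIME part, including its name and filename parameters, without transferring the part contents. The following is only a sketch under assumptions: attachment_names and FakeStructureConnection are hypothetical names, the canned reply is hand-written rather than captured from Gmail, and the regular expression handles only quoted-string parameters (not IMAP literals or RFC 2231 encoded names).

```python
import re

# Matches quoted "name"/"filename" parameters in a BODYSTRUCTURE reply.
_NAME_PARAM = re.compile(rb'"(?:file)?name"\s+"((?:[^"\\]|\\.)*)"', re.IGNORECASE)


def attachment_names(connection, msg_id):
    """Return attachment names for one message without fetching part bodies."""
    typ, data = connection.fetch(msg_id, "(BODYSTRUCTURE)")
    if typ != "OK":
        raise RuntimeError("FETCH failed: %r" % (data,))
    names = []
    for item in data:
        raw = item[0] if isinstance(item, tuple) else item
        if not isinstance(raw, bytes):
            continue
        for match in _NAME_PARAM.finditer(raw):
            name = match.group(1).decode("utf-8", errors="replace")
            if name not in names:  # 'name' and 'filename' usually repeat
                names.append(name)
    return names


class FakeStructureConnection:
    """Stand-in for an imaplib connection; the reply below is illustrative."""

    def fetch(self, msg_id, parts):
        return "OK", [
            b'1 (BODYSTRUCTURE (("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL '
            b'"7BIT" 14 1 NIL NIL NIL)("APPLICATION" "PDF" ("NAME" "report.pdf") '
            b'NIL NIL "BASE64" 1024 NIL ("ATTACHMENT" ("FILENAME" "report.pdf")) '
            b'NIL) "MIXED" ("BOUNDARY" "xyz") NIL NIL))'
        ]


names = attachment_names(FakeStructureConnection(), "1")
print(names)
```

A regular expression is a shortcut here; a proper parser for the nested parenthesized list would be more reliable on real server replies.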

Dan Benamy answered Oct 25 '22 18:10