
How can I get an email message's text content using Python?

Given an RFC822 message in Python 2.6, how can I get the right text/plain content part? Basically, the algorithm I want is this:

    message = email.message_from_string(raw_message)
    if has_mime_part(message, "text/plain"):
        mime_part = get_mime_part(message, "text/plain")
        text_content = decode_mime_part(mime_part)
    elif has_mime_part(message, "text/html"):
        mime_part = get_mime_part(message, "text/html")
        html = decode_mime_part(mime_part)
        text_content = render_html_to_plaintext(html)
    else:
        # fallback
        text_content = str(message)
    return text_content

Of these things, I have get_mime_part and has_mime_part down pat, but I'm not quite sure how to get the decoded text from the MIME part. I can get the encoded text using get_payload(), but if I try to use the decode parameter of the get_payload() method (see the doc) I get an error when I call it on the text/plain part:

      File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/email/message.py", line 189, in get_payload
        raise TypeError('Expected list, got %s' % type(self._payload))
    TypeError: Expected list, got <type 'str'>

In addition, I don't know how to take HTML and render it to text as closely as possible.

Chris R asked Sep 22 '09 22:09





2 Answers

In a multipart e-mail, email.message.Message.get_payload() returns a list with one item for each part. The easiest way is to walk the message and get the payload on each part:

    import email

    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        # Each part is either a non-multipart, or another multipart message
        # that contains further parts... the message is organized like a tree.
        if part.get_content_type() == 'text/plain':
            print(part.get_payload())  # prints the raw text
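To make the tree walk concrete, here is a minimal sketch that builds a small multipart/alternative message and pulls out the first part of a given type. It is written for Python 3 (the question targets 2.6, where the same `walk()` and `get_content_type()` calls exist); `find_part` and the sample message are hypothetical names of mine, not something from the question.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a small multipart/alternative message so the example is self-contained.
sample = MIMEMultipart("alternative")
sample["Subject"] = "Demo"
sample.attach(MIMEText("Hello in plain text", "plain"))
sample.attach(MIMEText("<p>Hello in <b>HTML</b></p>", "html"))
raw_message = sample.as_string()

msg = email.message_from_string(raw_message)


def find_part(message, content_type):
    """Return the first part with the given content type, or None.

    Hypothetical helper, roughly what get_mime_part in the question might do.
    """
    for part in message.walk():
        if part.get_content_type() == content_type:
            return part
    return None


# walk() yields the container itself first, then each nested part in order.
content_types = [part.get_content_type() for part in msg.walk()]
plain_part = find_part(msg, "text/plain")
```

Note that `walk()` also yields the multipart container itself, so filtering on the content type is what keeps you from calling `get_payload()` on a list-valued part.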

For a non-multipart message, no need to do all the walking. You can go straight to get_payload(), regardless of content_type.

    msg = email.message_from_string(raw_message)
    msg.get_payload()
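A quick illustration of the non-multipart case, assuming Python 3 and a made-up message with no MIME headers at all:

```python
import email

# A plain RFC822 message without any MIME structure (sample data, not from the question).
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Plain note\n"
    "\n"
    "Just one body, no MIME parts.\n"
)

msg = email.message_from_string(raw_message)

is_multipart = msg.is_multipart()      # False: there is only one body
content_type = msg.get_content_type()  # falls back to text/plain when no header is present
body = msg.get_payload()               # the body as a string, not a list of parts
```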

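If the content is encoded (base64 or quoted-printable, for instance), you need to pass None as the first parameter to get_payload(), followed by True (the decode flag is the second parameter); get_payload(decode=True) is equivalent. Passing True as the only positional argument makes it the part index rather than the decode flag, which is what raises the "Expected list" TypeError on a non-multipart part. For example, suppose that my e-mail contains an MS Word document attachment: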

    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_type() == 'application/msword':
            name = part.get_param('name') or 'MyDoc.doc'
            f = open(name, 'wb')
            # You need None as the first param because part.is_multipart() is False
            f.write(part.get_payload(None, True))
            f.close()
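The same call covers the text/plain case from the question. Below is a rough sketch, assuming Python 3, where the decoded payload comes back as bytes and still has to be converted using the part's charset. `decode_mime_part` borrows its name from the question's pseudocode, and the fallback charset and the "replace" error handling are my assumptions, not anything the email package prescribes.

```python
import email
from email.mime.text import MIMEText

# A UTF-8 text part; MIMEText base64-encodes it, so the raw payload is not readable as-is.
original_text = "Caf\u00e9 au lait"
raw_message = MIMEText(original_text, "plain", "utf-8").as_string()
part = email.message_from_string(raw_message)


def decode_mime_part(part, default_charset="ascii"):
    """Decode a non-multipart part to text (hypothetical helper).

    get_payload(decode=True) undoes the Content-Transfer-Encoding and returns
    bytes; for a multipart container it returns None, so only pass leaf parts.
    """
    payload = part.get_payload(decode=True)
    charset = part.get_content_charset() or default_charset
    return payload.decode(charset, "replace")


# Passing the flag positionally makes it the part index, not the decode flag.
try:
    part.get_payload(True)
    positional_error = None
except TypeError as exc:
    positional_error = exc

text_content = decode_mime_part(part)
```

Using the keyword form `decode=True` avoids having to remember that the index comes first.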

As for getting a reasonable plain-text approximation of an HTML part, I've found that html2text works pretty darn well.
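If pulling in a third-party package is not an option, a much cruder fallback can be put together with the standard library. To be clear, this is not html2text and produces nothing like its Markdown-style output: it is a sketch, assuming Python 3's `html.parser`, that just collects text nodes and breaks lines at a handful of block tags. `TextExtractor` and the tag sets are hypothetical choices of mine; `render_html_to_plaintext` takes its name from the question's pseudocode.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities like &amp; arrive decoded
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def render_html_to_plaintext(html):
    """Very rough HTML-to-text conversion: one output line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in "".join(parser.chunks).splitlines()]
    return "\n".join(line for line in lines if line)


sample_html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p><p>Second &amp; last</p></body></html>"
)
rendered = render_html_to_plaintext(sample_html)
```

Links, lists and tables lose their structure with this approach, which is where a dedicated converter earns its keep.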

Jarret Hardie answered Sep 23 '22 05:09



Flat is better than nested ;)

    # A parsed message is an email.message.Message rather than a MIMEMultipart
    # instance, so check is_multipart() instead of the class.
    assert msg.is_multipart()

    for payload in [k.get_payload() for k in msg.walk() if k.get_content_type() == 'text/plain']:
        print(payload)
guneysus answered Sep 23 '22 05:09
