How can I get an email message's text content using Python?

Tags: python, email, mime, rfc822

Given an RFC822 message in Python 2.6, how can I get the right text/plain content part? Basically, the algorithm I want is this:

    message = email.message_from_string(raw_message)
    if has_mime_part(message, "text/plain"):
        mime_part = get_mime_part(message, "text/plain")
        text_content = decode_mime_part(mime_part)
    elif has_mime_part(message, "text/html"):
        mime_part = get_mime_part(message, "text/html")
        html = decode_mime_part(mime_part)
        text_content = render_html_to_plaintext(html)
    else:
        # fallback
        text_content = str(message)
    return text_content

Of these things, I have get_mime_part and has_mime_part down pat, but I'm not quite sure how to get the decoded text from the MIME part. I can get the encoded text using get_payload(), but if I try to use the decode parameter of the get_payload() method (see the doc) I get an error when I call it on the text/plain part:

      File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/email/message.py", line 189, in get_payload
        raise TypeError('Expected list, got %s' % type(self._payload))
    TypeError: Expected list, got <type 'str'>

In addition, I don't know how to take HTML and render it to text as closely as possible.

701

asked Sep 22 '09 22:09

Chris R

2 Answers

In a multipart e-mail, email.message.Message.get_payload() returns a list with one item for each part. The easiest way is to walk the message and get the payload on each part:

    import email

    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        # each part is either a non-multipart part or another multipart
        # message that contains further parts; the message is a tree
        if part.get_content_type() == 'text/plain':
            print(part.get_payload())  # prints the raw (still encoded) text
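To see the tree that walk() traverses, here is a small sketch that builds a multipart/alternative message and lists the content type of every part. The sample message is made up for illustration, and the code uses Python 3 syntax; the same calls also exist in the Python 2.6 email package.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a made-up sample message with a plain-text and an HTML alternative.
sample = MIMEMultipart("alternative")
sample.attach(MIMEText("Hello in plain text", "plain"))
sample.attach(MIMEText("<p>Hello in <b>HTML</b></p>", "html"))

# Round-trip through a string, as if the message had just been received.
msg = message_from_string(sample.as_string())

# walk() yields the container first, then each sub-part depth-first.
content_types = [part.get_content_type() for part in msg.walk()]
print(content_types)
```

Note that the multipart container itself shows up in the walk, which is why the loop filters on the content type before calling get_payload().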

For a non-multipart message, no need to do all the walking. You can go straight to get_payload(), regardless of content_type.

    msg = email.message_from_string(raw_message)
    msg.get_payload()

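If the content is encoded (for example with base64 or quoted-printable), you need to pass None as the first parameter to get_payload(), followed by True: the first positional parameter is the part index i, and the decode flag is the second. Passing True as the first parameter makes get_payload() treat it as an index into a list of sub-parts, which is what raises the "Expected list, got <type 'str'>" error on a non-multipart part. Writing get_payload(decode=True) is equivalent. For example, suppose that my e-mail contains an MS Word document attachment: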

    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_type() == 'application/msword':
            name = part.get_param('name') or 'MyDoc.doc'
            f = open(name, 'wb')
            # You need None as the first param
            # because part.is_multipart() is False
            f.write(part.get_payload(None, True))
            f.close()
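The same decode flag applies to text parts. Below is a minimal sketch of the decode_mime_part helper the question asks about; the helper body, the default_charset parameter, and the sample message are assumptions for illustration, not part of the email package. It is written for Python 3, where get_payload(decode=True) returns bytes that still have to be decoded with the part's charset.

```python
from email import message_from_string
from email.mime.text import MIMEText


def decode_mime_part(part, default_charset="utf-8"):
    """Return a non-multipart part's body as text (hypothetical helper)."""
    # decode=True undoes the Content-Transfer-Encoding (base64 or
    # quoted-printable) and returns bytes; it returns None for multiparts.
    raw_bytes = part.get_payload(decode=True)
    # The charset comes from the Content-Type header; fall back if absent.
    charset = part.get_content_charset() or default_charset
    return raw_bytes.decode(charset, errors="replace")


# Made-up sample: with the utf-8 charset, MIMEText applies a transfer
# encoding, so the raw payload is not readable as-is.
sample = MIMEText("Caf\u00e9 au lait", "plain", "utf-8")
msg = message_from_string(sample.as_string())

print(msg.get_payload())      # still transfer-encoded
print(decode_mime_part(msg))  # decoded text
```

The errors="replace" argument is a design choice for the sketch: a mislabelled charset produces replacement characters instead of an exception.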

As for getting a reasonable plain-text approximation of an HTML part, I've found that html2text works pretty darn well.
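Putting the pieces together, the algorithm from the question might be sketched as follows. This is only an illustration under stated assumptions: the helper names (find_part, decode_part, get_text_content) are hypothetical, the sample message is made up, and the HTML step uses a crude tag stripper built on the standard library's html.parser as a stand-in, not html2text itself, which gives much better results. The code is written for Python 3.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Crude stand-in for html2text: keeps text nodes, drops all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def render_html_to_plaintext(html):
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(extractor.chunks).split())


def find_part(msg, content_type):
    """Return the first part with the given content type, or None."""
    for part in msg.walk():
        if part.get_content_type() == content_type:
            return part
    return None


def decode_part(part):
    """Undo the transfer encoding, then decode bytes with the part's charset."""
    charset = part.get_content_charset() or "utf-8"
    return part.get_payload(decode=True).decode(charset, errors="replace")


def get_text_content(raw_message):
    msg = message_from_string(raw_message)
    plain = find_part(msg, "text/plain")
    if plain is not None:
        return decode_part(plain)
    html = find_part(msg, "text/html")
    if html is not None:
        return render_html_to_plaintext(decode_part(html))
    return str(msg)  # fallback, as in the question


# Made-up sample containing only an HTML part.
sample = MIMEMultipart("alternative")
sample.attach(MIMEText("<p>Hello <b>world</b></p>", "html"))
text = get_text_content(sample.as_string())
print(text)
```

The sketch takes the first text/plain part it finds, so a plain-text attachment that precedes the body would be picked up instead; checking the Content-Disposition header is one way to refine that.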

146

answered Sep 23 '22 05:09

Jarret Hardie

Flat is better than nested ;)

    # Parsed messages are plain email.message.Message objects, so check
    # is_multipart() rather than isinstance(msg, MIMEMultipart).
    assert msg.is_multipart()

    for text in [k.get_payload() for k in msg.walk() if k.get_content_type() == 'text/plain']:
        print(text)

answered Sep 23 '22 05:09

guneysus
