
Python 3 email body encoding

I am working on setting up a script that forwards incoming mail to a list of recipients.

Here's what I have now:

I read the email from stdin (that's how postfix passes it):

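import sys
from email.parser import Parser

email_in = sys.stdin.read()

# parsestr() takes a string; parse() expects a file-like object
incoming = Parser().parsestr(email_in)

sender = incoming['from']
this_address = incoming['to']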

I test for multipart:

if incoming.is_multipart():
    for payload in incoming.get_payload():
        # if payload.is_multipart(): ...
        body = payload.get_payload()
else:
    body = incoming.get_payload(decode=True)

I set up the outgoing message:

msg = MIMEMultipart()
msg['Subject'] = incoming['subject']
msg['From'] = this_address
msg['reply-to'] = sender
msg['To'] = "[email protected]"
msg.attach(MIMEText(body.encode('utf-8'), 'html', _charset='UTF-8'))

s = smtplib.SMTP('localhost')
s.send_message(msg)
s.quit()

This works pretty well with ASCII characters (English text), forwards it and all.

When I send non-ASCII characters, though, the forwarded message comes out as gibberish (depending on the email client, either raw bytes or ASCII representations of the UTF-8 characters).

What can be the problem? Is it on the incoming or the outgoing side?

fonorobert asked Nov 18 '14 15:11




1 Answer

The problem is that many email clients (including Gmail) send non-ASCII messages base64-encoded. Reading stdin gives you the whole raw message as a string, and when you parse that with Parser, get_payload() without arguments returns the body as a string that still contains the base64 text.

Instead, pass the optional decode=True argument to get_payload(). With it set, the method undoes the transfer encoding and returns a bytes object. You can then call the built-in decode() method on that to get a str (this assumes the part's charset is UTF-8), like so:

body = payload.get_payload(decode=True)
body = body.decode('utf-8')
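
To see the difference on a concrete message, here is a minimal sketch. It assumes a single-part, base64-encoded UTF-8 message; the addresses, the sample text, and the decode_part helper are made up for illustration and are not part of my script. Instead of hard-coding 'utf-8' it asks the part for its declared charset and only falls back to UTF-8 when none is given:

```python
from email.message import EmailMessage
from email.parser import Parser

# Build a small base64-encoded test message, standing in for what postfix
# would pipe to the script on stdin. Addresses and text are placeholders.
original = EmailMessage()
original["From"] = "sender@example.com"
original["To"] = "list@example.com"
original["Subject"] = "Test"
original.set_content("Árvíztűrő tükörfúrógép", charset="utf-8", cte="base64")
raw = original.as_string()

incoming = Parser().parsestr(raw)


def decode_part(part, fallback="utf-8"):
    """Return the text of a non-multipart part as a str (hypothetical helper)."""
    payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
    charset = part.get_content_charset() or fallback
    return payload.decode(charset, errors="replace")


undecoded = incoming.get_payload()  # str that still holds the base64 text
body = decode_part(incoming)        # str with the actual characters

print(repr(undecoded))
print(repr(body))
```

For a multipart message the same helper could be applied to each text part, for example while iterating over incoming.walk() and skipping parts where is_multipart() is true.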

There is great insight into UTF-8 and Python in Ned Batchelder's talk.

My final code works a bit differently; you can check that here, too.
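
On the outgoing side, once the body is a str there is no need to call encode() on it again: as far as I know, MIMEText takes a str plus a charset and handles the transfer encoding itself. The following is only a rough sketch, not my final code; build_forward and all addresses are hypothetical, and the smtplib part is left out because it stays as in the question:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_forward(subject, this_address, sender, recipient, body):
    """Build the outgoing message from an already-decoded str body."""
    msg = MIMEMultipart()
    msg["Subject"] = subject
    msg["From"] = this_address
    msg["Reply-To"] = sender
    msg["To"] = recipient
    # Pass the str itself; MIMEText encodes it using the given charset.
    msg.attach(MIMEText(body, "html", _charset="utf-8"))
    return msg


msg = build_forward(
    "Test",
    "list@example.com",      # placeholder for this_address
    "sender@example.com",    # placeholder for the original sender
    "member@example.com",    # placeholder recipient
    "<p>Árvíztűrő tükörfúrógép</p>",
)

# Read the attached part back to check that the text survives unchanged.
part = msg.get_payload()[0]
round_trip = part.get_payload(decode=True).decode(part.get_content_charset())
print(round_trip)
```

The resulting msg can be handed to s.send_message(msg) exactly as in the question.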

fonorobert answered Oct 13 '22 09:10