
How to handle Python 3.x UnicodeDecodeError in Email package?

I am trying to read an email from a file, like this:

import email
with open("xxx.eml") as f:
   msg = email.message_from_file(f)

and I get this error:

Traceback (most recent call last):
  File "I:\fakt\real\maildecode.py", line 53, in <module>
    main()
  File "I:\fakt\real\maildecode.py", line 50, in main
    decode_file(infile, outfile)
  File "I:\fakt\real\maildecode.py", line 30, in decode_file
    msg = email.message_from_file(f)  #, policy=mypol
  File "C:\Python33\lib\email\__init__.py", line 56, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "C:\Python33\lib\email\parser.py", line 55, in parse
    data = fp.read(8192)
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1920: character maps to <undefined>

The file contains a multipart email, where the part is encoded in UTF-8. The file's content or encoding might be broken, but I have to handle it anyway.

How can I read the file even if it contains Unicode errors? I cannot find the compat32 policy object, and there seems to be no way to handle the exception and let Python continue right where it occurred.

What can I do?

cxxl asked May 02 '13 16:05


2 Answers

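To parse an email message in Python 3 without Unicode errors, open the file in binary mode and parse it with the function email.message_from_binary_file(f) (or email.message_from_bytes(f.read())); see the documentation of the email.parser module. In text mode, open() decodes the file with the platform's default encoding (cp1252 in the traceback above), and that decoding step is what fails on byte 0x81. In binary mode the parser receives the raw bytes, so nothing is decoded while the file is read.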

Here is code that parses a message in a way that is compatible with Python 2 and 3:

import email
with open("xxx.eml", "rb") as f:
    try:
        msg = email.message_from_binary_file(f)  # Python 3
    except AttributeError:
        msg = email.message_from_file(f)  # Python 2

(tested with Python 2.7.13 and Python 3.6.0)
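Once the message has been parsed from bytes, the text of each part still has to be decoded at some point. Here is a minimal Python 3 sketch of one way to do that without raising on broken bytes. The sample message in `raw` is made up for illustration (a single-part message with a stray 0x81 byte), and falling back to UTF-8 when a part declares no charset is an assumption, not something the email package prescribes:

```python
import email

# Hypothetical sample: a UTF-8 body that also contains a stray 0x81 byte.
raw = (
    b"Subject: test\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xc3\xa9 \x81 end\r\n"
)

# Parsing from bytes does not raise on the invalid byte.
msg = email.message_from_bytes(raw)

texts = []
# walk() yields the message itself and, for multipart mail, every sub-part.
for part in msg.walk():
    if part.get_content_maintype() != "text":
        continue
    payload = part.get_payload(decode=True)  # body as raw bytes
    # Assumption: fall back to UTF-8 if the part declares no charset.
    charset = part.get_content_charset() or "utf-8"
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    # An unknown charset name would still raise LookupError here.
    texts.append(payload.decode(charset, errors="replace"))
```

With a real file, `raw` would simply be replaced by the result of `message_from_binary_file(f)` as shown above; the loop stays the same.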

Rob W answered Sep 23 '22 17:09


I can't test on your message, so I don't know if this will actually work, but you can do the string decoding yourself:

import email
with open("xxx.eml", encoding='utf-8', errors='replace') as f:
    text = f.read()
    msg = email.message_from_string(text)  # pass the decoded text, not the file object

That's going to get you a lot of replacement characters if the message isn't actually in UTF-8. But if it's got \x81 in it, UTF-8 is my guess.
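To illustrate why \x81 points towards UTF-8, here is a small sketch with made-up bytes (not taken from the asker's file). In UTF-8, 0x81 is a legal continuation byte, for example as the second byte of "Á", while Python's cp1252 codec has no character assigned to it, which is exactly the error in the question's traceback:

```python
# Hypothetical sample bytes: "Á" encoded as UTF-8 is C3 81.
data = b"\xc3\x81"

# Decoding as UTF-8 works, because 0x81 follows a valid lead byte.
decoded = data.decode("utf-8")

# Decoding the same bytes as cp1252 fails on 0x81.
try:
    data.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# A lone 0x81 is invalid UTF-8; errors="replace" yields U+FFFD instead of raising.
replaced = b"\x81".decode("utf-8", errors="replace")
```

This only shows that the byte is consistent with UTF-8; it does not prove the whole file is valid UTF-8, so replacement characters may still show up.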

rspeer answered Sep 25 '22 17:09