I try to read an email from a file, like this:
import email
with open("xxx.eml") as f:
    msg = email.message_from_file(f)
and I get this error:
Traceback (most recent call last):
  File "I:\fakt\real\maildecode.py", line 53, in <module>
    main()
  File "I:\fakt\real\maildecode.py", line 50, in main
    decode_file(infile, outfile)
  File "I:\fakt\real\maildecode.py", line 30, in decode_file
    msg = email.message_from_file(f) #, policy=mypol
  File "C:\Python33\lib\email\__init__.py", line 56, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "C:\Python33\lib\email\parser.py", line 55, in parse
    data = fp.read(8192)
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1920: character maps to <undefined>
The file contains a multipart email, where the part is encoded in UTF-8. The file's content or encoding might be broken, but I have to handle it anyway.
How can I read the file, even if it has Unicode errors? I cannot find the policy object compat32, and there seems to be no way to handle an exception and let Python continue right where the exception occurred.
What can I do?
To parse an email message in Python 3 without Unicode errors, read the file in binary mode and use email.message_from_binary_file(f) (or email.message_from_bytes(f.read())) to parse the content (see the documentation of the email.parser module).
Here is code that parses a message in a way that is compatible with Python 2 and 3:
import email
with open("xxx.eml", "rb") as f:
    try:
        msg = email.message_from_binary_file(f)  # Python 3
    except AttributeError:
        msg = email.message_from_file(f)  # Python 2
(tested with Python 2.7.13 and Python 3.6.0)
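Once the message is parsed from bytes, the text of each part still has to be decoded. The following is a minimal Python 3 sketch of one way to do that, not a definitive implementation. The sample message RAW and the helper name text_parts are made up for illustration, and falling back to UTF-8 when a part declares no charset is an assumption. Decoding with errors="replace" keeps going when the content is broken.

```python
import email

# Hypothetical sample: a multipart email whose only part is 8-bit UTF-8.
# The last character (U+0141) encodes to the bytes C5 81, so the raw data
# contains the byte 0x81 that cp1252 cannot decode.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b"Subject: test\r\n"
    b'Content-Type: multipart/mixed; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Gr\xc3\xbc\xc3\x9fe \xc5\x81\r\n"
    b"--XX--\r\n"
)


def text_parts(raw_bytes):
    """Return the decoded text of every text/* part (illustrative helper)."""
    msg = email.message_from_bytes(raw_bytes)
    result = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips the multipart container and any attachments
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        # Assumption: fall back to UTF-8 when the part declares no charset.
        charset = part.get_content_charset() or "utf-8"
        # Invalid byte sequences become U+FFFD instead of raising an error.
        result.append(payload.decode(charset, errors="replace"))
    return result


parts = text_parts(RAW)
```

A part that declares a charset name Python does not know would still raise LookupError here; a real script would probably want to catch that and retry with a fallback encoding.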
I can't test on your message, so I don't know if this will actually work, but you can do the string decoding yourself:
import email
with open("xxx.eml", encoding='utf-8', errors='replace') as f:
    text = f.read()
msg = email.message_from_string(text)
That's going to get you a lot of replacement characters if the message isn't actually in UTF-8. But if it's got \x81 in it, UTF-8 is my guess.
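To make the "replacement characters" concrete, here is a small sketch of what errors='replace' does. The byte string is an invented example, not taken from the message in the question.

```python
# A lone 0x81 byte is not valid UTF-8 on its own (it is a continuation byte).
broken = b"\x81abc"

# Strict decoding raises UnicodeDecodeError.
try:
    broken.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the bad byte becomes U+FFFD and decoding continues.
replaced = broken.decode("utf-8", errors="replace")
print(repr(replaced))  # '\ufffdabc'
```

The same errors argument is what open() passes through to the decoder when the file is read in text mode.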