In Python 3.5+, .decode("utf-8", "backslashreplace") is a pretty good option for dealing with partially-Unicode, partially-some-unknown-legacy-encoding binary strings. Valid UTF-8 sequences will be decoded and invalid ones will be preserved as escape sequences. For instance:
>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
¡\xa1
This loses the distinction between b'\xc2\xa1\xa1' and b'\xc2\xa1\\xa1', but if you're in the "just get me something not too lossy that I can fix up by hand later" frame of mind, that's probably OK.
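To make the lost distinction concrete, here is a minimal sketch (it assumes Python 3.5 or newer, where "backslashreplace" is accepted for decoding; the variable names are only illustrative). Two different byte strings come out as the same text:

```python
# Requires Python 3.5+, where "backslashreplace" works for decoding.
invalid = b'\xc2\xa1\xa1'     # valid U+00A1, then a stray 0xA1 byte
literal = b'\xc2\xa1\\xa1'    # valid U+00A1, then the four ASCII characters \xa1

decoded_invalid = invalid.decode("utf-8", "backslashreplace")
decoded_literal = literal.decode("utf-8", "backslashreplace")

# The inputs differ (3 bytes versus 6 bytes) but the decoded text is identical,
# so the original bytes cannot be recovered from the result alone.
print(len(invalid), len(literal))
print(decoded_invalid == decoded_literal)
```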
However, this is a new feature in Python 3.5. The program I'm working on also needs to support 3.4 and 2.7. In those versions, it throws an exception:
>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback
I have found an approximation, but not an exact equivalent:
>>> print(b'\xc2\xa1\xa1'.decode("latin1")
... .encode("ascii", "backslashreplace").decode("ascii"))
\xc2\xa1\xa1
It is very important that the behavior not depend on the interpreter version. Can anyone advise a way to get exactly the Python 3.5 behavior in 2.7 and 3.4?
(Older versions of either 2.x or 3.x do not need to work. Monkey patching codecs is totally acceptable.)
To decode UTF-8 data, use the decode() method of a byte string (bytes in Python 3, str in Python 2). It accepts two arguments, encoding and errors: encoding names the encoding of the input, and errors selects how decoding errors are handled.
UTF-8 is a byte-oriented encoding: each character is represented by a specific sequence of one to four bytes.
In Python 2, the default encoding is ASCII (unfortunately). UTF-16 is also variable-width, using 2 or 4 bytes per character, which suits Asian text because most of it fits in 2 bytes per character.
decode() is the inverse of encode(): it takes bytes in the named encoding and returns the corresponding Unicode string.
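As a small illustration of the errors argument, the sketch below runs the question's sample bytes through the handlers that exist for decoding on every version discussed here ('strict', 'ignore', 'replace'). The variable names are only illustrative:

```python
data = b'\xc2\xa1\xa1'   # valid U+00A1 followed by a stray 0xA1 byte

# 'ignore' drops the undecodable byte; 'replace' substitutes U+FFFD.
ignored = data.decode('utf-8', 'ignore')
replaced = data.decode('utf-8', 'replace')

# 'strict' is the default and raises instead of guessing.
try:
    data.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    # The exception records which slice of the input was bad.
    bad_range = (exc.start, exc.end)
```

The start and end attributes on the exception are what a custom error handler gets to work with.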
I attempted a more complete backport of the CPython implementation. This handles UnicodeDecodeError (from .decode()) as well as UnicodeEncodeError from .encode() and UnicodeTranslateError from .translate():
from __future__ import unicode_literals

import codecs


def _bytes_repr(c):
    """py2: bytes, py3: int"""
    if not isinstance(c, int):
        c = ord(c)
    # Two hex digits, zero-padded, to match CPython's \xNN output.
    return '\\x{:02x}'.format(c)


def _text_repr(c):
    d = ord(c)
    if d >= 0x10000:
        return '\\U{:08x}'.format(d)
    elif d >= 0x100:
        return '\\u{:04x}'.format(d)
    else:
        # CPython escapes code points below 0x100 as \xNN.
        return '\\x{:02x}'.format(d)


def backslashescape_backport(ex):
    s, start, end = ex.object, ex.start, ex.end
    c_repr = _bytes_repr if isinstance(ex, UnicodeDecodeError) else _text_repr
    return ''.join(c_repr(c) for c in s[start:end]), end


codecs.register_error('backslashescape_backport', backslashescape_backport)

print(b'\xc2\xa1\xa1after'.decode('utf-8', 'backslashescape_backport'))
print(u'\u2603'.encode('latin1', 'backslashescape_backport'))
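On the encode side, the built-in backslashreplace handler already exists in every version, so a backport can be compared against it directly. The sketch below is written with Python 3 in mind. The handler name text_escape_sketch and the sample list are hypothetical. It checks that the three escape widths (\xNN, \uNNNN, \UNNNNNNNN) agree with the built-in handler:

```python
import codecs


def text_escape(ex):
    """Hypothetical encode-side handler mirroring the built-in escape widths."""
    parts = []
    for ch in ex.object[ex.start:ex.end]:
        cp = ord(ch)
        if cp < 0x100:
            parts.append(u'\\x{:02x}'.format(cp))
        elif cp < 0x10000:
            parts.append(u'\\u{:04x}'.format(cp))
        else:
            parts.append(u'\\U{:08x}'.format(cp))
    return u''.join(parts), ex.end


codecs.register_error('text_escape_sketch', text_escape)

# One sample per escape width, plus a mixed string.
samples = [u'\xa1', u'\u2603', u'\U0001f600', u'ok \xa1\u2603']

# Pair each custom result with the built-in result for the same input.
results = [(s.encode('ascii', 'text_escape_sketch'),
            s.encode('ascii', 'backslashreplace')) for s in samples]
print(all(custom == builtin for custom, builtin in results))
```

Encoding to ASCII rather than Latin-1 matters here: Latin-1 can represent every code point below 0x100, so the \xNN branch would never be exercised.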
You can write your own error handler. Here's a solution that I tested on Python 2.7, 3.3 and 3.6:
from __future__ import print_function
import codecs
import sys

print(sys.version)

def myreplace(ex):
    # The error handler receives the UnicodeDecodeError, which carries the
    # input byte string and the start/end indexes of the bad portion.
    bstr, start, end = ex.object, ex.start, ex.end

    # The return value is a tuple of a Unicode string and the index at which to continue conversion.
    # Note: iterating byte strings returns int on 3.x but str on 2.x
    return u''.join('\\x{:02x}'.format(c if isinstance(c, int) else ord(c))
                    for c in bstr[start:end]), end

codecs.register_error('myreplace', myreplace)
print(b'\xc2\xa1\xa1ABC'.decode("utf-8", "myreplace"))
Output:
C:\>py -2.7 test.py
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]
¡\xa1ABC

C:\>py -3.3 test.py
3.3.5 (v3.3.5:62cf4e77f785, Mar 9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)]
¡\xa1ABC

C:\>py -3.6 test.py
3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)]
¡\xa1ABC
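Since the question stresses that behavior must not depend on the interpreter version, one way to gain confidence is to compare a custom handler against the built-in one wherever the built-in one is available. The following is a sketch under that assumption. The names escape_bytes, escape_bytes_sketch, and decode_lossy are hypothetical, and the sample list is far from exhaustive:

```python
import codecs
import sys


def escape_bytes(ex):
    """Hypothetical decode-only handler: emit \\xNN for each undecodable byte."""
    if not isinstance(ex, UnicodeDecodeError):
        raise ex
    # bytearray yields ints on both 2.x and 3.x, avoiding the str/int split.
    bad = bytearray(ex.object[ex.start:ex.end])
    return u''.join(u'\\x{:02x}'.format(b) for b in bad), ex.end


codecs.register_error('escape_bytes_sketch', escape_bytes)


def decode_lossy(data):
    """Always use the custom handler so every interpreter takes the same path."""
    return data.decode('utf-8', 'escape_bytes_sketch')


# Stray continuation byte, invalid start bytes, truncated sequences, plain ASCII.
samples = [b'\xc2\xa1\xa1', b'\xff\xfeabc', b'\xe2\x82', b'plain', b'\xf0\x9f\x98']

if sys.version_info >= (3, 5):
    # Only 3.5+ accepts the built-in handler for decoding; compare where possible.
    mismatches = [s for s in samples
                  if decode_lossy(s) != s.decode('utf-8', 'backslashreplace')]
else:
    mismatches = []
print(mismatches)
```

Calling the custom handler on every version, including 3.5+, means there is only one code path to reason about. The comparison against the built-in handler then serves purely as a regression check.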