Exact equivalent of `b'...'.decode("utf-8", "backslashreplace")` in Python 2

Tags: python, python-3.x, encoding, python-2.7, backwards-compatibility

In Python 3.5+, `.decode("utf-8", "backslashreplace")` is a pretty good option for dealing with partially-Unicode, partially-some-unknown-legacy-encoding binary strings. Valid UTF-8 sequences will be decoded and invalid ones will be preserved as escape sequences. For instance:

>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
¡\xa1

This loses the distinction between `b'\xc2\xa1\xa1'` and `b'\xc2\xa1\\xa1'`, but if you're in the "just get me something not too lossy that I can fix up by hand later" frame of mind, that's probably OK.
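To make the lost distinction concrete, here is a minimal check. It assumes Python 3.5 or later, since it relies on the built-in handler; the variable names are only illustrative.

```python
# Python 3.5+ only: uses the built-in "backslashreplace" handler for decoding.
invalid_byte = b'\xc2\xa1\xa1'      # valid UTF-8 for U+00A1, then a stray 0xA1 byte
literal_escape = b'\xc2\xa1\\xa1'   # valid UTF-8 for U+00A1, then the ASCII text \xa1

decoded_invalid = invalid_byte.decode("utf-8", "backslashreplace")
decoded_literal = literal_escape.decode("utf-8", "backslashreplace")

# The inputs differ, but the decoded text is identical.
print(invalid_byte != literal_escape)       # True
print(decoded_invalid == decoded_literal)   # True
```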

However, using `backslashreplace` for decoding is a new feature in Python 3.5. The program I'm working on also needs to support 3.4 and 2.7. In those versions, it throws an exception:

>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback

I have found an approximation, but not an exact equivalent:

>>> print(b'\xc2\xa1\xa1'.decode("latin1")
...       .encode("ascii", "backslashreplace").decode("ascii"))
\xc2\xa1\xa1
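The approximation falls short because Latin-1 maps every byte to one code point, so the ASCII encoding step escapes all non-ASCII bytes, including ones that formed valid UTF-8. A small side-by-side sketch (the second half assumes Python 3.5+ for the built-in handler):

```python
data = b'\xc2\xa1\xa1'

# Latin-1 maps each byte to one code point, so the ASCII encode step
# escapes all three non-ASCII bytes, valid UTF-8 or not.
approx = data.decode("latin1").encode("ascii", "backslashreplace").decode("ascii")
print(approx)  # \xc2\xa1\xa1

# Target behavior (Python 3.5+ only): the valid pair C2 A1 decodes to U+00A1
# and only the stray A1 byte is escaped.
exact = data.decode("utf-8", "backslashreplace")
print(exact == u'\xa1\\xa1')  # True
```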

It is very important that the behavior not depend on the interpreter version. Can anyone advise a way to get exactly the Python 3.5 behavior in 2.7 and 3.4?

(Older versions of either 2.x or 3.x do not need to work. Monkey patching `codecs` is totally acceptable.)


asked Mar 17 '17 14:03

zwol

2 Answers

I attempted a more complete backport of the CPython implementation.

This handles `UnicodeDecodeError` (from `.decode()`) as well as `UnicodeEncodeError` (from `.encode()`) and `UnicodeTranslateError` (from `.translate()`):

from __future__ import unicode_literals

import codecs


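def _bytes_repr(c):
    """py2: bytes, py3: int"""
    if not isinstance(c, int):
        c = ord(c)
    # Always two hex digits, matching the built-in handler's \xNN form.
    return '\\x{:02x}'.format(c)


def _text_repr(c):
    d = ord(c)
    # Like the built-in handler, pick the shortest escape that fits.
    if d < 0x100:
        return '\\x{:02x}'.format(d)
    elif d < 0x10000:
        return '\\u{:04x}'.format(d)
    else:
        return '\\U{:08x}'.format(d)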


def backslashescape_backport(ex):
    s, start, end = ex.object, ex.start, ex.end
    c_repr = _bytes_repr if isinstance(ex, UnicodeDecodeError) else _text_repr
    return ''.join(c_repr(c) for c in s[start:end]), end


codecs.register_error('backslashescape_backport', backslashescape_backport)

print(b'\xc2\xa1\xa1after'.decode('utf-8', 'backslashescape_backport'))
print(u'\u2603'.encode('latin1', 'backslashescape_backport'))
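One way to gain confidence in a backport like this is to compare it against the built-in handler on an interpreter that has it. The sketch below assumes Python 3.5+ for the comparison; the handler name `backport_check`, the helper names, and the sample inputs are all made up for illustration, and the handler is restated here only so the block is self-contained.

```python
import codecs


def _escape_byte(b):
    # Iterating bytes yields int on Python 3 and a 1-char str on Python 2.
    return u'\\x{:02x}'.format(b if isinstance(b, int) else ord(b))


def _escape_char(ch):
    cp = ord(ch)
    if cp < 0x100:
        return u'\\x{:02x}'.format(cp)
    if cp < 0x10000:
        return u'\\u{:04x}'.format(cp)
    return u'\\U{:08x}'.format(cp)


def backport_handler(ex):  # hypothetical name for this sketch
    escape = _escape_byte if isinstance(ex, UnicodeDecodeError) else _escape_char
    return u''.join(escape(c) for c in ex.object[ex.start:ex.end]), ex.end


codecs.register_error('backport_check', backport_handler)

# Illustrative inputs: stray bytes, truncated sequences, and text from
# each of the three escape ranges (\x, \u, \U).
decode_samples = [b'\xc2\xa1\xa1after', b'\xff\xfe', b'abc\xe2\x82', b'\xf0\x9f\x98']
encode_samples = [u'\xe9', u'\u2603', u'\U0001f600', u'a\xe9b\u2603']

for raw in decode_samples:
    expected = raw.decode('utf-8', 'backslashreplace')   # built-in, 3.5+
    assert raw.decode('utf-8', 'backport_check') == expected

for text in encode_samples:
    expected = text.encode('ascii', 'backslashreplace')
    assert text.encode('ascii', 'backport_check') == expected
```

A handful of samples is not proof of exact equivalence, but it does catch formatting differences such as one-digit `\x` escapes or `\u00e9` where the built-in produces `\xe9`.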


answered Sep 27 '22 18:09

Anthony Sottile

You can write your own error handler. Here's a solution that I tested on Python 2.7, 3.3 and 3.6:

from __future__ import print_function
import codecs
import sys

print(sys.version)

def myreplace(ex):
    # The error handler receives the UnicodeDecodeError, which carries the input
    # byte string and the start/end indexes of the bad portion.
    bstr, start, end = ex.object, ex.start, ex.end

    # The return value is a tuple of Unicode string and the index to continue conversion.
    # Note: iterating byte strings returns int on 3.x but str on 2.x
    return u''.join('\\x{:02x}'.format(c if isinstance(c, int) else ord(c))
                    for c in bstr[start:end]), end

codecs.register_error('myreplace', myreplace)
print(b'\xc2\xa1\xa1ABC'.decode("utf-8", "myreplace"))

Output:

C:\>py -2.7 test.py
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]
¡\xa1ABC

C:\>py -3.3 test.py
3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)]
¡\xa1ABC

C:\>py -3.6 test.py
3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)]
¡\xa1ABC
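Since the question requires that behavior not depend on the interpreter version, one option is to route every call through a single helper that always uses the custom handler, even on 3.5+ where the built-in exists. This is only a sketch: `decode_lossy` and the handler name `bytes_backslashreplace` are hypothetical names, and the handler deliberately covers decoding only.

```python
import codecs

_HANDLER_NAME = 'bytes_backslashreplace'  # hypothetical name chosen for this sketch


def _bytes_backslashreplace(ex):
    # This sketch only covers decoding; re-raise anything else unchanged.
    if not isinstance(ex, UnicodeDecodeError):
        raise ex
    bad = ex.object[ex.start:ex.end]
    # Iterating bytes yields int on Python 3 and a 1-char str on Python 2.
    escaped = u''.join(
        u'\\x{:02x}'.format(b if isinstance(b, int) else ord(b)) for b in bad
    )
    return escaped, ex.end


codecs.register_error(_HANDLER_NAME, _bytes_backslashreplace)


def decode_lossy(data, encoding='utf-8'):
    """Decode bytes, keeping undecodable bytes as \\xNN escapes."""
    return data.decode(encoding, _HANDLER_NAME)


print(decode_lossy(b'\xc2\xa1\xa1ABC') == u'\xa1\\xa1ABC')  # True
```

Using the same handler everywhere means the same code path runs on 2.7, 3.4 and 3.5+, at the cost of not using the built-in where it is available.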

answered Sep 27 '22 17:09

Mark Tolonen
