
Exact equivalent of `b'...'.decode("utf-8", "backslashreplace")` in Python 2

In Python 3.5+, .decode("utf-8", "backslashreplace") is a pretty good option for dealing with binary strings that are partly UTF-8 and partly some unknown legacy encoding. Valid UTF-8 sequences are decoded, and invalid bytes are preserved as escape sequences. For instance:

>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
¡\xa1

This loses the distinction between b'\xc2\xa1\xa1' and b'\xc2\xa1\\xa1', but if you're in the "just get me something not too lossy that I can fix up by hand later" frame of mind, that's probably OK.
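To make the loss concrete, here is a minimal sketch (assuming Python 3.5 or later, where the handler works for decoding) showing that the two byte strings mentioned above decode to the same text. The variable names are only illustrative.

```python
# Requires Python 3.5+, where "backslashreplace" is supported for decoding.
invalid_byte = b'\xc2\xa1\xa1'      # valid U+00A1 followed by a stray 0xA1 byte
literal_escape = b'\xc2\xa1\\xa1'   # valid U+00A1 followed by the ASCII text \xa1

first = invalid_byte.decode("utf-8", "backslashreplace")
second = literal_escape.decode("utf-8", "backslashreplace")

# Both inputs produce the same six-character string, so the original bytes
# cannot be recovered unambiguously from the decoded text.
collide = (first == second)
print(collide)
```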

However, support for "backslashreplace" when decoding is new in Python 3.5; earlier versions only accept it when encoding. The program I'm working on also needs to support 3.4 and 2.7. In those versions, decoding with it throws an exception:

>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback
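One way to find out at runtime whether the interpreter supports this is to try it and catch the TypeError shown above. This is only a sketch; the function name is hypothetical, and it assumes that the older interpreters fail with TypeError exactly as in the traceback.

```python
def supports_backslashreplace_decode():
    """Return True if 'backslashreplace' can be used as a decode error handler."""
    try:
        # 0xFF is never valid in UTF-8, so the error handler is always invoked.
        b'\xff'.decode("utf-8", "backslashreplace")
    except TypeError:
        # Older interpreters (2.7, 3.4 in the question) reject the handler here.
        return False
    return True


supported = supports_backslashreplace_decode()
print(supported)
```

Feature detection like this tells you which interpreters need a workaround, but it does not by itself give identical behavior everywhere.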

I have found an approximation, but not an exact equivalent:

>>> print(b'\xc2\xa1\xa1'.decode("latin1")
...       .encode("ascii", "backslashreplace").decode("ascii"))
\xc2\xa1\xa1
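The reason this is only an approximation is that latin1 maps every byte to a character, so the valid UTF-8 sequence is escaped along with the invalid byte. A small sketch of the difference, assuming Python 3.5+ so that the real handler is available for comparison:

```python
data = b'\xc2\xa1\xa1'

# Approximation: every non-ASCII byte is escaped, including valid UTF-8.
approx = (data.decode("latin1")
              .encode("ascii", "backslashreplace")
              .decode("ascii"))

# Python 3.5+ behavior: valid UTF-8 is decoded, only the stray byte is escaped.
exact = data.decode("utf-8", "backslashreplace")

print(approx == exact)
```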

It is very important that the behavior not depend on the interpreter version. Can anyone advise a way to get exactly the Python 3.5 behavior in 2.7 and 3.4?

(Older versions of either 2.x or 3.x do not need to work. Monkey patching codecs is totally acceptable.)

zwol asked Mar 17 '17 14:03




2 Answers

I attempted a more complete backport of the CPython implementation.

This handles both UnicodeDecodeError (from .decode()) as well as UnicodeEncodeError from .encode() and UnicodeTranslateError from .translate():

from __future__ import unicode_literals

import codecs


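def _bytes_repr(c):
    """py2: bytes, py3: int"""
    if not isinstance(c, int):
        c = ord(c)
    # Always two hex digits, matching CPython (e.g. \x0a rather than \xa).
    return '\\x{:02x}'.format(c)


def _text_repr(c):
    d = ord(c)
    # CPython picks the shortest escape form that fits the code point.
    if d < 0x100:
        return '\\x{:02x}'.format(d)
    elif d < 0x10000:
        return '\\u{:04x}'.format(d)
    else:
        return '\\U{:08x}'.format(d)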


def backslashescape_backport(ex):
    s, start, end = ex.object, ex.start, ex.end
    c_repr = _bytes_repr if isinstance(ex, UnicodeDecodeError) else _text_repr
    return ''.join(c_repr(c) for c in s[start:end]), end


codecs.register_error('backslashescape_backport', backslashescape_backport)

print(b'\xc2\xa1\xa1after'.decode('utf-8', 'backslashescape_backport'))
print(u'\u2603'.encode('latin1', 'backslashescape_backport'))
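
A handler of this kind can be checked against the built-in one on an interpreter that has it. The following is a self-contained sketch, not part of the original answer: the handler name backslashreplace_compat and the sample inputs are made up for illustration, and the comparison itself only runs on Python 3.5+.

```python
from __future__ import unicode_literals

import codecs


def _escape_byte(b):
    # Iterating bytes yields int on Python 3 and a 1-char str on Python 2.
    if not isinstance(b, int):
        b = ord(b)
    return '\\x{:02x}'.format(b)


def _escape_char(ch):
    cp = ord(ch)
    if cp < 0x100:
        return '\\x{:02x}'.format(cp)
    if cp < 0x10000:
        return '\\u{:04x}'.format(cp)
    return '\\U{:08x}'.format(cp)


def backslashreplace_compat(ex):
    # Decode errors carry bytes; encode and translate errors carry text.
    escape = _escape_byte if isinstance(ex, UnicodeDecodeError) else _escape_char
    bad = ex.object[ex.start:ex.end]
    return ''.join(escape(x) for x in bad), ex.end


codecs.register_error('backslashreplace_compat', backslashreplace_compat)

# Hypothetical sample inputs: stray byte, invalid lead bytes, clean ASCII,
# and a truncated three-byte sequence.
samples = [b'\xc2\xa1\xa1', b'\xff\xfeabc', b'ok', b'\xe2\x98']

decoded = [s.decode('utf-8', 'backslashreplace_compat') for s in samples]
reference = [s.decode('utf-8', 'backslashreplace') for s in samples]  # 3.5+ only
encoded = '\xe9\u2603\U0001f600'.encode('ascii', 'backslashreplace_compat')

print(decoded == reference)
```

Registering the handler under its own name, rather than overriding "backslashreplace", means the same code path is used on every interpreter version.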
Anthony Sottile answered Sep 27 '22 18:09



You can write your own error handler. Here's a solution that I tested on Python 2.7, 3.3 and 3.6:

from __future__ import print_function
import codecs
import sys

print(sys.version)

def myreplace(ex):
    # The error handler receives the UnicodeDecodeError, which carries the
    # original byte string and the start/end indexes of the undecodable portion.
    bstr, start, end = ex.object, ex.start, ex.end

    # The return value is a tuple of (replacement Unicode string, index at
    # which to resume decoding).
    # Note: iterating a byte string yields int on 3.x but str on 2.x.
    return u''.join('\\x{:02x}'.format(c if isinstance(c, int) else ord(c))
                    for c in bstr[start:end]), end

codecs.register_error('myreplace',myreplace)
print(b'\xc2\xa1\xa1ABC'.decode("utf-8", "myreplace"))

Output:

C:\>py -2.7 test.py
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]
¡\xa1ABC

C:\>py -3.3 test.py
3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)]
¡\xa1ABC

C:\>py -3.6 test.py
3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)]
¡\xa1ABC
Mark Tolonen answered Sep 27 '22 17:09
