How to do surrogateescape in python2

Question

Python 3 changed its Unicode behaviour to reject surrogates when decoding UTF-8, while Python 2 still accepts them.

There's a question here

But it does not provide a solution for removing surrogate pairs in Python 2, or for doing a surrogate escape.

Python3 example:

>>> a = b'\xed\xa0\xbd\xe4\xbd\xa0\xe5\xa5\xbd'
>>> a.decode('utf-8', 'surrogateescape')
'\udced\udca0\udcbd你好'
>>> a.decode('utf-8', 'ignore')
'你好'

The bytes '\xed\xa0\xbd' here are not valid UTF-8, and I want to either ignore them or escape them.

Is it possible to do the same thing in python2?

proski · Accepted Answer

There is no built-in solution, but there is an implementation of surrogateescape in python-future: https://github.com/PythonCharmers/python-future

Add from future.utils.surrogateescape import register_surrogateescape to your imports, then call the function register_surrogateescape() once. After that you can use the errors='surrogateescape' error handler in encode and decode.

An example can be found here
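
To illustrate what registering such an error handler involves, here is a minimal sketch that uses only the standard library's codecs.register_error. It is not python-future's implementation: the handler name "surrogateescape_sketch" and the function escape_undecodable are hypothetical names chosen for this example, and only the decoding direction is covered. The sample input uses the byte 0xff instead of the bytes from the question, on the assumption that it is rejected by the UTF-8 decoder on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3


def escape_undecodable(exc):
    """Map each undecodable byte 0xNN to the lone surrogate U+DCNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch handles decoding only
    bad = bytearray(exc.object[exc.start:exc.end])
    if any(b < 0x80 for b in bad):
        raise exc  # ASCII bytes are never escaped
    return u"".join(unichr(0xDC00 + b) for b in bad), exc.end


# Hypothetical handler name, chosen so it does not shadow Python 3's built-in one.
codecs.register_error("surrogateescape_sketch", escape_undecodable)

raw = b"\xff\xe4\xbd\xa0\xe5\xa5\xbd"  # one invalid byte followed by valid UTF-8
escaped = raw.decode("utf-8", "surrogateescape_sketch")
ignored = raw.decode("utf-8", "ignore")
```

Here escaped keeps the bad byte as u'\udcff' in front of the two valid characters, while ignored drops it. Note that an error handler only runs when the decoder actually raises an error; as the question points out, Python 2 does not reject surrogates, so bytes such as '\xed\xa0\xbd' may never reach the handler there.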

Tags:

python

unicode

python-2.x

surrogate-pairs

lxyu
