
How to do surrogateescape in python2

Python 3 changed its Unicode handling so that the UTF-8 codec rejects surrogate code points, while Python 2 does not.

There's a question here, but it does not provide a solution for removing surrogate pairs in Python 2 or for doing a surrogate escape.

Python3 example:

>>> a = b'\xed\xa0\xbd\xe4\xbd\xa0\xe5\xa5\xbd'
>>> a.decode('utf-8', 'surrogateescape')
'\udced\udca0\udcbd你好'
>>> a.decode('utf-8', 'ignore')
'你好'

The bytes '\xed\xa0\xbd' here are not valid UTF-8, and I want to either ignore them or escape them.

Is it possible to do the same thing in python2?

asked Oct 29 '13 04:10 by lxyu


1 Answer

There is no built-in solution, but there is an implementation of the surrogateescape error handler in python-future: https://github.com/PythonCharmers/python-future

Add from future.utils.surrogateescape import register_surrogateescape to the imports, then call the function register_surrogateescape(). After that, you can use the errors='surrogateescape' error handler in encode and decode.

An example can be found here
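
The steps above can be sketched roughly as follows. This is a minimal sketch, not the library's official example: it assumes the future package is installed on Python 2 (for instance via pip install future), and it skips the registration on Python 3, where the handler is already built in. The names raw, text and restored are only illustrative. Note that the decoded text on Python 2 may differ from the Python 3 output shown in the question, since my understanding is that Python 2's UTF-8 codec is more permissive about surrogates, so the sketch only checks the round trip.

```python
# -*- coding: utf-8 -*-
import sys

if sys.version_info[0] == 2:
    # Python 2 has no built-in 'surrogateescape' handler, so register the
    # python-future backport. Assumption: the `future` package is installed.
    from future.utils.surrogateescape import register_surrogateescape
    register_surrogateescape()

# Bytes from the question: three problematic bytes followed by valid UTF-8.
raw = b'\xed\xa0\xbd\xe4\xbd\xa0\xe5\xa5\xbd'

# Bytes the codec cannot decode are escaped instead of raising
# UnicodeDecodeError.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler turns the escapes back into the
# original bytes, so nothing is lost.
restored = text.encode('utf-8', 'surrogateescape')

print(restored == raw)
```

If the goal is to drop the bad bytes instead of preserving them, the 'ignore' handler from the question's Python 3 example can be passed in place of 'surrogateescape'; it needs no registration.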

answered Oct 21 '22 12:10 by proski