Python 3 changed its Unicode handling so that the UTF-8 codec rejects surrogate code points, while Python 2 still accepts them.
There's a question here
But it does not provide a solution for removing surrogates in Python 2 or for doing a surrogate escape.
Python 3 example:
>>> a = b'\xed\xa0\xbd\xe4\xbd\xa0\xe5\xa5\xbd'
>>> a.decode('utf-8', 'surrogateescape')
'\udced\udca0\udcbd你好'
>>> a.decode('utf-8', 'ignore')
'你好'
The bytes '\xed\xa0\xbd' here are not valid UTF-8 (they encode a lone surrogate), and I want to either ignore them or escape them.
Is it possible to do the same thing in Python 2?
There is no built-in solution, but python-future includes an implementation of the surrogateescape error handler: https://github.com/PythonCharmers/python-future
Add from future.utils.surrogateescape import register_surrogateescape to the imports, then call register_surrogateescape(). After that you can use the errors='surrogateescape' error handler in encode and decode.
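The steps above might look roughly like this. It is only a minimal sketch, assuming python-future is installed on Python 2 (pip install future) and that the import path is the one quoted above. The sample bytes are my own: as far as I know, Python 2's UTF-8 codec accepts an encoded surrogate such as b'\xed\xa0\xbd' without ever calling the error handler, so the sketch uses b'\xff', which is invalid UTF-8 on both versions.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys

if sys.version_info[0] == 2:
    # Python 2: the handler must be registered before use
    # (assumes python-future is installed).
    from future.utils.surrogateescape import register_surrogateescape
    register_surrogateescape()
# Python 3: 'surrogateescape' is built in, so nothing needs registering.

# Illustrative input: "你好" followed by 0xFF, a byte that never occurs in UTF-8.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd\xff'

# The undecodable byte 0xFF is escaped as the lone surrogate U+DCFF.
text = raw.decode('utf-8', 'surrogateescape')

# Print code points rather than the string itself to keep the output ASCII-only.
print([hex(ord(ch)) for ch in text])
```

On Python 3 this prints ['0x4f60', '0x597d', '0xdcff']. I would expect the same on Python 2 with the handler registered, but I have not verified every code path there, encoding back to bytes in particular.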
An example can be found here
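For the "ignore" half of the question, surrogateescape does not help on Python 2 when the codec already accepted the surrogate. A possible alternative, offered as a sketch and not as anything python-future provides, is to decode first and then strip unpaired surrogates with a regular expression. The names LONE_SURROGATE and strip_lone_surrogates are hypothetical, made up for this illustration.

```python
# -*- coding: utf-8 -*-
import re

# Matches a high surrogate that is not followed by a low one, or a low
# surrogate that is not preceded by a high one. Properly paired surrogates
# (which narrow Python 2 builds use for characters above U+FFFF) are kept.
LONE_SURROGATE = re.compile(
    u'[\ud800-\udbff](?![\udc00-\udfff])'
    u'|(?<![\ud800-\udbff])[\udc00-\udfff]'
)


def strip_lone_surrogates(text):
    """Return text with unpaired surrogate code points removed."""
    return LONE_SURROGATE.sub(u'', text)


# Assumption: this is what Python 2 returns for
# '\xed\xa0\xbd\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf-8'),
# i.e. U+D83D followed by the two CJK characters.
decoded = u'\ud83d\u4f60\u597d'
cleaned = strip_lone_surrogates(decoded)
```

Here cleaned ends up as u'\u4f60\u597d', matching the Python 3 'ignore' result from the question. Note that the pattern deliberately leaves a high surrogate directly followed by a low surrogate untouched, so that astral characters survive on narrow Python 2 builds.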