
Decoding escaped unicode in Python 3 from a non-ascii string

I have been searching for hours for a way to fully reverse the result of a str.encode call like this:

"testäch基er".encode("cp1252", "backslashreplace")

The result is

b'test\xe4ch\\u57faer'

Now I want to convert it back with

b'test\xe4ch\\u57faer'.decode("cp1252")

and I get

'testäch\\u57faer'

So how do I get my 基 back? I get close by using decode("unicode-escape") instead (it would work for this example), but that codec assumes the bytes are encoded as ISO 8859-1, not cp1252, so any byte in the range 0x80 to 0x9F would be decoded to the wrong character.

Bachsau asked Mar 16 '23 11:03



2 Answers

Well...

>>> b'test\xe4ch\\u57faer'.decode('unicode-escape')
'testäch基er'

But backslashreplace->unicode-escape is not a consistent round trip. If you have backslashes in the original string, they won't get encoded by backslashreplace but they will get decoded by unicode-escape, and replaced with unexpected characters.

>>> '☃ \\u2603'.encode('cp1252', 'backslashreplace').decode('unicode-escape')
'☃ ☃'

There is no way to reliably reverse the encoding of a string that has been encoded with an errors fallback such as backslashreplace. That's why it's a fallback: if you could consistently encode and decode with it, it would have been a real encoding.
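For the narrower cp1252 problem from the question, one possible workaround is to decode the bytes with the real codec first and only then expand the escape sequences in the resulting string. The sketch below is only an illustration: the helper name `undo_backslashreplace` and the regular expression are hypothetical, and it assumes that backslashreplace only ever emits `\xNN`, `\uNNNN` and `\UNNNNNNNN`. It still cannot tell an escape written by the fallback apart from the same text in the original string, so the ambiguity described above remains.

```python
import re

# Hypothetical helper, not part of the standard library.
# Matches the three escape forms that backslashreplace writes when encoding.
_ESCAPE = re.compile(
    r"\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8}))"
)


def undo_backslashreplace(data: bytes, encoding: str = "cp1252") -> str:
    """Decode with the real codec, then expand backslash escapes in the text."""
    text = data.decode(encoding)  # bytes 0x80-0x9F are handled as cp1252 here

    def expand(match: re.Match) -> str:
        digits = match.group(1) or match.group(2) or match.group(3)
        code_point = int(digits, 16)
        if code_point > 0x10FFFF:
            # Not a valid code point, so leave the text untouched.
            return match.group(0)
        return chr(code_point)

    return _ESCAPE.sub(expand, text)


encoded = "testäch基er €".encode("cp1252", "backslashreplace")
restored = undo_backslashreplace(encoded)

# The ambiguity is still there: a literal "\u2603" in the input is expanded too.
ambiguous = undo_backslashreplace(
    "☃ \\u2603".encode("cp1252", "backslashreplace")
)
```

Here `restored` equals the original string, including the euro sign that unicode-escape would get wrong, while `ambiguous` no longer matches its input.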

bobince answered Mar 18 '23 00:03



I was still very new to Python when I asked this question. Now I understand that these fallback mechanisms are only meant for handling unexpected errors, not for saving and restoring data. If you really need a simple and reliable way to encode individual Unicode characters in ASCII, have a look at the quote and unquote functions from the urllib.parse module.
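A minimal sketch of that round trip, assuming the default UTF-8 encoding of quote and unquote is acceptable. The sample string and variable names are just for illustration. Because quote also escapes the percent sign itself, the result can be decoded unambiguously, which is what backslashreplace cannot offer.

```python
from urllib.parse import quote, unquote

# Sample text with non-ASCII characters, a literal backslash escape and a "%".
original = "testäch基er \\u57fa 100%"

# safe="" makes quote() escape "/" as well; non-ASCII characters are
# percent-encoded as their UTF-8 bytes.
encoded = quote(original, safe="")

# unquote() reverses the percent-encoding, again assuming UTF-8.
decoded = unquote(encoded)

print(encoded)
print(decoded == original)
```

The encoded value is a plain ASCII str, so it can be stored or transmitted wherever ASCII is required and restored later with unquote.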

Bachsau answered Mar 17 '23 23:03
