I have been searching for hours now to find a way to fully reverse the result of a str.encode-call like this:
"testäch基er".encode("cp1252", "backslashreplace")
The result is
b'test\xe4ch\\u57faer'
Now I want to convert it back with
b'test\xe4ch\\u57faer'.decode("cp1252")
and I get
'testäch\\u57faer'
So how do I get my 基 back? I'm nearly there by using decode("unicode-escape") instead (it would work for this example), but that assumes the bytes are encoded as ISO 8859-1, not cp1252, so any character whose cp1252 byte falls between 0x80 and 0x9F would come out wrong.
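To make that concern concrete, here is a minimal sketch. The sample string "€ 基" is my own choice for illustration; the euro sign is used because cp1252 stores it as byte 0x80, which lies in the problematic range.

```python
# The euro sign is byte 0x80 in cp1252, inside the 0x80-0x9F range.
data = "€ 基".encode("cp1252", "backslashreplace")

# unicode-escape maps each raw byte to the code point with the same number
# (as Latin-1 would), so 0x80 becomes the control character U+0080.
decoded = data.decode("unicode-escape")

print(data)
print(decoded == "€ 基")     # False: the euro sign is lost
print(decoded == "\x80 基")  # True: 基 survives, € does not
```

So the escape sequence is restored, but the euro sign silently turns into a different character.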
Well...
>>> b'test\xe4ch\\u57faer'.decode('unicode-escape')
'testäch基er'
But backslashreplace -> unicode-escape is not a consistent round trip. If you have backslashes in the original string, they won't get encoded by backslashreplace, but they will get decoded by unicode-escape and replaced with unexpected characters.
>>> '☃ \\u2603'.encode('cp1252', 'backslashreplace').decode('unicode-escape')
'☃ ☃'
There is no way to reliably reverse the encoding of a string that has been encoded with an errors fallback such as backslashreplace. That's why it's a fallback: if you could consistently encode and decode to it, it would have been a real encoding.
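A short sketch of why no decoder can undo this: two different input strings produce exactly the same bytes, so the information needed to tell them apart is already gone after encoding. The variable names here are only for illustration.

```python
snowman = "☃"        # one character, U+2603
literal = "\\u2603"  # six characters: backslash, u, 2, 6, 0, 3

enc_snowman = snowman.encode("cp1252", "backslashreplace")
enc_literal = literal.encode("cp1252", "backslashreplace")

# Both inputs collapse to the same byte string.
print(enc_snowman)
print(enc_literal)
print(enc_snowman == enc_literal)  # True
```

Whatever a decoder returns for those bytes, it will be wrong for one of the two inputs.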
I was still very new to Python when I asked this question. Now I understand that these fallback mechanisms are only meant for handling unexpected errors, not for saving and restoring data. If you really need a simple and reliable way to encode individual Unicode characters in ASCII, have a look at the quote and unquote functions from the urllib.parse module.
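For example, a minimal sketch with the string from the question, plus a second string of my own that contains a literal backslash and a percent sign. It assumes the defaults of quote and unquote, which percent-encode the UTF-8 bytes of each non-ASCII character.

```python
from urllib.parse import quote, unquote

original = "testäch基er"

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
encoded = quote(original)
print(encoded)
print(encoded.isascii())             # True: safe to store as plain ASCII
print(unquote(encoded) == original)  # True: the round trip is lossless

# Backslashes and literal percent signs are escaped too, so text that
# looks like an escape sequence also survives the round trip.
tricky = "☃ \\u2603 100%"
print(unquote(quote(tricky)) == tricky)  # True
```

Unlike backslashreplace, the escape character itself (%) is also escaped, which is what makes the encoding reversible.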