I've got a problem with strings that I get from one of my clients over xmlrpc. He sends me utf8 strings that are encoded twice :( so when I get them in python I have an unicode object that has to be decoded one more time, but obviously python doesn't allow that. I've noticed my client however I need to do quick workaround for now before he fixes it.
Raw string from tcp dump:
<string>Rafa\xc3\x85\xc2\x82</string>
this is converted into:
u'Rafa\xc5\x82'
The best we get is:
eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8")
This results in correct string which is:
u'Rafa\u0142'
this works however is ugly as hell and cannot be used in production code. If anyone knows how to fix this problem in more suitable way please write. Thanks, Chris
To decode a string encoded in UTF-8 format, we can use the decode() method specified on strings. This method accepts two arguments, encoding and error . encoding accepts the encoding of the string to be decoded, and error decides how to handle errors that arise during decoding.
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
There is no difference between "utf8" and "utf-8"; they are simply two names for UTF8, the most common Unicode encoding.
>>> s = u'Rafa\xc5\x82' >>> s.encode('raw_unicode_escape').decode('utf-8') u'Rafa\u0142' >>>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With