
Decoding double encoded utf8 in Python


I've got a problem with strings that I receive from one of my clients over XML-RPC. He sends me UTF-8 strings that have been encoded twice :( so when I get them in Python I have a unicode object that has to be decoded one more time, but obviously Python doesn't allow that. I've notified my client, but I need a quick workaround for now until he fixes it.

Raw string from tcp dump:

<string>Rafa\xc3\x85\xc2\x82</string> 

This is converted into:

u'Rafa\xc5\x82' 
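
For reference, the bytes in the dump are consistent with UTF-8 output being misread as Latin-1 and then encoded as UTF-8 a second time. The sketch below reproduces that under Python 3; the faulty sender-side step is an assumption (the client's actual code is not known), and the variable names are only illustrative.

```python
# Sketch (Python 3): one plausible way the double encoding could arise.
original = "Rafa\u0142"                  # the intended text, "Rafał"
once = original.encode("utf-8")          # b'Rafa\xc5\x82'

# Assumed faulty step: the UTF-8 bytes are treated as Latin-1 text...
misread = once.decode("latin-1")         # 'Rafa\xc5\x82' (two characters for ł)

# ...and that text is encoded as UTF-8 again before being sent.
twice = misread.encode("utf-8")          # b'Rafa\xc3\x85\xc2\x82', as in the dump

# The receiving side decodes the wire bytes once and is left with mojibake.
received = twice.decode("utf-8")         # 'Rafa\xc5\x82'
```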

The best I have come up with is:

eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8")  

This results in the correct string, which is:

u'Rafa\u0142'  

This works, but it is ugly as hell and cannot be used in production code. If anyone knows a more suitable way to fix this problem, please write. Thanks, Chris

Chris Ciesielski asked Jul 24 '09 12:07




1 Answer

>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
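
The same round trip can be wrapped in a small helper. This is a minimal sketch for Python 3, not part of the answer above: `fix_double_utf8` is a hypothetical name, and it uses `latin-1` instead of `raw_unicode_escape`, so strings containing characters above U+00FF are returned unchanged rather than being turned into `\uXXXX` escape text.

```python
def fix_double_utf8(text):
    """Undo one extra layer of UTF-8 encoding, if there appears to be one.

    Hypothetical helper: maps each code point below 256 back to a single
    byte, then decodes those bytes as UTF-8. If either step fails, the
    input was probably not double encoded and is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


fixed = fix_double_utf8("Rafa\xc5\x82")   # the string from the question
```

Falling back to the original text means the helper can still be applied once the client starts sending correctly encoded strings, although short inputs that happen to form valid UTF-8 byte sequences could still be misinterpreted.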
Ivan Baldin answered Nov 06 '22 11:11
