Double-decoding unicode in python

I am working against an application that seems keen on returning what I believe to be double-UTF-8-encoded strings.

I send the string u'XüYß' (that is, u'X\u00fcY\u00df') encoded using UTF-8, so what actually goes over the wire is X\xc3\xbcY\xc3\x9f.
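
To make the setup concrete, this is roughly what the encoding step looks like in a Python 2 shell (the variable name is just for illustration):

>>> original = u'X\u00fcY\u00df'    # u'XüYß'
>>> original.encode('utf-8')        # the bytes I actually send
'X\xc3\xbcY\xc3\x9f'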

The server should simply echo what I sent it, yet it returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (it should be X\xc3\xbcY\xc3\x9f). If I decode that with str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a unicode string containing the original string encoded using UTF-8.
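
As far as I can tell, those are exactly the bytes you get if the UTF-8 output is mistakenly treated as Latin-1 (or a similar single-byte encoding) and encoded to UTF-8 a second time - at least I can reproduce the server's output that way:

>>> sent = 'X\xc3\xbcY\xc3\x9f'
>>> sent.decode('latin-1').encode('utf-8')    # decode as the wrong charset, then encode again
'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'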

But Python won't let me decode a unicode string without re-encoding it first - which fails for some reason that escapes me:

>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...
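
My guess is that .decode() on a unicode object first runs it through the default (ASCII) codec to get bytes, and it is that implicit encode that blows up - at least the same error can be reproduced directly:

>>> ret.encode('ascii')
# UnicodeEncodeError: 'ascii' codec can't encode ...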

How do I persuade Python to re-decode the string? And is there any (practical) way of debugging what's actually in these strings, without passing them through all the implicit conversions that print uses?
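
The closest I have got so far is the sketch below, though I am not sure it is the right approach (the latin-1 step is my own assumption - it only works because Latin-1 maps code points 0-255 straight back to the same byte values, so it recovers the raw UTF-8 bytes; repr() shows the raw contents without any implicit conversion):

>>> print repr(ret)              # inspect without print's implicit conversion
u'X\xc3\xbcY\xc3\x9f'
>>> ret.encode('latin-1')        # Latin-1 maps U+0000..U+00FF back to the same bytes
'X\xc3\xbcY\xc3\x9f'
>>> ret.encode('latin-1').decode('utf-8')
u'X\xfcY\xdf'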

(And yes, I have reported this behaviour to the developers of the server side.)