I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode it.
I used urllib.unquote_plus(str), but it gives the wrong result.
Expected: çöasd+fjkls%asd
Actual: çöasd fjkls%asd
The two-byte UTF-8 characters (%C3%A7 and %C3%B6) are decoded incorrectly.
My Python version is 2.7 on a Linux distro.
What is the best way to get the expected result?
You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar.
Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
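Both cases come down to the same mistake: the UTF-8 bytes C3 A7 C3 B6 end up being treated as one character per byte. Here is a minimal sketch of the difference, assuming latin1 as the one-byte-per-character interpretation. The variable names are mine, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
import unicodedata

# The bytes that percent-decoding '%C3%A7%C3%B6' produces.
utf8_bytes = b'\xc3\xa7\xc3\xb6'

right = utf8_bytes.decode('utf-8')    # bytes read as UTF-8: 2 characters
wrong = utf8_bytes.decode('latin-1')  # one character per byte: 4 characters

right_names = [unicodedata.name(c) for c in right]
wrong_names = [unicodedata.name(c) for c in wrong]

# right_names: LATIN SMALL LETTER C WITH CEDILLA,
#              LATIN SMALL LETTER O WITH DIAERESIS
# wrong_names: LATIN CAPITAL LETTER A WITH TILDE, SECTION SIGN,
#              LATIN CAPITAL LETTER A WITH TILDE, PILCROW SIGN
```

Seeing four names where you expected two is the signature of UTF-8 bytes that were never decoded as UTF-8.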
If as you say you start off with a unicode object:
>>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
>>> print repr(s0)
u'%C3%A7%C3%B6asd+fjkls%25asd'
This is accidental nonsense. If you apply urllibX.unquote_YYYY() to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd'), which would cause the symptoms you showed when printed. You should convert your original unicode object to a str object immediately:
>>> s1 = s0.encode('ascii')
>>> print repr(s1)
'%C3%A7%C3%B6asd+fjkls%25asd'
then you should unquote it:
>>> import urllib2
>>> s2 = urllib2.unquote(s1)
>>> print repr(s2)
'\xc3\xa7\xc3\xb6asd+fjkls%asd'
Looking at the first 4 bytes of that, it's encoded in UTF-8. If you do print s2, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:
>>> s3 = s2.decode('utf8')
>>> print repr(s3)
u'\xe7\xf6asd+fjkls%asd'
and inspect it to see what we've actually got:
>>> import unicodedata
>>> for c in s3[:6]:
...     print repr(c), unicodedata.name(c)
...
u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
u'a' LATIN SMALL LETTER A
u's' LATIN SMALL LETTER S
u'd' LATIN SMALL LETTER D
u'+' PLUS SIGN
Looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; I'm doing this portably and just happen to be doing this in a Command Prompt on Windows.
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print s3
çöasd+fjkls%asd
Note: the unicode object was implicitly encoded by print using sys.stdout.encoding. Fortunately all the unicode characters in s3 are representable in that encoding (and cp1252 and latin1).
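The steps above can be collected into one helper, roughly as sketched below. The names decode_percent_utf8 and representable are hypothetical, not part of urllib. The Python 3 import fallback is my addition so the same snippet runs on both versions; on 2.7 only the first import is used. Like the session above, it uses plain unquote, so '+' is left alone.

```python
try:
    from urllib import unquote                            # Python 2
except ImportError:
    # Assumed fallback for Python 3: returns bytes, like Python 2's unquote
    # does when given a str.
    from urllib.parse import unquote_to_bytes as unquote


def decode_percent_utf8(text):
    """Percent-decode text and interpret the resulting bytes as UTF-8."""
    if not isinstance(text, bytes):
        # A unicode object holding only ASCII: turn it into a byte string first.
        text = text.encode('ascii')
    raw = unquote(text)          # byte string holding UTF-8 encoded bytes
    return raw.decode('utf-8')   # unicode object


def representable(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


s3 = decode_percent_utf8(u'%C3%A7%C3%B6asd+fjkls%25asd')
# s3 == u'\xe7\xf6asd+fjkls%asd'
```

Checking representable(s3, sys.stdout.encoding) before printing tells you whether the console can show the result at all; for this string it holds for cp850, cp1252 and latin1, but not for ascii.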