Here's my problem: I have a wrongly encoded variable that I want to fix. Long story short, I end up with:
myVar=u'\xc3\xa9'
which is wrong: it is the UTF-8 encoding of the character 'é' (\u00e9) stored as two Unicode code points, not the Unicode character itself.
None of the combinations of encode/decode I tried solves the problem. I looked at the bytearray object, but it requires an encoding, and obviously none of them fits.
Basically I need to reinterpret the byte array into the correct encoding. Any ideas on how to do that? Thanks.
What you should have done:
>>> b='\xc3\xa9'
>>> b
'\xc3\xa9'
>>> b.decode("UTF-8")
u'\xe9'
Since you didn't show the broken code that caused the problem, all we can do is make a complex problem more complex.
This appears to be what you're seeing.
>>> c
u'\xc3\xa9'
>>> c.decode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
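The decode fails because Python 2 first tries to turn the unicode object into a byte string using the ASCII codec. Here's a workaround that converts each code point back into a byte first.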
>>> [ chr(ord(x)) for x in c ]
['\xc3', '\xa9']
>>> ''.join(_)
'\xc3\xa9'
>>> _.decode("UTF-8")
u'\xe9'
Fix the code that produced the wrong stuff to begin with.
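As a guess at what went wrong upstream (the question doesn't show that code, so this is an assumption): a value like u'\xc3\xa9' is what you get when UTF-8 bytes are decoded with a one-byte-per-character codec such as latin-1. A minimal sketch with made-up variable names, written to run on both Python 2 and 3:

```python
# The UTF-8 bytes for U+00E9, e.g. as read from a file or socket.
raw = b'\xc3\xa9'

# Assumed cause: a one-byte-per-character codec such as latin-1
# turns each byte into its own code point.
broken = raw.decode('latin-1')
print(repr(broken))

# Fix at the source: decode with the codec the bytes were written in.
fixed = raw.decode('utf-8')
print(repr(fixed))
```

If that is what happened, the real fix is to decode as UTF-8 at the point where the bytes enter the program, so no repair is needed later.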
The hacky solution: pull out the code points with ord, then build characters (length-one strings) out of these with chr, then paste the lot back together and decode.
>>> u = u'\xc3\xa9'
>>> s = ''.join(chr(ord(c)) for c in u)
>>> unicode(s, encoding='utf-8')
u'\xe9'
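The chr(ord(c)) loop maps each code point below 256 to the byte with the same value, which is what the latin-1 codec does, so the same repair can be sketched as an encode/decode round trip. reinterpret_as_utf8 is a hypothetical helper name, not something from the standard library, and returning the input unchanged on failure is my own choice rather than anything from the answers above:

```python
def reinterpret_as_utf8(text):
    """Undo a latin-1 mis-decoding of UTF-8 data (hypothetical helper)."""
    try:
        # latin-1 maps code points 0-255 to the same byte values,
        # which is what ''.join(chr(ord(c)) for c in text) does by hand.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a code point above 255 or bytes that are not valid UTF-8:
        # the text is not mojibake of this kind, so leave it alone.
        return text


print(repr(reinterpret_as_utf8(u'\xc3\xa9')))
```

This only papers over the symptom; text that was already correct and happens to look like valid UTF-8 bytes would be altered too, so fixing the producer is still the safer route.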