I am receiving an XML feed of product information. The information is in English, but it is not encoded in utf-8 (smart quotes, copyright symbols, etc.). To process the information, I need to convert it into utf-8.
I have tried doing variations of:
u'%s' % data
codecs.open(..., 'utf-8')
unicode(data)
But for every one I've tried I get a UnicodeDecodeError (of various sorts).
How would I convert all this text into utf-8?
Update
Thanks for the help, here is what ended up working:
encoded_data = data.decode('ISO 8859-1').encode('utf-8').replace('Â','')
I'm not sure where the Â came from, but I saw those next to some copyright symbols.
In order to convert it to UTF-8, you need to know what encoding it's in. Based on your description, I'm guessing that it's in one of the Latin-1 variants, ISO 8859-1 or Windows-1252. If that's the case, then you could convert it to UTF-8 like so:
data = 'Copyright \xA9 2012' # \xA9 is the copyright symbol in Windows-1252
# Convert from Windows-1252 to UTF-8
encoded = data.decode('Windows-1252').encode('utf-8')
# Prints "Copyright © 2012"
print encoded
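The snippet above is Python 2. As a rough sketch of the same idea in Python 3, where raw bytes and text are separate types, something like the following might work. The helper name `to_utf8` and its `fallback` parameter are made up for illustration, and the assumption that non-UTF-8 input is Windows-1252 is only a guess about the feed. The last lines show one plausible source of the stray Â: a copyright symbol that was already UTF-8 and then got decoded as ISO 8859-1.

```python
def to_utf8(raw, fallback="windows-1252"):
    """Return raw bytes re-encoded as UTF-8 (hypothetical helper)."""
    try:
        # Bytes that are already valid UTF-8 are returned unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        return raw.decode(fallback).encode("utf-8")


# The copyright sign as a single Windows-1252 byte.
legacy = b"Copyright \xa9 2012"
converted = to_utf8(legacy)
print(converted.decode("utf-8"))

# Decoding UTF-8 bytes as ISO 8859-1 turns the two-byte copyright sign
# into two characters, the first of which is the stray A-circumflex.
mojibake = "\u00a9".encode("utf-8").decode("iso-8859-1")
print(repr(mojibake))
```

If that is what happened to the feed, trying UTF-8 first and only falling back to a legacy encoding avoids the need to strip Â afterwards, though a feed that mixes encodings inside a single document would need more careful handling.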