
Convert mass text to utf-8

Tags:

python

utf-8

I am receiving an XML feed of product information. The text is in English, but it is not encoded in UTF-8, and it contains non-ASCII characters (smart quotes, copyright symbols, etc.). To process the information, I need to convert it to UTF-8.

I have tried doing variations of:

u'%s' % data
codecs.open(..., 'utf-8')
unicode(data)

But for every one I've tried I get a UnicodeDecodeError (of various sorts).

How would I convert all this text to UTF-8?

Update

Thanks for the help, here is what ended up working:

encoded_data = data.decode('ISO 8859-1').encode('utf-8').replace('Â','')

I'm not sure where the Â came from, but I saw those next to some copyright symbols.
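
One possible source of the stray Â, offered as a guess rather than a diagnosis of this particular feed: if part of the data is already UTF-8, the copyright symbol is stored as the two bytes C2 A9, and decoding those bytes as ISO 8859-1 turns the C2 byte into Â. A short sketch of that effect, written for Python 3 (the variable names are only illustrative):

```python
# The copyright symbol (U+00A9) takes two bytes in UTF-8: C2 A9.
copyright_utf8 = "\u00a9".encode("utf-8")

# ISO 8859-1 maps every byte to exactly one character, so the two
# bytes come back as two characters instead of one.
misread = copyright_utf8.decode("iso-8859-1")

# U+00C2 is "Â" and U+00A9 is "©".
print(misread == "\u00c2\u00a9")  # True
```

If that is what is happening, decoding the affected text as UTF-8 rather than ISO 8859-1 would avoid the extra character without a `replace` call.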

David542 asked Nov 29 '22 16:11


1 Answer

In order to convert it to UTF-8, you need to know what encoding it's in. Based on your description, I'm guessing that it's in one of the Latin-1 variants, ISO 8859-1 or Windows-1252. If that's the case, then you could convert it to UTF-8 like so:

data = 'Copyright \xA9 2012'  # \xA9 is the copyright symbol in Windows-1252

# Convert from Windows-1252 to UTF-8
encoded = data.decode('Windows-1252').encode('utf-8')

# Prints "Copyright © 2012"
print encoded
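
The snippet above is Python 2, where a plain string literal is a byte string. A minimal sketch of the same conversion in Python 3, assuming the feed arrives as `bytes` and really is Windows-1252 (the helper name `to_utf8` and its default argument are hypothetical, not part of any library):

```python
def to_utf8(raw: bytes, source_encoding: str = "windows-1252") -> bytes:
    """Re-encode raw bytes from source_encoding to UTF-8."""
    # Decode the bytes into text using the encoding the feed is assumed to use.
    text = raw.decode(source_encoding)
    # Encode that text back to bytes, this time as UTF-8.
    return text.encode("utf-8")


data = b"Copyright \xa9 2012"  # \xa9 is the copyright symbol in Windows-1252
encoded = to_utf8(data)
print(encoded)  # b'Copyright \xc2\xa9 2012'
```

If the guess about the source encoding is wrong, `decode` may either raise a UnicodeDecodeError or silently produce the wrong characters, so it is worth checking a few known symbols in the output.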
Adam Rosenfield answered Dec 04 '22 23:12