I have some Python code that's receiving a string with bad unicode in it. When I try to ignore the bad characters, Python still chokes (version 2.6.1). Here's how to reproduce it:
s = 'ad\xc2-ven\xc2-ture'
s.encode('utf8', 'ignore')
It throws
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
What am I doing wrong?
In Python 2.x, a string is converted to a unicode instance with str.decode():
>>> s.decode("ascii", "ignore")
u'ad-ven-ture'
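To see why the original call raises a decode error even though it asks to encode, it can help to spell out the hidden step. This is a minimal sketch, assuming the interpreter's default encoding is ASCII (the usual Python 2 setting); the variable name error is only for illustration. The b'' prefix keeps it runnable on both Python 2.6+ and Python 3.

```python
# A byte string; the b'' prefix means the same thing on Python 2.6+ and 3.
s = b'ad\xc2-ven\xc2-ture'

# On Python 2, s.encode('utf8', 'ignore') behaves roughly like the two-step
# call below: a strict ASCII decode first, then the requested encode.
# The 'ignore' handler only reaches the second step, so the first one raises.
try:
    s.decode('ascii').encode('utf8', 'ignore')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Shows the same message as in the question.
print(error)
```

Passing the error handler to decode() instead, as above, is what actually skips the bad bytes.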
You are confusing "unicode" and "utf-8". Your string s is not unicode; it's a bytestring in a particular encoding (but not UTF-8, more likely iso-8859-1 or such). Going from a bytestring to unicode is done by decoding the data, not encoding. Going from unicode to bytestring is encoding. Perhaps you meant to make s a unicode string:
>>> s = u'ad\xc2-ven\xc2-ture'
>>> s.encode('utf8', 'ignore')
'ad\xc3\x82-ven\xc3\x82-ture'
Or perhaps you want to treat the bytestring as UTF-8 but ignore invalid sequences, in which case you would decode the bytestring with 'ignore' as the error handler:
>>> s = 'ad\xc2-ven\xc2-ture'
>>> u = s.decode('utf-8', 'ignore')
>>> u
u'adventure'
>>> u.encode('utf-8')
'adventure'
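For comparison, here is a rough sketch of how the choice of codec and error handler changes what you get back from the same bytes. The variable names are made up for illustration, and treating the data as iso-8859-1 is only a guess about where it came from. One caveat: as far as I know, newer interpreters than the 2.6 used above consume only the invalid byte when decoding as UTF-8, so the hyphens may survive there; the sketch therefore does not state an expected value for that case.

```python
# b'' and u'' literals keep this runnable on Python 2.6+ and Python 3.3+.
raw = b'ad\xc2-ven\xc2-ture'

# 'ignore' silently drops each byte the codec cannot decode.
ignored = raw.decode('ascii', 'ignore')      # u'ad-ven-ture'

# 'replace' substitutes U+FFFD, so the damage stays visible downstream.
replaced = raw.decode('ascii', 'replace')    # u'ad\ufffd-ven\ufffd-ture'

# Assumption: the bytes are really iso-8859-1. Every byte then maps to a
# character, nothing is lost, and the text can be re-encoded as UTF-8.
latin1 = raw.decode('iso-8859-1')            # u'ad\xc2-ven\xc2-ture'
as_utf8 = latin1.encode('utf-8')             # 'ad\xc3\x82-ven\xc3\x82-ture'

# UTF-8 with 'ignore': whether the hyphens are kept depends on the
# interpreter version, so no expected value is given here.
utf8_ignored = raw.decode('utf-8', 'ignore')
```

If you know the real source encoding, decoding with it is usually preferable to 'ignore', since nothing gets thrown away.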