
Python failing to encode bad unicode to ascii

Tags: python, unicode

I have some Python code that's receiving a string with bad unicode in it. When I try to ignore the bad characters, Python still chokes (version 2.6.1). Here's how to reproduce it:

s = 'ad\xc2-ven\xc2-ture'
s.encode('utf8', 'ignore')

It raises:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

What am I doing wrong?

Eric Palakovich Carr asked May 25 '11 13:05



2 Answers

In Python 2.x, you convert a byte string (str) to a unicode instance with str.decode(), not str.encode():

 >>> s.decode("ascii", "ignore")
 u'ad-ven-ture'
Sven Marnach answered Sep 28 '22 03:09



You are confusing "unicode" and "utf-8". Your string s is not unicode; it's a bytestring in a particular encoding (not UTF-8; more likely ISO-8859-1 or something similar). Going from a bytestring to unicode is done by decoding the data, not encoding it. Going from unicode to a bytestring is encoding. Perhaps you meant to make s a unicode string:

>>> s = u'ad\xc2-ven\xc2-ture'
>>> s.encode('utf8', 'ignore')
'ad\xc3\x82-ven\xc3\x82-ture'
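If the bytes really are ISO-8859-1, which is only the guess made above and not something the question confirms, the same result can be reached by decoding first and then encoding. A minimal sketch in Python 3 (an assumption, with bytes standing in for Python 2's str; the variable names are illustrative):

```python
# Python 3 sketch: decode with the guessed source encoding, then re-encode.
raw = b'ad\xc2-ven\xc2-ture'

# ISO-8859-1 maps every byte value to a code point, so this decode
# cannot fail; 0xc2 becomes U+00C2.
text = raw.decode('iso-8859-1')

# Encoding the text as UTF-8 turns U+00C2 into the two bytes 0xc3 0x82.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)  # b'ad\xc3\x82-ven\xc3\x82-ture'
```

Because ISO-8859-1 accepts any byte, this approach never raises, but it also never tells you when the guess about the encoding was wrong.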

Or perhaps you want to treat the bytestring as UTF-8 but ignore invalid sequences, in which case you would decode the bytestring with 'ignore' as the error handler:

>>> s = 'ad\xc2-ven\xc2-ture'
>>> u = s.decode('utf-8', 'ignore')
>>> u
u'adventure'
>>> u.encode('utf-8')
'adventure'
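For comparison, here is a sketch of how the common error handlers treat the same bytes when decoding as UTF-8. It assumes Python 3, where bytes replaces Python 2's str; the variable names are illustrative. On current Python 3 interpreters the 'ignore' handler drops only the stray 0xc2 bytes and keeps the hyphens, so the result can differ from the session above, which reflects the interpreter the answer was written against.

```python
# Python 3 sketch: the same invalid UTF-8 input under three error handlers.
raw = b'ad\xc2-ven\xc2-ture'

ignored = raw.decode('utf-8', 'ignore')    # drop undecodable bytes
replaced = raw.decode('utf-8', 'replace')  # substitute U+FFFD for them

try:
    raw.decode('utf-8')                    # default handler is 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored))
print(repr(replaced))
print(strict_failed)
```

'replace' is often the safer choice when you need to see where data was lost, since 'ignore' removes the bad bytes silently.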
Thomas Wouters answered Sep 28 '22 03:09
