
Python ASCII and Unicode decode error

I got this very, very frustrating error when inserting a certain string into my database. It said something like:

Python cannot decode byte characters, expecting unicode

After a lot of searching, I saw that I could overcome this error by encoding my string into Unicode. I tried to do this by decoding the string first and then encoding it in UTF-8 format, like so:

string = string.encode("utf8")

And I get the following error:

'ascii' codec can't decode byte 0xe3 in position 6: ordinal not in range(128)

I have been dying with this error! How do I fix it?

asked by Amitash on Jul 18 '12


4 Answers

You need to take a disciplined approach. Pragmatic Unicode, or How Do I Stop The Pain? has everything you need.

If you get that error on that line of code, then the problem is that string is a byte string, and Python 2 is implicitly trying to decode it to Unicode for you using the default ASCII codec. But it isn't pure ASCII. You need to know what the encoding actually is, and decode it properly.
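For instance, here is a minimal Python 2 sketch of what is going on, assuming the bytes really are UTF-8 (the sample byte values are illustrative):

# Calling .encode() on a byte string makes Python 2 first decode it
# implicitly with the default ASCII codec; that implicit decode is
# what raises the UnicodeDecodeError.
s = 'caf\xc3\xa9'            # a str (bytes): the UTF-8 encoding of u'café'
try:
    s.encode('utf8')         # implicit ASCII decode fails on byte 0xc3
except UnicodeDecodeError as e:
    print e

u = s.decode('utf-8')        # decode with the encoding the bytes actually use
print repr(u)                # u'caf\xe9'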

answered by Ned Batchelder


The encode method should be used on unicode objects to convert them to a str object with a given encoding. The decode method should be used on str objects of a given encoding to convert them to unicode objects.

I suppose that your database stores strings in UTF-8. So when you get strings from the database, convert them to unicode objects by doing str.decode('utf-8'). Then use only unicode objects in your Python program (literals are defined with u'unicode string'). And just before storing them in your database, convert them to str objects with uni.encode('utf-8').
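As a sketch of that pattern (the function names and the row access here are hypothetical, just to show where the conversions belong):

def read_name(row):
    raw = row['name']            # str (bytes) coming out of the database
    return raw.decode('utf-8')   # convert to a unicode object at the boundary

def prepare_name(name):
    # name is a unicode object, e.g. u'caf\xe9'
    return name.encode('utf-8')  # convert back to str just before storing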

answered by Sylvain Defresne


EDIT: As you can see from the downvotes, this is NOT THE BEST WAY TO DO IT. Excellent, highly recommended answers appear elsewhere on this page, so if you are looking for a good solution, please use one of those. This is a hackish solution that will not be kind to you later on.

I feel your pain; I've had a lot of problems with the same error. The simplest way I solved it (and this might not be the best way, and it depends on your application) was to convert things to unicode, ignoring errors. Here's an example from the Unicode HOWTO in the Python v2.7.3 documentation:

>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                    ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'

While this might not be the most correct method, it is one that has worked for me.

EDIT:

A couple of people in the comments have mentioned that this is a bad idea, even though the asker accepted the answer. It is NOT a great idea; it will screw things up if you are dealing with European and accented characters. However, it is something you can use if it is NOT production-level code, if it is a personal project you are working on and you need a quick fix to get things rolling. You will eventually need to fix it properly, using the methods described in the other answers.

answered by Karthik Rangarajan


The byte 0xE3 corresponds to the codepoint U+00E3, an 'a' with a tilde (ã), in Unicode. Your original string is most likely already encoded, probably as UTF-8, so you can't decode it using the default ASCII character set.
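To illustrate, in a Python 2 shell (traceback abbreviated):

>>> '\xe3'.decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0:
                    ordinal not in range(128)
>>> '\xe3'.decode('latin-1')   # in Latin-1, byte 0xE3 maps directly to U+00E3
u'\xe3'
>>> u'\xe3'.encode('utf-8')    # the same character takes two bytes in UTF-8
'\xc3\xa3'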

answered by Silas Ray