
Python ASCII and Unicode decode error

I got this very, very frustrating error when inserting a certain string into my database. It said something like:

Python cannot decode byte characters, expecting unicode

After a lot of searching, I saw that I could overcome this error by encoding my string into Unicode. I tried to do this by decoding the string first and then encoding it in UTF-8 format, like so:

string = string.encode("utf8")

And I get the following error:

'ascii' codec can't decode byte 0xe3 in position 6: ordinal not in range(128)

I have been dying with this error! How do I fix it?

asked by Amitash on Jul 18 '12


4 Answers

You need to take a disciplined approach. Pragmatic Unicode, or How Do I Stop The Pain? has everything you need.

If you get that error on that line of code, then the problem is that string is a byte string, and Python 2 is implicitly trying to decode it to Unicode for you using the default ASCII codec. But it isn't pure ASCII. You need to know what the encoding actually is, and decode it properly.
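For instance, here is a minimal Python 2 sketch of what is going on, assuming the bytes really are UTF-8 (the sample byte values are illustrative):

# Calling .encode() on a byte string makes Python 2 first decode it
# implicitly with the default ASCII codec; that implicit decode is
# what raises the UnicodeDecodeError.
s = 'caf\xc3\xa9'            # a str (bytes): the UTF-8 encoding of u'café'
try:
    s.encode('utf8')         # implicit ASCII decode fails on byte 0xc3
except UnicodeDecodeError as e:
    print e

u = s.decode('utf-8')        # decode with the encoding the bytes actually use
print repr(u)                # u'caf\xe9'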

answered by Ned Batchelder


The encode method should be used on unicode objects to convert them to a str object with a given encoding. The decode method should be used on str objects of a given encoding to convert them to unicode objects.

I suppose that your database stores strings in UTF-8. So when you get strings from the database, convert them to unicode objects by doing str.decode('utf-8'). Then use only unicode objects in your Python program (literals are defined with u'unicode string'). And just before storing them in your database, convert them to str objects with uni.encode('utf-8').
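As a sketch of that pattern (the function names and the row access here are hypothetical, just to show where the conversions belong):

def read_name(row):
    raw = row['name']            # str (bytes) coming out of the database
    return raw.decode('utf-8')   # convert to a unicode object at the boundary

def prepare_name(name):
    # name is a unicode object, e.g. u'caf\xe9'
    return name.encode('utf-8')  # convert back to str just before storing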

answered by Sylvain Defresne


EDIT: As you can see from the downvotes, this is NOT THE BEST WAY TO DO IT. Excellent, highly recommended answers appear elsewhere on this page, so if you are looking for a good solution, please use one of those. This is a hackish solution that will not be kind to you later on.

I feel your pain; I've had a lot of problems with the same error. The simplest way I solved it (and this might not be the best way, and it depends on your application) was to convert things to unicode, ignoring errors. Here's an example from the Unicode HOWTO in the Python v2.7.3 documentation:

>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                    ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'

While this might not be the most correct method, it is one that has worked for me.

EDIT:

A couple of people in the comments have mentioned that this is a bad idea, even though the asker accepted the answer. It is NOT a great idea; it will screw things up if you are dealing with European and accented characters. However, it is something you can use if it is NOT production-level code, if it is a personal project you are working on and you need a quick fix to get things rolling. You will eventually need to fix it properly, using the methods described in the other answers.

answered by Karthik Rangarajan


The byte 0xE3 corresponds to the codepoint U+00E3, an 'a' with a tilde (ã), in Unicode. Your original string is most likely already encoded, probably as UTF-8, so you can't decode it using the default ASCII character set.
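To illustrate, in a Python 2 shell (traceback abbreviated):

>>> '\xe3'.decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0:
                    ordinal not in range(128)
>>> '\xe3'.decode('latin-1')   # in Latin-1, byte 0xE3 maps directly to U+00E3
u'\xe3'
>>> u'\xe3'.encode('utf-8')    # the same character takes two bytes in UTF-8
'\xc3\xa3'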

answered by Silas Ray