I got a very frustrating error when inserting a certain string into my database. It said something like:
"Python cannot decode byte characters, expecting unicode"
After a lot of searching, I saw that I could overcome this error by encoding my string into Unicode. I tried to do this by decoding the string first and then encoding it as UTF-8, like this:
string = string.encode("utf8")
And I get the following error:
'ascii' codec can't decode byte 0xe3 in position 6: ordinal not in range(128)
I have been dying with this error! How do I fix it?
You need to take a disciplined approach. Pragmatic Unicode, or How Do I Stop The Pain? has everything you need.
If you get that error on that line of code, then the problem is that string is a byte string, and Python 2 is implicitly trying to decode it to Unicode for you. But it isn't pure ASCII. You need to know what the encoding is, and decode it properly.
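To make the implicit step visible, here is a minimal sketch. The sample bytes are an assumption (UTF-8 data invented for illustration), and the explicit decode('ascii') call stands in for what Python 2 does behind the scenes when encode is called on a byte string. It only uses b'' and u'' literals, so it should behave the same on Python 2.7 and Python 3.

```python
# Assumed sample data: the UTF-8 bytes for u'caf\xe9' (not from the question).
raw = b'caf\xc3\xa9'

# In Python 2, raw.encode('utf8') first does the equivalent of
# raw.decode('ascii'), and that hidden step is what fails.
try:
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding the bytes are actually in succeeds.
text = raw.decode('utf-8')
```

The error text captured in error_message names the 'ascii' codec, just like the message in the question, even though the original code never mentioned ASCII.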
The encode method should be used on unicode objects to convert them to a str object with a given encoding. The decode method should be used on str objects of a given encoding to convert them to unicode objects.
I suppose that your database stores strings in UTF-8. So when you get strings from the database, convert them to unicode objects by doing str.decode('utf-8'). Then only use unicode objects in your Python program (literals are defined with u'unicode string'). And just before storing them in your database, convert them to str objects with uni.encode('utf-8').
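The decode-on-input, encode-on-output pattern described above could be sketched roughly like this. The helper names from_db and to_db and the sample value are hypothetical, and UTF-8 is only the assumed storage encoding; a real database driver may already do this conversion for you.

```python
def from_db(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: turn bytes read from storage into text."""
    return raw_bytes.decode(encoding)


def to_db(text, encoding='utf-8'):
    """Hypothetical helper: turn text into bytes just before storing it."""
    return text.encode(encoding)


# Assumed sample value as it might come back from the database (UTF-8 bytes).
stored = b'S\xc3\xa3o Paulo'

city = from_db(stored)             # text from here on
greeting = u'Hello from ' + city   # only text is mixed with text
outgoing = to_db(greeting)         # bytes again, only at the boundary
```

Keeping the conversions at the edges means the rest of the program never mixes byte strings with unicode objects, which is where the implicit ASCII decode is triggered.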
EDIT: As you can see from the downvotes, this is NOT THE BEST WAY TO DO IT. An excellent, and a highly recommended answer is immediately after this, so if you are looking for a good solution, please use that. This is a hackish solution that will not be kind to you at a later point of time.
I feel your pain, I've had a lot of problems with the same error. The simplest way I solved it (and this might not be the best way, and it depends on your application) was to convert things to unicode, and ignore errors. Here's an example from Unicode HOWTO - Python v2.7.3 documentation
>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                    ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'
While this might not be the most expedient method, this is a method that has worked for me.
EDIT:
A couple of people in the comments have mentioned that this is a bad idea, even though the asker accepted the answer. It is NOT a great idea: it will screw things up if you are dealing with European and other accented characters, because every byte that cannot be decoded is silently dropped or replaced. However, this is something you can use if it is NOT production-level code, if it is a personal project you are working on, and you need a quick fix to get things rolling. You will eventually need to fix it with the right methods, which are mentioned in the answers below.
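To show what "screw things up" means in practice, here is a small sketch comparing the error handlers with a proper decode. The sample bytes are an assumption (UTF-8 data chosen for illustration), and bytes.decode is used instead of unicode() so that the snippet should run on both Python 2.7 and Python 3.

```python
# Assumed sample data: UTF-8 bytes containing one accented character.
raw = b'S\xc3\xa3o Paulo'

# 'ignore' silently drops every byte the ASCII codec cannot decode.
ignored = raw.decode('ascii', 'ignore')

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode('ascii', 'replace')

# Decoding with the real encoding keeps the accented character.
proper = raw.decode('utf-8')
```

Here ignored ends up as u'So Paulo' and replaced as u'S\ufffd\ufffdo Paulo', so the accented letter is lost either way, while proper keeps it as u'S\xe3o Paulo'.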
The code point U+00E3 is an 'a' with a tilde in Unicode. Your original string is most likely already encoded, probably as UTF-8, so it can't be decoded using the default ASCII codec.
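As a hedged illustration of why the actual encoding has to be known, the sketch below decodes the byte 0xE3 under different assumptions. The byte sequences are invented examples, not data from the question, and the snippet should run on both Python 2.7 and Python 3.

```python
# In Latin-1, the single byte 0xE3 is the code point U+00E3 (a with tilde).
as_latin1 = b'\xe3'.decode('latin-1')

# In UTF-8, that same character is stored as two bytes, not as 0xE3.
utf8_bytes = u'\xe3'.encode('utf-8')

# In UTF-8, a 0xE3 byte starts a three-byte sequence instead,
# for example the bytes for U+3042 (a Japanese hiragana letter).
as_utf8 = b'\xe3\x81\x82'.decode('utf-8')

# A lone 0xE3 byte is not valid UTF-8 at all.
try:
    b'\xe3'.decode('utf-8')
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False
```

So the byte value on its own does not say which encoding the data is in; that has to come from wherever the string originated.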