Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UnicodeDecodeError Loading with sqlalchemy

I am querying a MySQL database with sqlalchemy and getting the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 498-499: unexpected end of data

A column in the table was defined as Unicode(500) so this error suggests to me that there is an entry that was truncated because it was longer than 500 characters. Is there a way to handle this error and still load the entry? Is there a way to find the errant entry and delete it other than trying to load every entry one by one (or in batches) until I get the error?

like image 910
Michael Avatar asked Aug 09 '16 16:08

Michael


2 Answers

In short, you should change:

Unicode(500)

to:

Unicode(500, unicode_errors='ignore', convert_unicode='force')

(Python 2 code follows, but the principles hold in python 3; only some of the output will differ.)

What's going on is that when you decode a bytestring, it complains if the bytestring can't be decoded, with the error you saw.

>>> u = u'ABCDEFGH\N{TRADE MARK SIGN}'
>>> u
u'ABCDEFGH\u2122'
>>> print(u)
ABCDEFGH™
>>> s = u.encode('utf-8')
>>> s
'ABCDEFGH\xe2\x84\xa2'
>>> truncated = s[:-1]
>>> truncated
'ABCDEFGH\xe2\x84'        
>>> truncated.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cliffdyer/.virtualenvs/edx-platform/lib/python2.7/encodings/utf_8.py", 
line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-9: unexpected 
end of data

Python provides different optional modes of handling decode errors, though. Raising an exception is the default, but you can also truncate the text or convert the malformed part of the string to the official unicode replacement character.

>>> trunc.decode('utf-8', errors='replace')
u'ABCDEFGH\ufffd'
>>> trunc.decode('utf-8', errors='ignore')
u'ABCDEFGH'

This is exactly what's happening within the column handling.

Looking at the Unicode and String classes in sqlalchemy/sql/sqltypes.py, it looks like there is a unicode_errors argument that you can pass to the constructor which passes its value through to the encoder's errors argument. There is also a note that you will need to set convert_unicode='force' to make it work.

Thus Unicode(500, unicode_errors='ignore', convert_unicode='force') should solve your problem, if you're okay with truncating the ends of your data.

If you have some control over the database, you should be able to prevent this issue in the future by defining your database to use the utf8mb4 character set. (Don't just use utf8, or it will fail on four byte utf8 characters, including most emojis). Then you will be guaranteed to have valid utf-8 stored in and returned from your database.

like image 74
jcdyer Avatar answered Oct 12 '22 13:10

jcdyer


In short, your MySQL setup is incorrect in that it truncates UTF-8 characters in mid-sequence. I would check twice that MySQL actually expects the character encoding of UTF-8 within the sessions and in the tables themselves.


I would suggest switching to PostgreSQL (seriously) to avoid this kind of problem: not only does PostgreSQL understand UTF-8 properly in default configurations, but also it would not ever truncate a string to fit into the value, choosing to raise an error instead:

psql (9.5.3, server 9.5.3)
Type "help" for help.

testdb=> create table foo(bar varchar(4));
CREATE TABLE
testdb=> insert into foo values ('aaaaa');
ERROR:  value too long for type character varying(4)

This is also not unlike the Zen of Python:

Explicit is better than implicit.

and

Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.