Update: The real problem is that MySQL utf8 does not support four-byte UTF-8 characters.
There are several questions on this topic, but none of them seems to match mine exactly, except perhaps this one, where the accepted answer does not work for me.
I am coding in Python with the MySQLdb module, and I want to put some text into a MySQL database. The database is configured for UTF-8, but the text occasionally contains four-byte UTF-8 characters.
The Python code for the database modification looks like this:
import MySQLdb

# Connect with the connection character set utf8 and Unicode results
connection = MySQLdb.connect(
    'localhost',
    'root',
    '',
    'mydatabase',
    charset='utf8',
    use_unicode=True)
cursor = connection.cursor()
cursor.execute(
    'update mytable set entryContent=%s where entryName=%s',
    (entryContent, entryName))
connection.commit()
And it currently produces this warning:
./myapp.py:233: Warning: Invalid utf8 character string: 'F09286'
(entry, word))
./myapp.py:233: Warning: Incorrect string value: '\xF0\x92\x86\xB7\xF0\x92...' for column 'entry' at row 1
(entryname, entrycontent))
When I look at what actually got into the database with the mysql command-line client, I see the content truncated at the very first occurrence of a four-byte UTF-8 character.
I don't care about preserving the four-byte UTF-8 characters, so all I want to do is replace each of them with some other valid UTF-8 character, so I can put the text into the database.
You will need to set your table encoding to utf8mb4 to support four-byte UTF-8 characters, and the connection charset has to be utf8mb4 as well (charset='utf8mb4' in MySQLdb.connect): https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
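As a rough sketch of what that change could look like from Python: the helper below only builds the SQL statements, it does not run them. The helper name utf8mb4_statements and the utf8mb4_unicode_ci collation are assumptions for illustration; mydatabase and mytable come from the question.

```python
def utf8mb4_statements(database, table, collation='utf8mb4_unicode_ci'):
    """Build (not execute) SQL that switches a database and table to utf8mb4.

    Hypothetical helper: identifiers are only backtick-quoted, so pass
    trusted names, never user input.
    """
    return [
        'ALTER DATABASE `{0}` CHARACTER SET = utf8mb4 COLLATE = {1}'.format(
            database, collation),
        'ALTER TABLE `{0}` CONVERT TO CHARACTER SET utf8mb4 COLLATE {1}'.format(
            table, collation),
    ]


statements = utf8mb4_statements('mydatabase', 'mytable')
for statement in statements:
    print(statement)

# Assumed usage with a connection opened with charset='utf8mb4':
#     for statement in statements:
#         cursor.execute(statement)
```

Converting a large table rewrites every row, so it is probably worth trying on a copy first.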
Also, the MySQL driver supports Unicode strings, so you should pass Unicode objects and keep your code free of encoding specifics. For example:
cursor.execute(u'update mytable set entryContent=%s where entryName=%s',
               (entryContent.decode("utf-8"), entryName.decode("utf-8")))
Ideally, entryContent and entryName will have been decoded to Unicode earlier in your code, when you first receive them, e.g. when opening a file or receiving from the network.
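A minimal sketch of decoding at the boundary, assuming the input arrives as UTF-8 bytes. The helper name to_text is hypothetical, and the example is written for Python 3, where bytes and str are distinct types (in Python 2 the same idea applies to str and unicode).

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it only if it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes as they might arrive from a file or a socket.
raw_entry = b'l\xc4\x81man'

# Decode once, where the data enters the program.
entry_content = to_text(raw_entry)

# Values that are already text pass through unchanged.
already_text = to_text(entry_content)
```

With this in place, the rest of the code only ever handles text, and the driver takes care of encoding it for the connection.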
It turns out the problem is not that I am feeding non-UTF-8 characters to MySQL, but that I am feeding it four-byte UTF-8 characters when its utf8 charset supports only UTF-8 characters of three bytes or fewer (according to this documentation).
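The bytes quoted in the warning can be decoded by hand to check this. This is a small Python 3 sketch; the variable names are only illustrative.

```python
# The first four bytes quoted in the "Incorrect string value" warning.
raw = b'\xF0\x92\x86\xB7'

char = raw.decode('utf-8')   # valid UTF-8: decodes to exactly one character
code_point = ord(char)       # 0x121B7, which lies above U+FFFF

# Any code point above U+FFFF takes four bytes in UTF-8, which is more
# than MySQL's three-byte utf8 charset can store.
needs_four_bytes = code_point > 0xFFFF
print(hex(code_point), len(raw), needs_four_bytes)
```

So the text is well-formed UTF-8; it is just outside the range that the three-byte charset accepts.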
This solution retains all the supported UTF-8 characters, and converts the unsupported UTF-8 characters to '?':
>>> print ''.join([c if len(c.encode('utf-8')) < 4 else '?' for c in u'Cognates include Hittite 𒆷𒀀𒈠𒀭 (lāman)'])
Cognates include Hittite ???? (lāman)
I can put this string into MySQL without the above warnings (and undesirable truncation).
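Another way to get the same effect, sketched here for Python 3, is a regular expression that matches every code point above U+FFFF, since those are exactly the characters that need four bytes in UTF-8. The names NON_BMP and replace_four_byte_chars are made up for this example.

```python
import re

# Code points U+10000..U+10FFFF are the ones encoded as four UTF-8 bytes.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def replace_four_byte_chars(text, replacement='?'):
    """Replace every character outside the Basic Multilingual Plane."""
    return NON_BMP.sub(replacement, text)


# Two four-byte cuneiform signs plus a two-byte character (U+0101).
sample = 'Hittite \U000121B7\U00012000 (l\u0101man)'
cleaned = replace_four_byte_chars(sample)

# Every remaining character now fits in three UTF-8 bytes or fewer.
fits = all(len(c.encode('utf-8')) < 4 for c in cleaned)
```

On a Python 2 narrow build, characters above U+FFFF are stored as surrogate pairs, so this pattern would likely need adjusting there.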