Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Warning raised by inserting 4-byte unicode to mysql

Look at the following:

/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string 
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))

The string '\xF0\x9F\x91\x8A, actually is a 4-byte unicode: u'\U0001f62a'. The mysql's character-set is utf-8 but inserting 4-byte unicode it will truncate the inserted string. I googled for such a problem and found that mysql under 5.5.3 don't support 4-byte unicode, and unfortunately mine is 5.5.224. I don't want to upgrade the mysql server, so I just want to filter the 4-byte unicode in python, I tried to use regular expression but failed. So, any help?

like image 900
Kinka Avatar asked May 29 '12 11:05

Kinka


2 Answers

If MySQL cannot handle UTF-8 codes of 4 bytes or more then you'll have to filter out all unicode characters over codepoint \U00010000; UTF-8 encodes codepoints below that threshold in 3 bytes or fewer.

You could use a regular expression for that:

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '

Alternatively, you could use the .translate() function with a mapping table that only contains None values:

>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '

However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.

This all presumes you are using a UCS-4 compiled python. If your python was compiled with UCS-2 support then you can only use codepoints up to '\U0000ffff' in regular expressions and you'll never run into this problem in the first place.

I note that as of MySQL 5.5.3 the newly-added utf8mb4 codec does supports the full Unicode range.

like image 148
Martijn Pieters Avatar answered Sep 17 '22 08:09

Martijn Pieters


I think you should use utf8mb4 collation instead of utf8 and run

SET NAMES UTF8MB4

after connection with DB (link, link, link)

like image 45
Dmitry Avatar answered Sep 21 '22 08:09

Dmitry