Look at the following:
/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))
The string '\xF0\x9F\x91\x8A' is actually the UTF-8 encoding of a single 4-byte Unicode character: u'\U0001f44a'. MySQL's character set is utf8, but when a 4-byte character is inserted, the string gets truncated.
I googled the problem and found that MySQL versions before 5.5.3 don't support 4-byte Unicode characters, and unfortunately mine is 5.5.224.
I don't want to upgrade the MySQL server, so I just want to filter out the 4-byte characters in Python. I tried to use a regular expression but failed.
So, any help?
If MySQL cannot handle 4-byte UTF-8 sequences, then you'll have to filter out all Unicode characters at or above codepoint \U00010000; UTF-8 encodes codepoints below that threshold in 3 bytes or fewer.
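The threshold can be checked directly. This is only a minimal sketch; the helper name utf8_len is made up for illustration and is not part of any library:

```python
def utf8_len(char):
    """Return how many bytes the given text needs when encoded as UTF-8."""
    return len(char.encode('utf-8'))

# Last code point of the Basic Multilingual Plane: still fits in 3 bytes.
print(utf8_len(u'\uffff'))      # 3
# First code point above it: needs 4 bytes.
print(utf8_len(u'\U00010000'))  # 4
# An emoji such as the sleepy face also needs 4 bytes.
print(utf8_len(u'\U0001f62a'))  # 4
```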
You could use a regular expression for that:
>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
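Applied to the pipeline from the question, the filtering could look roughly like this. It is a sketch under assumptions: the field names are taken from the execute() call in the question, the item is assumed to behave like a dict, and strip_highpoints and clean_item are hypothetical helper names:

```python
import re

# Same pattern as above; assumes a wide (UCS-4) build or Python 3.
HIGHPOINTS = re.compile(u'[\U00010000-\U0010ffff]')

def strip_highpoints(text):
    """Remove every character that UTF-8 would encode in 4 bytes."""
    return HIGHPOINTS.sub(u'', text)

def clean_item(item, fields=('topic', 'url', 'content')):
    """Return a copy of a dict-like item with the given text fields filtered."""
    cleaned = dict(item)
    for name in fields:
        cleaned[name] = strip_highpoints(cleaned[name])
    return cleaned
```

The cleaned values can then be passed to cursor.execute() in place of the raw item fields.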
Alternatively, you could use the .translate() method with a mapping table that only contains None values:
>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '
However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.
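If the memory cost is the main concern, .translate() also accepts any object that implements __getitem__, so the table does not have to be built up front. The class below is a hypothetical sketch of that idea (the name DropHighpoints is invented here), again assuming a wide build or Python 3; whether it is faster than the regular expression has not been measured:

```python
class DropHighpoints(object):
    """Mapping for .translate() that deletes code points >= 0x10000.

    Nothing is precomputed: each lookup is answered on the fly.
    """
    def __getitem__(self, codepoint):
        if codepoint >= 0x10000:
            # None tells translate() to delete the character.
            return None
        # A LookupError tells translate() to keep the character unchanged.
        raise LookupError(codepoint)

example = u'Some example text with a sleepy face: \U0001f62a'
print(example.translate(DropHighpoints()))
```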
This all presumes you are using a UCS-4 compiled Python. If your Python was compiled with UCS-2 support, you can only use codepoints up to '\U0000ffff' in regular expressions. Characters above that are stored as surrogate pairs on such a build, so the pattern above will not work there as written.
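A pattern that adapts to the build could be sketched as follows. It assumes that on a narrow (UCS-2) build each character above \U0000ffff appears as a high surrogate followed by a low surrogate; strip_astral is a hypothetical name:

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide (UCS-4) build or Python 3.3+: match the code points directly.
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
else:
    # Narrow (UCS-2) build: match a high surrogate followed by a low surrogate.
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

def strip_astral(text):
    """Remove characters outside the Basic Multilingual Plane."""
    return highpoints.sub(u'', text)
```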
I note that as of MySQL 5.5.3 the newly added utf8mb4 character set does support the full Unicode range.
I think you should use the utf8mb4 character set instead of utf8 and run
SET NAMES UTF8MB4
right after connecting to the DB.
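In Python this might be wired up as below. It is only a sketch: it assumes a server that supports utf8mb4 (5.5.3 or later) and a DB-API style cursor with an execute() method, and both function names and the table name passed in are hypothetical:

```python
def switch_connection_to_utf8mb4(cursor):
    """Ask the server to use utf8mb4 for this connection."""
    cursor.execute("SET NAMES utf8mb4")

def convert_table_statement(table):
    """Build the statement that converts an existing table to utf8mb4."""
    # Backticks inside the name are doubled so the identifier stays quoted.
    quoted = table.replace("`", "``")
    return "ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8mb4" % quoted
```

The first function would be called once after connecting; the ALTER TABLE statement only needs to run once per table whose columns still use the 3-byte utf8 character set.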