How can I filter Emoji characters from my input so I can save in MySQL <5.5?

Tags: python, mysql, character-encoding, utf-8, django

I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:

/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
  return self.cursor.execute(query, args)

I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not just yet (also from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.

My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a character that fits in three bytes or fewer of UTF-8, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!

FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.

Thanks for reading.

asked Dec 05 '12 18:12

BigglesZX

2 Answers

So it turns out this has been answered a few times, I just hadn't quite got the right Google-fu to find the existing questions.

Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
Warning raised by inserting 4-byte unicode to mysql

Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):

import re
try:
    # UCS-4
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)

The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.

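For those unfamiliar with UCS, like me: UCS-2 and UCS-4 describe how a given Python 2 build stores unicode strings internally. A UCS-2 ("narrow") build uses 16-bit units and represents code points above U+FFFF as surrogate pairs, while a UCS-4 ("wide") build stores every code point directly. The first pattern only compiles on a wide build, hence the fallback that matches surrogate pairs.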
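As a rough illustration of the difference, here is a minimal sketch for checking which kind of build is running. The helper name `build_width` is made up for this example; `sys.maxunicode` is the standard attribute mentioned in the question. Python 3.3 and later always behave like a wide build.

```python
import sys

def build_width():
    """Return 'wide' if one string element can hold any code point, else 'narrow'."""
    # sys.maxunicode is 0xFFFF on narrow (UCS-2) builds and 0x10FFFF on wide ones.
    return 'wide' if sys.maxunicode > 0xFFFF else 'narrow'

# U+1F62D is the character behind the bytes \xF0\x9F\x98\xAD in the warning.
crying_face = u'\U0001F62D'

# On a wide build the character is a single element; on a narrow build it is
# stored as a surrogate pair, so len() reports 2.
print(build_width(), len(crying_face))
```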

With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.
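One way to sanity-check this without a database is to confirm that nothing in the cleaned string needs more than three bytes of UTF-8, which is the per-character limit of MySQL's `utf8` charset. The sketch below wraps the pattern above in a helper; `replace_astral` and the sample tweet are hypothetical names and data, not part of my app.

```python
import re

try:
    # Wide (UCS-4) builds and Python 3 accept this range directly.
    HIGHPOINTS = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Narrow (UCS-2) builds: match surrogate pairs instead.
    HIGHPOINTS = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

def replace_astral(text, replacement=u'\u25FD'):
    """Replace every code point above U+FFFF in `text` with `replacement`."""
    return HIGHPOINTS.sub(replacement, text)

tweet = u'\U0001F62D I love this'   # made-up sample input
cleaned = replace_astral(tweet)

# Every remaining character should fit in at most three UTF-8 bytes.
fits_in_utf8mb3 = all(len(ch.encode('utf-8')) <= 3 for ch in cleaned)
print(repr(cleaned), fits_in_utf8mb3)
```

The replacement is a parameter, so a call such as `replace_astral(tweet, u'?')` swaps in a plain ASCII character instead.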

Hope this helps anyone else in the same situation!

answered Sep 18 '22 16:09

BigglesZX

I tried the solution by BigglesZX and it wasn't working for the heart emoji (❤). After reading the Wikipedia article on Emoji, I saw that the regular expression does not cover all emoji, while also covering other ranges of Unicode that are not emoji.
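The reason the heart slips through can be seen from its encoding: it is U+2764, which lies below U+10000 and so is never matched by a pattern that only targets higher code points. A small sketch (the helper `utf8_len` is a name invented here) compares it with the crying face from the question:

```python
def utf8_len(ch):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode('utf-8'))

heart = u'\u2764'            # HEAVY BLACK HEART
crying_face = u'\U0001F62D'  # LOUDLY CRYING FACE

print(utf8_len(heart), utf8_len(crying_face))
```

Since the heart takes three bytes, it should be storable in MySQL's three-byte `utf8` as it is; replacing it matters only if the goal is to strip emoji as such, which is what the ranges below aim at.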

The following code creates five regular expressions that cover the five emoji blocks in the standard:

emoji_symbols_pictograms = re.compile(u'[\U0001f300-\U0001f5fF]')
emoji_emoticons = re.compile(u'[\U0001f600-\U0001f64F]')
emoji_transport_maps = re.compile(u'[\U0001f680-\U0001f6FF]')
emoji_symbols = re.compile(u'[\U00002600-\U000026FF]')
emoji_dingbats = re.compile(u'[\U00002700-\U000027BF]')

Since some of those blocks are contiguous, they can be merged into three ranges (UCS-4):

emoji_block0 = re.compile(u'[\U00002600-\U000027BF]')
emoji_block1 = re.compile(u'[\U0001f300-\U0001f64F]')
emoji_block2 = re.compile(u'[\U0001f680-\U0001f6FF]')

Their equivalents in UCS-2 are:

emoji_block0 = re.compile(u'[\u2600-\u27BF]')
emoji_block1 = re.compile(u'[\uD83C][\uDF00-\uDFFF]')
emoji_block1b = re.compile(u'[\uD83D][\uDC00-\uDE4F]')
emoji_block2 = re.compile(u'[\uD83D][\uDE80-\uDEFF]')
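The surrogate values above can be derived with the standard UTF-16 arithmetic. The following is a minimal sketch of that calculation; the function name `to_surrogate_pair` is an assumption for illustration. It also shows why the second range has to be split in two: the high surrogate changes from D83C to D83D between U+1F3FF and U+1F400.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('only code points above U+FFFF have surrogate pairs')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# Boundaries of the ranges used in the patterns above.
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = to_surrogate_pair(cp)
    print('U+%X -> \\u%04X \\u%04X' % (cp, high, low))
```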

So finally we can define a single regular expression with all the cases together:

import re
try:
    # UCS-4
    highpoints = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
except re.error:
    # UCS-2
    highpoints = re.compile(u'([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
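For illustration, here is a small usage sketch of the same three ranges, assuming a wide build or Python 3 so they can be folded into one character class. The names `EMOJI` and `replace_emoji` and the sample string are made up for this example.

```python
import re

# The three ranges from above in a single character class.
# Assumes a wide (UCS-4) build or Python 3; not valid on a narrow build.
EMOJI = re.compile(u'[\u2600-\u27BF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF]')

def replace_emoji(text, replacement=u'\u25FD'):
    """Replace characters in the listed emoji ranges with `replacement`."""
    return EMOJI.sub(replacement, text)

# Heart (U+2764), crying face (U+1F62D) and a musical G clef (U+1D11E).
sample = u'I \u2764 Python \U0001F62D \U0001D11E'
result = replace_emoji(sample)
print(repr(result))
```

Note the trade-off: the G clef is not in any of the emoji ranges, so it is left in place, yet it still needs four bytes of UTF-8. If the aim is to store text in MySQL before 5.5, applying the broader pattern from the first answer as well is probably still necessary.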

answered Sep 19 '22 16:09

David Mabodo
