Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I configure sqlalchemy to correctly store emoji?

With sqlalchemy 0.9.7, I am trying to store emoji into MySQL 5.5 with utf8mb4 enabled. However, for some reason sqlalchemy is killing my emoji chars and I cannot figure out why. I can see the emoji char in the object before attempting to save the data via sqlalchemy. After the save, an error is thrown and emoji is rendered as ????.

The error is the following. Note it is surrounded by debug messages that output type(post.message) and post.message.

--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:274]:
<type 'unicode'>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:275]:
šŸ˜ƒ
--------------------------------------------------------------------------------
/mnt/hgfs/crw/dev/hyper/env/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py:436: Warning: Incorrect string value: '\xF0\x9F\x98\x83' for column 'message' at row 1
  cursor.execute(statement, parameters)
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:277]:
<type 'unicode'>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:278]:
????
--------------------------------------------------------------------------------

My code looks like:

Base = declarative_base()

post = Table('post', Base.metadata,
        Column("id", Integer, primary_key=True),
        Column("message", UnicodeText),
        Column("created_at", DateTime),
        Column("updated_at", DateTime),
    )

mapper(Post, post)

Is there any other setup I need to do to make this work?

Update: If I do the following:

Base = declarative_base()

post = Table('post', Base.metadata,
        Column("id", Integer, primary_key=True),
        Column("message", UnicodeText),
        Column("created_at", DateTime),
        Column("updated_at", DateTime),
        mysql_engine='InnoDB',
        mysql_charset='utf8mb4'
    )

mapper(Post, post)

I get the following:

--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:274]:
<type 'unicode'>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:275]:
šŸ˜ƒ
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:277]:
<type 'unicode'>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in __init__ [/mnt/hgfs/crw/dev/hyper/hyper/blueprint/chat/__init__.py:278]:
????
--------------------------------------------------------------------------------

No error, but it is still obliterating the emoji. What is going on?!

Update 2:

Changing the database URI being called from:

mysql+mysqldb://user:pass@localhost/datab?charset=utf8

To

mysql+mysqldb://user:pass@localhost/datab

Fixed the problem. sqlalchemy threw an error when I tried to use charset=utf8mb4&use_unicode=0. But is that wise? According to http://docs.sqlalchemy.org/en/rel_0_9/dialects/mysql.html#module-sqlalchemy.dialects.mysql.mysqldb, it is not! Would love any feedback on this solution.

like image 484
Craig Wright Avatar asked Jul 29 '14 07:07

Craig Wright


2 Answers

Below's how I got emojis etc. working. I'm using Python 3.5 / Flask / Flask-SQLAlchemy.

Note: This fix assumes you're early enough in development that you're OK with recreating your database by running db.drop_all() and db.create_all().

  1. As recommended in this guide, open up a database console and run ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

  2. As recommended in this answer, add ?charset=utf8mb4 to the end of your SQLALCHEMY_DATABASE_URI string.

    • Before: mysql+mysqlconnector://{username}:{password}@{hostname}/{databasename}
    • After: mysql+mysqlconnector://{username}:{password}@{hostname}/{databasename}?charset=utf8mb4
  3. Now just rerun db.drop_all() and db.create_all().

like image 134
Nathan Wailes Avatar answered Sep 25 '22 08:09

Nathan Wailes


Here is the solution that worked for me. set_unicode gets called the first time a connection enters the connection pool and only gets called once. SET NAMES basically fixes the entire pipeline from your code to the database storage to ensure the correct UTF-8 charset (utf8mb4) is used. As long as utf8mb4 is ensured end to end, emojis should be stored / rendered fine as they are just regular unicode characters outside of BMP (Basic Multilingual Plane), which unfortunately MySQL decided that it was a good idea to only implement 3-byte Unicode implementation for BMP as the unicode charset.

See SQLAlchemy events reference for other relevant event hooks that you can tap into to do connection by connection configuration tweaks.

import logging
from sqlalchemy import event

logger = logging.getLogger(__name__)     

@event.listens_for(Pool, "connect")
def set_unicode(dbapi_conn, conn_record):
    cursor = dbapi_conn.cursor()
    try:
        cursor.execute("SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'")
    except Exception as e:
        logger.debug(e)

Update: regarding extra options in connection string, I prefer to be explicit and specify charset=utf8mb4 because my environment's default charset is utf8(which is the culprit of why emoji cannot be encoded correctly), but never specify use_unicode=0 unless you are running on python 2.x and performance is a bottleneck as you pointed out in the link.

like image 45
Devy Avatar answered Sep 25 '22 08:09

Devy