Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to explicitly declare charset=utf8 for Airflow connections

This sequence:

from airflow.hooks.mysql_hook import MySqlHook
conn = MySqlHook(mysql_conn_id='conn_id')
engine = conn.get_sqlalchemy_engine()
df.to_sql('test_table', engine, if_exists='append', index=False)

produces the following:

UnicodeEncodeError: 'latin-1' codec can't encode character '\ufffd' in position 57: ordinal not in range(256)

This sequence works great:

from sqlalchemy import create_engine
engine = create_engine("mysql://{0}:{1}@{2}/capone?charset=utf8".format(user, pwd, host))
df.to_sql('test_table', engine, if_exists='append', index=False)

The key is in explicitly declaring the charset. I have attempted to do this in airflow as follows with {"charset": "utf8"}:

enter image description here

But this has not fixed the error. I've restarted my dev environment since making the changes and the admin panel lets me know that the edit was successful. How can I work with Airflow connections to my charsets as utf8?

like image 762
aaron Avatar asked Sep 06 '17 21:09

aaron


People also ask

How to set Default Charset to UTF-8 in Apache using htaccess?

Here are the steps to set default charset to UTF-8 encoding in Apache using htaccess. 1. Open htaccess file Before proceeding, please enable mod_rewrite (.htaccess) in your Apache web server. Open .htaccess file, typically located at /var/www/html/.htaccess

What is the difference between ASCII and UTF-8?

The most commonly used encodings are UTF-8 and UTF-16: A character in UTF8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backwards compatible with ASCII.

How do I enable UTF-8 encoding in htaccess?

Open .htaccess file, typically located at /var/www/html/.htaccess 2. Set Default Encoding to UTF-8 Add the following to .htaccess file. This will be applicable for all files served by Apache web server. If you want to apply UTF-8 encoding only for specific file types, use the filesMatch directive.

What is UTF-8 and UTF-16?

UTF-8 is the preferred encoding for e-mail and web pages 16-bit Unicode Transformation Format is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. UTF-16 is used in major operating systems and environments, like Microsoft Windows, Java and .NET.


2 Answers

I realised that this is a bug in Airflow and I have reported it here: https://issues.apache.org/jira/browse/AIRFLOW-4824

For now I have a workaround with the following code:

def get_uri(hook):
    conn = hook.get_connection(getattr(hook, hook.conn_name_attr))
    login = ''
    if conn.login:
        login = '{conn.login}:{conn.password}@'.format(conn=conn)
    host = conn.host
    if conn.port is not None:
        host += ':{port}'.format(port=conn.port)
    charset = ''
    if conn.extra_dejson.get('charset', False):
        chrs = conn.extra_dejson["charset"]
        if chrs.lower() == 'utf8' or chrs.lower() == 'utf-8':
            charset = '?charset=utf8'
    return '{conn.conn_type}://{login}{host}/{conn.schema}{charset}'.format(
        conn=conn, login=login, host=host, charset=charset)

And then use it as follows:

url = get_uri(sql_hook)
from sqlalchemy import create_engine
engine = create_engine(url)

The real solution will be to send a pull request to the project overriding get_uri in mysql_hook.py.

like image 182
lolcode Avatar answered Oct 11 '22 05:10

lolcode


from sqlalchemy import create_engine
from airflow.hooks.mysql_hook import MySqlHook

conn = MySqlHook(mysql_conn_id='conn_id')
uri = conn.get_uri()
engine = create_engine(uri+'?charset=utf8')
df.to_sql('test_table', engine, if_exists='append', index=False)

I fixed the problem by the code above.

like image 41
Lvtao Wang Avatar answered Oct 11 '22 06:10

Lvtao Wang