
Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC

I would like to send a large pandas.DataFrame to a remote server running MS SQL. The way I do it now is by converting a data_frame object to a list of tuples and then sending it off with pyODBC's executemany() function. It goes something like this:

import pyodbc as pdb

list_of_tuples = convert_df(data_frame)

connection = pdb.connect(cnxn_str)

cursor = connection.cursor()
cursor.fast_executemany = True
cursor.executemany(sql_statement, list_of_tuples)
connection.commit()

cursor.close()
connection.close()
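Here convert_df is just a small helper that turns the DataFrame into the list of row tuples that executemany() consumes; roughly along these lines (one possible implementation, the real one may differ):

def convert_df(data_frame):
    # One possible implementation: yield plain tuples of column values per row.
    return list(data_frame.itertuples(index=False, name=None))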

I then started to wonder if things could be sped up (or at least made more readable) by using the data_frame.to_sql() method. I came up with the following solution:

import sqlalchemy as sa

engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % cnxn_str)
data_frame.to_sql(table_name, engine, index=False)

Now the code is more readable, but the upload is at least 150 times slower...

Is there a way to flip the fast_executemany flag when using SQLAlchemy?

I am using pandas-0.20.3, pyODBC-4.0.21 and sqlalchemy-1.1.13.

asked Dec 28 '17 by J.K.


2 Answers

EDIT (2019-03-08): Gord Thompson commented below with good news from SQLAlchemy's update logs: since SQLAlchemy 1.3.0, released 2019-03-04, the mssql+pyodbc dialect supports engine = create_engine(sqlalchemy_url, fast_executemany=True). This means it is no longer necessary to define a function and use @event.listens_for(engine, 'before_cursor_execute'); the function below can be removed and only the flag needs to be set in the create_engine statement, while still retaining the speed-up.
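In other words, on SQLAlchemy 1.3.0 or newer the whole setup reduces to something like the sketch below (the connection-string details are placeholders, and data_frame / table_name are the objects from the question):

from urllib.parse import quote_plus
from sqlalchemy import create_engine

# Placeholder ODBC connection string - substitute your own server details.
conn = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=IP_ADDRESS;DATABASE=DataLake;UID=USER;PWD=PASS"
engine = create_engine(
    "mssql+pyodbc:///?odbc_connect={}".format(quote_plus(conn)),
    fast_executemany=True,  # supported since SQLAlchemy 1.3.0
)

data_frame.to_sql(table_name, engine, index=False)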

Original Post:

Just made an account to post this. I wanted to comment beneath the above thread as it's a follow-up on the already provided answer. The solution above worked for me with the Version 17 SQL driver, writing to a Microsoft SQL storage from an Ubuntu-based install.

The complete code I used to speed things up significantly (we're talking a >100x speed-up) is below. This is a turn-key snippet, provided that you alter the connection string with your relevant details. To the poster above: thank you very much for the solution, as I had already been looking for this for quite some time.

import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus

conn = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=IP_ADDRESS;DATABASE=DataLake;UID=USER;PWD=PASS"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)


@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True


table_name = 'fast_executemany_test'
df = pd.DataFrame(np.random.random((10**4, 100)))

s = time.time()
df.to_sql(table_name, engine, if_exists='replace', chunksize=None)
print(time.time() - s)

Based on the comments below, I wanted to take some time to explain some limitations of the pandas to_sql implementation and the way the query is handled. As far as I know, there are two things that might cause the MemoryError to be raised:

1) Assuming you're writing to remote SQL storage: when you write a large pandas DataFrame with the to_sql method, it converts the entire DataFrame into a list of values. This transformation takes up far more RAM than the original DataFrame does (on top of it, as the old DataFrame still remains in RAM). This list is handed to the final executemany call of your ODBC connector, and I think the ODBC connector has some trouble handling such large queries. A way to solve this is to pass the to_sql method a chunksize argument (10**5 seems to be close to optimal, giving about 600 Mbit/s (!) write speeds on a 2-CPU, 7 GB RAM MSSQL storage application from Azure - can't recommend Azure, by the way). So the first limitation, the query size, can be circumvented by providing a chunksize argument, as shown in the sketch below. However, this still won't let you write a DataFrame of size 10**7 or larger (at least not on the VM I am working with, which has ~55 GB RAM), which is issue nr. 2.
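For illustration, the chunksize workaround is just an extra keyword argument on the same to_sql call as above (10**5 is the value that worked well for me, not a universal constant):

# Write in batches of 10**5 rows so no single executemany call becomes too large.
df.to_sql(table_name, engine, if_exists='replace', chunksize=10**5)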

2) This can be circumvented by breaking up the DataFrame with np.split (into DataFrame chunks of 10**6 rows each), which can then be written away iteratively. I will try to make a pull request once I have a solution ready for the to_sql method in the core of pandas itself, so you won't have to do this pre-splitting every time. Anyhow, I ended up writing a function similar (not turn-key) to the following:

import pandas as pd
import numpy as np

def write_df_to_sql(df, **kwargs):
    # Note: df.shape is a property, not a method, and np.split requires the row
    # count to divide evenly into the number of sections (np.array_split does not).
    chunks = np.split(df, df.shape[0] // 10**6)
    for chunk in chunks:
        chunk.to_sql(**kwargs)
    return True
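A hypothetical call would then pass the usual to_sql parameters through **kwargs, e.g. (table name and engine are placeholders of mine, not fixed values):

# Example usage; if_exists='append' matters here, since 'replace' would drop
# and recreate the table again for every chunk.
write_df_to_sql(df, name='my_table', con=engine, if_exists='append', index=False)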

A more complete example of the above snippet can be viewed here: https://gitlab.com/timelord/timelord/blob/master/timelord/utils/connector.py

It's a class I wrote that incorporates the patch and eases some of the overhead that comes with setting up connections with SQL. I still have to write some documentation. I was also planning on contributing the patch to pandas itself, but haven't yet found a nice way to do so.

I hope this helps.

answered Sep 18 '22 (7 revs, 3 users)


After contacting the developers of SQLAlchemy, a way to solve this problem has emerged. Many thanks to them for the great work!

One has to use a cursor execution event and check if the executemany flag has been raised. If that is indeed the case, switch the fast_executemany option on. For example:

from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

More information on execution events can be found in the SQLAlchemy documentation.


UPDATE: Support for fast_executemany of pyodbc was added in SQLAlchemy 1.3.0, so this hack is no longer necessary.

answered Sep 18 '22 by J.K.