Improve pandas' to_sql() performance with SQL Server

I come to you because I cannot fix a performance issue with the pandas.DataFrame.to_sql() method.

I've made the connection between my script and my database, and I can send queries, but it's too slow for me.

I would like to find a way to improve the performance of my script. Maybe someone will find a solution?

Here is my code:

  engine = sqlalchemy.create_engine(con['sql']['connexion_string'])
  conn = engine.connect()
  metadata = sqlalchemy.MetaData()
  try:
    if con['sql']['strategy'] == 'NEW':
      query = sqlalchemy.Table(con['sql']['table'], metadata).delete()
      conn.execute(query)
      Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                           index=False, chunksize=1000, method='multi')
    elif con['sql']['strategy'] == 'APPEND':
      Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                           index=False, chunksize=1000, method='multi')
    else:
      pass
  except Exception as e:
    print(type(e))

It works but is too slow when I remove the chunksize and method parameters: that is when it takes almost 3 minutes for 30 thousand rows. When I put these parameters back, I get a sqlalchemy.exc.ProgrammingError...

Thanks for your help!

Neal Poidras asked May 02 '26

1 Answer

For mssql+pyodbc you will get the best performance from to_sql if you

  1. use Microsoft's ODBC Driver for SQL Server, and
  2. enable fast_executemany=True in your create_engine call.

For example, this code runs in just over 3 seconds on my network:

from time import perf_counter
import pandas as pd
import sqlalchemy as sa

ngn_local = sa.create_engine("mssql+pyodbc://mssqlLocal64")
ngn_remote = sa.create_engine(
    (
        "mssql+pyodbc://sa:[email protected]/mydb"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    ),
    fast_executemany=True,
)

df = pd.read_sql_query(
    "SELECT * FROM MillionRows WHERE ID <= 30000", ngn_local
)

t0 = perf_counter()
df.to_sql("pd_test", ngn_remote, index=False, if_exists="replace")
print(f"{perf_counter() - t0} seconds")

whereas with fast_executemany=False (which is the default) the same process takes 143 seconds (2.4 minutes).
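
As an aside, the ProgrammingError the question hits with method='multi' is most likely SQL Server's limit of 2100 parameters per statement: a multi-values INSERT consumes one parameter per cell, so chunksize times the column count must stay below that limit. A minimal sketch of the arithmetic (the helper name safe_chunksize is mine, not part of pandas):

```python
def safe_chunksize(num_columns: int, param_limit: int = 2100) -> int:
    """Largest chunksize usable with to_sql(method='multi') on SQL Server.

    Each row in a multi-values INSERT uses one bound parameter per column,
    and SQL Server rejects statements with param_limit (2100) or more
    parameters, so we keep chunksize * num_columns strictly below it.
    """
    return max(1, (param_limit - 1) // num_columns)

# e.g. a DataFrame with 10 columns can safely use chunksize=209:
print(safe_chunksize(10))  # 209
```

That said, with fast_executemany=True you should simply drop method='multi' and let pandas use a plain executemany, as in the example above; the two options work against each other.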

Gord Thompson answered May 03 '26


