Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fast_executemany alternative for psycopg2

I have a Redshift server, which I initiate via psycopg2 (note that ODBC is not supported on the company server so I cant use pyodbc).

Currently it is taken over 10 minutes for 30-35k rows via pd.to_sql(), which writes from dataframe into the Redshift DB. So as a work-around I download DF as csv, push file to S3, and then use copy to write into the DB.

The fast_executemany solution as per Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC would have been perfect- however this is not supported in psycopg2. I also found d6tstack as per https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb but pd_to_psql doesn't work for Redshift, only Postgresql (can't copy... from stdin)

Any alternatives I can use for my case?

This is my code:

import sqlalchemy as sa

DATABASE = ""
USER = ""
PASSWORD = ""
HOST = "...us-east-1.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "public" 

server = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER,PASSWORD,HOST,str(PORT),DATABASE)
engine = sa.create_engine(server)
conn = engine.raw_connection()

with conn.cursor() as cur:
    cur.execute('truncate table_name')

df.to_sql('table_name', engine, index=False, if_exists='append') 
like image 321
user8834780 Avatar asked Dec 06 '18 16:12

user8834780


1 Answers

If you are unable to use COPY from S3 and must rely on DML, you could try passing use_batch_mode=True to create_engine():

engine = create_engine('theurl', use_batch_mode=True)

From this machine a simple insert of 500 rows to a Redshift cluster showed reasonable improvement with batch mode enabled:

In [31]: df = pd.DataFrame({'batchno': range(500)})

In [32]: %time df.to_sql('batch', engine, index=False, if_exists='append')
CPU times: user 87.8 ms, sys: 57.6 ms, total: 145 ms
Wall time: 1min 6s

In [33]: %time df.to_sql('batch', bm_engine, index=False, if_exists='append')
CPU times: user 10.3 ms, sys: 4.66 ms, total: 15 ms
Wall time: 9.96 s

Note that Pandas versions 0.23.0, and 0.24.0 and up will not benefit from using batch mode, since they use multi values insert instead of executemany, if the underlying DBMS supports it. Using multi values insert should provide somewhat similar improvements on throughput, as fewer queries are issued.

like image 118
Ilja Everilä Avatar answered Nov 02 '22 05:11

Ilja Everilä