I have a Redshift server, which I connect to via psycopg2 (note that ODBC is not supported on the company server, so I can't use pyodbc). Currently pd.to_sql(), which writes from a dataframe into the Redshift DB, takes over 10 minutes for 30-35k rows. As a workaround I download the dataframe as a CSV, push the file to S3, and then use COPY to write into the DB.
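For reference, that workaround looks roughly like this (the bucket, key, IAM role and connection details are placeholders, and it assumes boto3 with credentials that can write to the bucket):

import io
import boto3
import psycopg2

# Placeholder values for illustration only
bucket = "my-bucket"
key = "staging/table_name.csv"
iam_role = "arn:aws:iam::123456789012:role/RedshiftCopyRole"

# Dump the dataframe to an in-memory CSV and upload it to S3
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

# COPY from S3 into the target table
conn = psycopg2.connect(host="...us-east-1.redshift.amazonaws.com", port=5439,
                        dbname="mydb", user="user", password="password")
with conn.cursor() as cur:
    cur.execute("COPY table_name FROM 's3://%s/%s' IAM_ROLE '%s' CSV"
                % (bucket, key, iam_role))
conn.commit()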
The fast_executemany solution from "Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC" would have been perfect; however, this is not supported in psycopg2.
I also found d6tstack (per https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb), but its pd_to_psql doesn't work for Redshift, only PostgreSQL (Redshift can't do COPY ... FROM STDIN).
Any alternatives I can use for my case?
This is my code:
import sqlalchemy as sa
DATABASE = ""
USER = ""
PASSWORD = ""
HOST = "...us-east-1.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "public"
server = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER,PASSWORD,HOST,str(PORT),DATABASE)
engine = sa.create_engine(server)
conn = engine.raw_connection()
with conn.cursor() as cur:
    cur.execute('truncate table_name')
conn.commit()  # commit the TRUNCATE so it isn't rolled back when the connection closes

df.to_sql('table_name', engine, index=False, if_exists='append')  # df is the existing DataFrame
If you are unable to use COPY from S3 and must rely on DML, you could try passing use_batch_mode=True to create_engine():
engine = create_engine('theurl', use_batch_mode=True)
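With the connection URL from the question, that would look something like this (assuming the sqlalchemy-redshift dialect passes the option through to the underlying psycopg2 dialect, where it makes executemany use psycopg2's batch helpers):

import sqlalchemy as sa

# Same Redshift URL as in the question, now with batch mode enabled
bm_engine = sa.create_engine(
    "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, PORT, DATABASE),
    use_batch_mode=True,
)

df.to_sql('table_name', bm_engine, index=False, if_exists='append')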
On this machine, a simple insert of 500 rows to a Redshift cluster showed a reasonable improvement with batch mode enabled:
In [31]: df = pd.DataFrame({'batchno': range(500)})
In [32]: %time df.to_sql('batch', engine, index=False, if_exists='append')
CPU times: user 87.8 ms, sys: 57.6 ms, total: 145 ms
Wall time: 1min 6s
In [33]: %time df.to_sql('batch', bm_engine, index=False, if_exists='append')
CPU times: user 10.3 ms, sys: 4.66 ms, total: 15 ms
Wall time: 9.96 s
Note that pandas 0.23.0 used a multi-values insert by default, and 0.24.0 and up let you opt in to it with the method='multi' argument; in those cases batch mode gives no additional benefit, since a multi-values insert already issues far fewer queries than executemany and should provide a broadly similar improvement in throughput.
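For example, on pandas 0.24.0 and up a rough equivalent without batch mode would be (the chunksize value is illustrative, chosen to keep each generated statement a manageable size):

# method='multi' emits INSERT ... VALUES (...), (...), ... statements,
# sending many rows per query instead of one INSERT per row
df.to_sql('table_name', engine, index=False, if_exists='append',
          method='multi', chunksize=1000)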