I've got 500K rows I want to insert into PostgreSQL using SQLAlchemy.
For speed, I'm inserting them using session.bulk_insert_mappings().
Normally, I'd break up the insert into smaller batches to minimize session bookkeeping. However, bulk_insert_mappings() uses dicts and bypasses a lot of the traditional session bookkeeping.
Will I still see a speed improvement if I break the insert up into smaller discrete batches, say doing an insert every 10K rows?
If so, should I close the PG transaction after every 10K rows, or leave it open the whole time?
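For concreteness, here is a minimal sketch of the batched approach I have in mind; the model, connection string, and 10K batch size are just placeholders:

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class MyTable(Base):  # hypothetical model standing in for the real table
    __tablename__ = "my_table"
    id = Column(Integer, primary_key=True)
    column1 = Column(String)
    column2 = Column(String)

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder DSN
Session = sessionmaker(bind=engine)

BATCH_SIZE = 10_000

def batched_bulk_insert(session, rows):
    """Feed dicts to bulk_insert_mappings() in 10K chunks."""
    for start in range(0, len(rows), BATCH_SIZE):
        session.bulk_insert_mappings(MyTable, rows[start:start + BATCH_SIZE])
    session.commit()
```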
In my experience, you'll see substantial performance improvements if you use INSERT INTO tbl (column1, column2) VALUES (...), (...), ...; as opposed to bulk_insert_mappings(), which uses executemany. In that case you'll want to batch the rows at least at the statement level for sanity.
SQLAlchemy supports generating a multi-row VALUES clause for a single INSERT statement, so you don't have to hand-craft the statement.
Committing between batches probably won't have much of an effect on performance, but the reason to do it would be to avoid keeping a transaction open for too long, which could impact other transactions running on the server.
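A rough sketch of that, using SQLAlchemy Core to build the multi-row VALUES clause and committing after each batch (the table name, connection string, and 10K batch size are illustrative, not prescriptive):

```python
from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder DSN
metadata = MetaData()
my_table = Table("my_table", metadata, autoload_with=engine)  # reflect the existing table

BATCH_SIZE = 10_000

def multirow_insert(rows):
    """Emit one INSERT ... VALUES (...), (...), ... statement per 10K-row batch."""
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        # engine.begin() commits on exit, so each batch ends its own transaction
        with engine.begin() as conn:
            conn.execute(my_table.insert().values(batch))
```

Passing a list of dicts to .values() is what makes SQLAlchemy render a single multi-row VALUES clause instead of an executemany.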
You can also experiment with using COPY to load the data into a temporary table and then INSERTing from that table into the target.
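Something along these lines, using psycopg2's COPY support through the raw DBAPI connection; the temp-table name and column list are made up for illustration, and the tab-separated formatting assumes the values contain no tabs or newlines:

```python
import io

def copy_then_insert(engine, rows):
    """Load rows via COPY into a temp table, then INSERT ... SELECT into the target."""
    raw = engine.raw_connection()  # drop down to the underlying psycopg2 connection
    try:
        cur = raw.cursor()
        cur.execute(
            "CREATE TEMP TABLE tmp_my_table "
            "(LIKE my_table INCLUDING DEFAULTS) ON COMMIT DROP"
        )
        # Build a tab-separated buffer in PostgreSQL's default COPY text format
        buf = io.StringIO("".join(f"{r['column1']}\t{r['column2']}\n" for r in rows))
        cur.copy_expert("COPY tmp_my_table (column1, column2) FROM STDIN", buf)
        cur.execute(
            "INSERT INTO my_table (column1, column2) "
            "SELECT column1, column2 FROM tmp_my_table"
        )
        raw.commit()  # temp table is dropped here due to ON COMMIT DROP
    finally:
        raw.close()
```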