
Is it faster to batch my inserts when using SQLAlchemy bulk_insert_mappings()?

I've got 500K rows I want to insert into PostgreSQL using SQLAlchemy.

For speed, I'm inserting them using session.bulk_insert_mappings().

Normally, I'd break up the insert into smaller batches to minimize session bookkeeping. However, bulk_insert_mappings() uses dicts and bypasses a lot of the traditional session bookkeeping.

Will I still see a speed improvement if I break the insert up into smaller discrete batches, say doing an insert every 10K rows?

If so, should I close the PG transaction after every 10K rows, or leave it open the whole time?

asked Apr 27 '16 by Jeff Widman


1 Answer

In my experience, you'll see substantial performance improvements if you use INSERT INTO tbl (column1, column2) VALUES (...), (...), ...; as opposed to bulk_insert_mappings(), which uses executemany(). In that case you'll want to batch the rows at least at the statement level for sanity.

SQLAlchemy supports generating a multi-row VALUES clause for a single INSERT statement, so you don't have to hand-craft the statement.
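For example, a rough sketch of what that might look like (the users table, its columns, and the connection URL here are placeholders, not taken from the question):

```python
# A minimal sketch of a multi-row VALUES insert, assuming a hypothetical
# "users" table and connection URL; adapt the columns to your schema.
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("email", String),
)

rows = [{"name": "user%d" % i, "email": "user%d@example.com" % i}
        for i in range(10_000)]

# Passing a list of dicts to .values() makes SQLAlchemy render one
# INSERT ... VALUES (...), (...), ... statement rather than an executemany().
stmt = users.insert().values(rows)

with engine.begin() as conn:
    conn.execute(stmt)
```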

Committing between batches probably won't have much of an effect on performance, but the reason to do it would be to avoid holding a transaction open for too long, which could impact other transactions running on the server.
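If you do want to commit between batches, a sketch like the following keeps each 10K-row chunk in its own transaction (again assuming the hypothetical users table above and a list of ~500K row dicts called all_rows):

```python
# A sketch of committing after every 10K-row batch.
BATCH_SIZE = 10_000

def batches(seq, size):
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

for batch in batches(all_rows, BATCH_SIZE):
    # engine.begin() opens a transaction and commits when the block exits,
    # so no single transaction stays open for the whole 500K-row load.
    with engine.begin() as conn:
        conn.execute(users.insert().values(batch))
```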

You can also experiment with using COPY to load the data into a temporary table, then INSERTing from that table into the target table.
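A rough sketch of that route, assuming psycopg2 as the DBAPI driver (the staging table and column names are made up for illustration):

```python
import csv
import io

# Serialize the rows to an in-memory CSV buffer for COPY.
buf = io.StringIO()
writer = csv.writer(buf)
for row in all_rows:
    writer.writerow([row["name"], row["email"]])
buf.seek(0)

raw = engine.raw_connection()  # drop down to the underlying psycopg2 connection
try:
    with raw.cursor() as cur:
        cur.execute("CREATE TEMP TABLE users_staging (name text, email text)")
        # COPY streams the CSV buffer in, which is much faster than INSERTs.
        cur.copy_expert(
            "COPY users_staging (name, email) FROM STDIN WITH (FORMAT csv)",
            buf,
        )
        cur.execute(
            "INSERT INTO users (name, email) "
            "SELECT name, email FROM users_staging"
        )
    raw.commit()
finally:
    raw.close()
```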

answered Sep 26 '22 by univerio