Consider the following three methods of using the SQLAlchemy ORM to insert objects:
(1)

    for obj in objects:
        session.add(obj)

(2)

    session.add_all(objects)

(3)

    session.bulk_save_objects(objects)
Suppose the length of objects is 50,000. Will (1) result in 50,000 INSERT SQL queries, while (2) and (3) each result in a single SQL query? I know these three methods differ a lot in speed, but what are the differences in the underlying implementation details?
(2) is essentially implemented as (1): add_all() simply loops over add(). Both may emit 50,000 INSERT statements during flush if the ORM has to fetch generated values such as primary keys, and they may emit even more if those 50,000 objects have relationships that cascade.
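For reference, the transcripts below assume a minimal model and session along these lines; the Foo model, the connection URL, and echo=True (which logs each emitted statement) are illustrative assumptions, not from the original post:

    from sqlalchemy import create_engine, Column, Integer
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    # A trivial table with only a server-generated primary key, so the
    # INSERTs below are as simple as possible.
    class Foo(Base):
        __tablename__ = "foo"
        id = Column(Integer, primary_key=True)

    engine = create_engine("postgresql://localhost/test", echo=True)
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()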
    In [4]: session.add_all([Foo() for _ in range(5)])

    In [5]: session.commit()
    BEGIN (implicit)
    INSERT INTO foo DEFAULT VALUES RETURNING foo.id
    {}
    ... (repeats 3 times)
    INSERT INTO foo DEFAULT VALUES RETURNING foo.id
    {}
    COMMIT
If you provide primary keys and other DB-generated values beforehand, the Session can combine the separate inserts into a single "executemany" operation when the arguments match:
    In [8]: session.add_all([Foo(id=i) for i in range(5)])

    In [9]: session.commit()
    BEGIN (implicit)
    INSERT INTO foo (id) VALUES (%(id)s)
    ({'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4})
    COMMIT
If your DB-API driver implements executemany() or an equivalent in a way that lets it issue a single statement carrying multiple rows of data, this can result in a single query. For example, after enabling executemany_mode='values', the PostgreSQL log contains the following for the above:

    LOG: statement: INSERT INTO foo (id) VALUES (0),(1),(2),(3),(4)
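That mode is enabled when creating the engine on the psycopg2 dialect (SQLAlchemy 1.3+); a minimal sketch, with a placeholder connection URL:

    # Sketch: ask the psycopg2 dialect to use its execute_values fast
    # execution helper; the URL here is a placeholder, not from the post.
    from sqlalchemy import create_engine

    engine = create_engine(
        "postgresql+psycopg2://localhost/test",
        executemany_mode="values",  # fold executemany() INSERTs into multi-VALUES statements
    )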
The bulk operation of (3) skips most of the Session machinery, such as persisting related objects, in exchange for performance gains. For example, by default it does not fetch generated values such as primary keys, which lets it try to batch changes into fewer "executemany" operations where the statement and arguments match:
    In [12]: session.bulk_save_objects([Foo() for _ in range(5)])
    BEGIN (implicit)
    INSERT INTO foo DEFAULT VALUES
    ({}, {}, {}, {}, {})

    In [13]: session.commit()
    COMMIT
It may still emit multiple statements, again depending on the data and on the DB-API driver in use. The documentation on bulk operations is a good read.
With psycopg2's fast execution helpers enabled, the above produces the following in the PostgreSQL log:

    LOG: statement: INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES
In other words, multiple statements have been joined into a "single" statement sent to the server.
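If you do need the generated primary keys back from a bulk save, bulk_save_objects() accepts return_defaults=True; per the documentation it then fetches the defaults, but with row-by-row INSERTs, giving up most of the batching shown above. A small sketch:

    # Sketch: return_defaults=True populates server-generated values such as
    # primary keys, at the cost of emitting one INSERT per row.
    objects = [Foo() for _ in range(5)]
    session.bulk_save_objects(objects, return_defaults=True)
    print([obj.id for obj in objects])  # ids are now populated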
So in the end the answer for all three is "it depends", which may of course seem frustrating.
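If in doubt, measure against your own schema, driver, and settings. A rough timing harness, assuming the Foo model and session from the sketch earlier (the 50,000-object count mirrors the question):

    import time

    def timed(label, fn):
        session.rollback()   # start each run from a clean transaction
        start = time.time()
        fn()
        session.flush()      # force any pending INSERTs to execute
        print(label, time.time() - start)
        session.rollback()   # discard the inserted rows between runs

    timed("(1) add() in a loop", lambda: [session.add(Foo()) for _ in range(50000)])
    timed("(2) add_all()", lambda: session.add_all([Foo() for _ in range(50000)]))
    timed("(3) bulk_save_objects()", lambda: session.bulk_save_objects([Foo() for _ in range(50000)]))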