Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQLAlchemy "default" vs "server_default" performance

Is there a performance advantage (or disadvantage) when using default instead of server_default for mapping table column default values when using SQLAlchemy with PostgreSQL?

My understanding is that default renders the expression in the INSERT (usually) and that server_default places the expression in the CREATE TABLE statement. Seems like server_default is analogous to typical handling of defaults directly in the db such as:

CREATE TABLE example (
    id serial PRIMARY KEY,
    updated timestamptz DEFAULT now()
);

...but it is not clear to me if it is more efficient to handle defaults on INSERT or via table creation.

Would there be any performance improvement or degradation for row inserts if each of the default parameters in the example below were changed to server_default?

from uuid import uuid4
from sqlalchemy import Column, Boolean, DateTime, Integer
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func

Base = declarative_base()

class Item(Base):
    __tablename__ = 'item'

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    count = Column(Integer, nullable=False, default=0)
    flag = Column(Boolean, nullable=False, default=False)
    updated = Column(DateTime(timezone=True), nullable=False, default=func.now())

NOTE: The best explanation I found so far for when to use default instead of server_default does not address performance (see Mike Bayer's SO answer on the subject). My oversimplified summary of that explanation is that default is preferred over server_default when...

  • The db can't handle the expression you need or want to use for the default value.
  • You can't or don't want to modify the schema directly.

...so the question remains as to whether performance should be considered when choosing between default and server_default?

like image 694
benvc Avatar asked Sep 20 '18 18:09

benvc


2 Answers

It is impossible to give you a 'this is faster' answer, because performance per default value expression can vary widely, both on the server and in Python. A function to retrieve the current time behaves differently from a scalar default value.

Next, you must realise that defaults can be provided in five different ways:

  • Client-side scalar defaults. A fixed value, such a 0 or True. The value is used in an INSERT statement.
  • Client-side Python function. Called each time a default is needed, produces the value to insert, used the same way as a scalar default from there on out. These can be context sensitive (have access to the current execution context with values to be inserted).
  • Client-side SQL expression; this generates an extra piece of SQL expression that is then used in the query and executed on the server to produce a value.
  • Server-side DLL expression are SQL expressions that are then stored in the table definition, so are part of the schema. The server uses these to fill a value for any columns omitted from INSERT statements, or when a column value is set to DEFAULT in an INSERT or UPDATE statement.
  • Server-side implicit defaults or triggers, where other DLL such as triggers or specific database features provide a default value for columns.

Note that when it comes to a SQL expression determining the default value, be that a client-side SQL expression, a server-side DLL expression, or a trigger, it makes very little difference to a database where the default value expression is coming from. The query executor will need to know how to produce values for a given column, once that's parsed out of the DML statement or the schema definition, the server still has to execute the expression for each row.

Choosing between these options is rarely going to be based on performance alone, performance should at most be but one of multiple aspects you consider. There are many factors involved here:

  • default with a scalar or Python function directly produces a Python default value, then sends the new value to the server when inserting. Python code can access the default value before the data is inserted into the database.
  • A client-side SQL expression, a server_default value, and server-side implicit defaults and triggers all have the server generate the default, which then must be fetched by the client if you want to be able to access it in the same SQLAlchemy session. You can't access the value until the object has been inserted into the database.

    Depending on the exact query and database support, SQLAlchemy may have to make extra SQL queries to either generate a default before the INSERT statement or run a separate SELECT afterwards to fetch the defaults that have been inserted. You can control when this happens (directly when inserting or on first access after flushing, with the eager_defaults mapper configuration).

  • If you have multiple clients on different platforms accessing the same database, a server_default or other default attached to the schema (such as a trigger) ensures that all clients will use the same defaults, regardless, while defaults implemented in Python can't be accessed by other platforms.

When using PostgreSQL, SQLAlchemy can make use of the RETURNING clause for DML statements, which gives a client access to server-side generated defaults in a single step.

So when using a server_default column default that calculates a new value for each row (not a scalar value), you save a small amount of Python-side time, and save a small amount of network bandwidth as you are not sending data for that column over to the database. The database could be faster creating that same value, or it could be slower; it largely depends on the type of operation. If you need to have access to the generated default value from Python, in the same transaction, you do then have to wait for a return stream of data, parsed out by SQLAlchemy. All these details can become insignificant compared to everything else that happens around inserting or updating rows, however.

Do understand that a ORM is not suitable to be used for high-performance bulk row inserts or updates; quoting from the SQAlchemy Performance FAQ entry:

The SQLAlchemy ORM uses the unit of work pattern when synchronizing changes to the database. This pattern goes far beyond simple “inserts” of data. It includes that attributes which are assigned on objects are received using an attribute instrumentation system which tracks changes on objects as they are made, includes that all rows inserted are tracked in an identity map which has the effect that for each row SQLAlchemy must retrieve its “last inserted id” if not already given, and also involves that rows to be inserted are scanned and sorted for dependencies as needed. Objects are also subject to a fair degree of bookkeeping in order to keep all of this running, which for a very large number of rows at once can create an inordinate amount of time spent with large data structures, hence it’s best to chunk these.

Basically, unit of work is a large degree of automation in order to automate the task of persisting a complex object graph into a relational database with no explicit persistence code, and this automation has a price.

ORMs are basically not intended for high-performance bulk inserts - this is the whole reason SQLAlchemy offers the Core in addition to the ORM as a first-class component.

Because an ORM like SQLAlchemy comes with a hefty overhead price, any performance differences between a server-side or Python-side default quickly disappears in the noise of ORM operations.

So if you are concerned about performance for large-quantity insert or update operations, you would want to use bulk operations for those, and enable the psycopg2 batch execution helpers to really get a speed boost. When using these bulk operations, I'd expect server-side defaults to improve performance just by saving bandwidth moving row data from Python to the server, but how much depends on the exact nature of the default values.

If ORM insert and update performance outside of bulk operations is a big issue for you, you need to test your specific options. I'd start with the SQLAlchemy examples.performance package and add your own test suite using two models that differ only in a single server_default and default configuration.

like image 169
Martijn Pieters Avatar answered Oct 11 '22 06:10

Martijn Pieters


There's something else important rather than just comparing the performance of the two

If you needed to add a new Column create_at (Not Null) to an existing Table User with some data in it, default will not work.

If used default, during upgrading the database, the error will occur saying cannot insert Null value to existing data in the table. And this will cause significant troubles if you want to maintain your data, even just for testing.

And when used server_default, during upgrading the DB, database will insert the current DateTime value to all previous existing testing data.

So in this case, only server_default will work.

like image 22
Gary Bao 鲍昱彤 Avatar answered Oct 11 '22 06:10

Gary Bao 鲍昱彤