
SQLAlchemy automap: Best practices for performance

Tags:

sqlalchemy

I'm building a Python app around an existing MySQL database and am using automap to infer tables and relationships:

    base = automap_base()

    self.engine = create_engine(
        'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
            config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
        ), echo=False
    )

    # reflect the tables
    base.prepare(self.engine, reflect=True)

    self.TableName = base.classes.table_name

Using this I can do things like session.query(TableName) and so on. However, I'm worried about performance, because every time the app runs it will do the whole inference again.

  • Is this a legitimate concern?
  • If so, is there a possibility to 'cache' the output of Automap?
asked Jan 09 '17 by user2449397

People also ask

What is Automap in SQLAlchemy?

Define an extension to the sqlalchemy.ext.declarative system which automatically generates mapped classes and relationships from a database schema, typically though not necessarily one which is reflected.
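
As a minimal sketch of what that looks like in practice (the connection URL and the users table are made up for illustration):

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

# hypothetical connection URL and table, for illustration only
engine = create_engine('mysql://user:password@localhost/mydb')

Base = automap_base()
Base.prepare(engine, reflect=True)   # reflect the schema and generate classes

User = Base.classes.users            # mapped class generated from the 'users' table
session = Session(engine)
print(session.query(User).count())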

How fast is SQLAlchemy?

SQLAlchemy is very, very fast. It's just that users tend to be unaware of just how much functionality is being delivered, and confuse an ORM result set with that of a raw database cursor.

Should I use SQLAlchemy core or ORM?

If you want to view your data in a more schema-centric view (as used in SQL), use Core. If you have data for which business objects are not needed, use Core. If you view your data as business objects, use ORM. If you are building a quick prototype, use ORM.
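
To make the distinction concrete, here is the same lookup written both ways, reusing the engine, Base, session, and hypothetical users table from the sketch above:

from sqlalchemy import select

# Core: schema-centric -- works with Table objects and plain rows
users = Base.metadata.tables['users']
with engine.connect() as conn:
    rows = conn.execute(select([users]).where(users.c.id == 1)).fetchall()

# ORM: object-centric -- returns mapped business objects
user = session.query(User).filter(User.id == 1).one()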


3 Answers

Performance can be a legitimate concern. If the database schema is not changing, it can be time-consuming to reflect the database every time a script runs. This is more of an issue during development than when starting up a long-running application. It's also a significant time saver if your database is on a remote server (again, particularly during development).

I use code that is similar to the answer here (as noted by @ACV). The general plan is to perform the reflection the first time, then pickle the metadata object. The next time the script runs, it looks for the pickle file and uses that instead of reflecting. The file can be anywhere, but I place mine in ~/.sqlalchemy_cache. Here is an example based on your code.

import os
import pickle

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base

self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)

metadata_pickle_filename = "mydb_metadata"
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(cache_file)
    except IOError:
        # cache file not found - no problem, reflect as usual
        pass

if cached_metadata:
    # build the classes from the cached metadata -- no round trip to the database
    base = automap_base(metadata=cached_metadata)
    base.prepare()
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables

    # save the metadata for future runs
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(base.metadata, cache_file)
    except IOError:
        # couldn't write the file for some reason
        pass

self.TableName = base.classes.table_name

For anyone using declarative table class definitions instead of automap, the same caching approach works. Assuming a Base object bound to the engine:

import os
import pickle

from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base(bind=engine)

metadata_pickle_filename = "ModelClasses_trilliandb_trillian.pickle"

# ------------------------------------------
# Load the cached metadata if it's available
# ------------------------------------------
# NOTE: delete the cached file if the database schema changes!!
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(cache_file)
    except IOError:
        # cache file not found - no problem
        pass
# ------------------------------------------

# define all tables
#
class MyTable(Base):
    if cached_metadata:
        __table__ = cached_metadata.tables['my_schema.my_table']
    else:
        __tablename__ = 'my_table'
        __table_args__ = {'autoload':True, 'schema':'my_schema'}

...
# ----------------------------------------
# If no cached metadata was found, save it
# ----------------------------------------
if cached_metadata is None:
    # cache the metadata for future loading
    # - MUST DELETE IF THE DATABASE SCHEMA HAS CHANGED
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(Base.metadata, cache_file)
    except IOError:
        # couldn't write the file for some reason
        pass

Important note!! If the database schema changes, you must delete the cached file to force the code to autoload and write a new cache. If you don't, your code will keep using the stale cached schema and the changes will not be visible to it. It's an easy thing to forget.
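
One way to make that harder to forget is a small helper you call from your migration or deploy step; a sketch, assuming the same cache_path and metadata_pickle_filename variables as above:

def clear_metadata_cache(cache_path, metadata_pickle_filename):
    """Delete the pickled metadata so the next run reflects the live schema."""
    cache_file = os.path.join(cache_path, metadata_pickle_filename)
    try:
        os.remove(cache_file)
        print("metadata cache cleared: %s" % cache_file)
    except OSError:
        # no cache file present -- nothing to do
        pass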

answered Sep 20 '22 by Demitri

I think that "reflecting" the structure of your database at run time is not the way to go. Unless your app needs to "infer" things from the structure, the way static code analysis infers things from source files, it is unnecessary. The other reason to reflect at run time is the reduced effort to start "using" the database with SQLAlchemy, but there is a better way to get that:

Use something like SQLACodegen (https://pypi.python.org/pypi/sqlacodegen):

It will "reflect" your database once and create a 99.5% accurate set of declarative SQLAlchemy models for you to work with. This does require that you subsequently keep the models in sync with the structure of the database, but I would assume that is not a big concern, seeing as the tables you're working with are already stable enough that run-time reflection of their structure would rarely show any changes.

Generating the declarative models is essentially a "cache" of the reflection. It's just that SQLACodegen saves it as a very readable set of classes and fields instead of data in memory. Even with a changing structure, and my own "changes" to the generated declarative models, I still re-run SQLACodegen later in a project whenever I make database changes. It means that my models are generally consistent with one another and that I don't have things such as typos and data mismatches caused by copy-pasting.
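
Typical usage is a one-off run against the live database, e.g. sqlacodegen mysql://user:password@localhost/mydb --outfile models.py, after which you maintain models.py by hand. The generated module looks roughly like this (table and column names here are invented for illustration):

# models.py -- roughly what sqlacodegen emits
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String(255))

class Address(Base):
    __tablename__ = 'addresses'

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    email = Column(String(255))

    user = relationship('User')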

answered Sep 21 '22 by Zoran Pavlovic


The answer to your first question is largely subjective. Reflection adds database queries (to fetch the schema metadata) to your application's load time. Whether or not that overhead is significant depends on your project's requirements.

For reference, I have an internal tool at work that uses a reflection pattern because the load time is acceptable for our team. That might not be the case if it were an externally facing product. My hunch is that for most applications the reflection overhead will not dominate the total application load time.

If you decide it is significant for your purposes, this question has an interesting answer in which the user pickles the database metadata in order to cache it locally.
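
If you want to quantify the overhead before deciding, timing the reflection step directly is enough; a quick sketch using an engine like the one in the question:

import time

from sqlalchemy.ext.automap import automap_base

start = time.time()
base = automap_base()
base.prepare(engine, reflect=True)   # the step a metadata cache would skip
print("reflection took %.2f seconds" % (time.time() - start))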

answered Sep 22 '22 by ACV