Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to instruct SQLAlchemy ORM to execute multiple queries in parallel when loading relationships?

I am using SQLAlchemy's ORM. I have a model that has multiple many-to-many relationships:

User
User <--MxN--> Organization
User <--MxN--> School
User <--MxN--> Credentials

I am implementing these using association tables, so there are also User_to_Organization, User_to_School and User_to_Credentials tables that I don't directly use.

Now, when I attempt to load a single User (using its PK identifier) and its relationships (and related models) using joined eager loading, I get horrible performance (15+ seconds). I assume this is due to this issue:

When multiple levels of depth are used with joined or subquery loading, loading collections-within- collections will multiply the total number of rows fetched in a cartesian fashion. Both forms of eager loading always join from the original parent class.

If I introduce another level or two to the hierarchy:

Organization <--1xN--> Project
School <--1xN--> Course
Project <--MxN--> Credentials
Course <--MxN--> Credentials

The query takes 50+ seconds to complete, even though the total amount of records in each table is fairly small.

Using lazy loading, I am required to manually load each relationship, and there are multiple round trips to the server.

e.g. Operations, executed serially as queries:

  • Get user
  • Get user's Organizations
  • Get user's Schools
  • Get user's credentials
  • For each Organization, get its Projects
  • For each School, get its Courses
  • For each Project, get its Credentials
  • For each Course, get its Credentials

Still, it all finishes in less than 200ms.

I was wondering if there is anyway to indeed use lazy loading, but perform the relationship loading queries in parallel. For example, using the concurrent module, asyncio or by using gevent.

e.g. Step 1 (in parallel):

  • Get user
  • Get user's Organizations
  • Get user's Schools
  • Get user's credentials

Step 2 (in parallel):

  • For each Organization, get its Projects
  • For each School, get its Courses

Step 3 (in parallel):

  • For each Project, get its Credentials
  • For each Course, get its Credentials

Actually, at this point, making a subquery type load can also work, that is, return Organization and OrganizationID/Project/Credentials in two separate queries:

e.g. Step 1 (in parallel):

  • Get user
  • Get user's Organizations
  • Get user's Schools
  • Get user's credentials

Step 2 (in parallel):

  • Get Organizations
  • Get Schools
  • Get the Organizations' Projects, join with Credentials
  • Get the Schools' Courses, join with Credentials
like image 965
advance512 Avatar asked Jan 24 '17 11:01

advance512


People also ask

What is lazy loading in SQLAlchemy?

The loading of relationships falls into three categories; lazy loading, eager loading, and no loading. Lazy loading refers to objects are returned from a query without the related objects loaded at first.

How does the querying work with SQLAlchemy?

All SELECT statements generated by SQLAlchemy ORM are constructed by Query object. It provides a generative interface, hence successive calls return a new Query object, a copy of the former with additional criteria and options associated with it.

What is the difference between SQLAlchemy core and ORM?

Choosing Between SQLAlchemy Core and ORM The two modes use slightly different syntax, but the biggest difference between Core and ORM is the view of data as schema or business objects. SQLAlchemy Core has a schema-centric view, which like traditional SQL is focused around tables, keys, and index structures.

What is Backref in SQLAlchemy?

In Flask-SQLAlchemy, the backref parameter in relationship method allows you to declare a new property under a specified class as seen in the example in their docs: class Person(db.Model): id = db.Column(db.Integer, primary_key=True) name = db.Column(db.String(50)) addresses = db.relationship('Address', backref='person ...


1 Answers

The first thing you're going to want to do is check to see what queries are actually being executed on the db. I wouldn't assume that SQLAlchemy is doing what you expect unless you're very familiar with it. You can use echo=True on your engine configuration or look at some db logs (not sure how to do that with mysql).

You've mentioned that you're using different loading strategies so I guess you've read through the docs on that ( http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html). For what you're doing, I'd probably recommend subquery load, but it totally depends on the number of rows / columns you're dealing with. In my experience it's a good general starting point though.

One thing to note, you might need to something like:

db.query(Thing).options(subqueryload('A').subqueryload('B')).filter(Thing.id==x).first()

With filter.first rather that get, as the latter case won't re-execute queries according to your loading strategy if the primary object is already in the identity map.

Finally, I don't know your data - but those numbers sound pretty abysmal for anything short of a huge data set. Check that you have the correct indexes specified on all your tables.

You may have already been through all of this, but based on the information you've provided, it sounds like you need to do more work to narrow down your issue. Is it the db schema, or is it the queries SQLA is executing?

Either way, I'd say, "no" to running multiple queries on different connections. Any attempt to do that could result in inconsistent data coming back to your app, and if you think you've got issues now..... :-)

like image 156
Aidan Kane Avatar answered Sep 19 '22 08:09

Aidan Kane