
How to instruct SQLAlchemy ORM to execute multiple queries in parallel when loading relationships?

Tags: python, mysql, orm, parallel-processing, sqlalchemy

I am using SQLAlchemy's ORM. I have a model that has multiple many-to-many relationships:

User
User <--MxN--> Organization
User <--MxN--> School
User <--MxN--> Credentials

I am implementing these using association tables, so there are also User_to_Organization, User_to_School and User_to_Credentials tables that I don't directly use.

Now, when I attempt to load a single User (using its PK identifier) and its relationships (and related models) using joined eager loading, I get horrible performance (15+ seconds). I assume this is due to this issue:

When multiple levels of depth are used with joined or subquery loading, loading collections-within-collections will multiply the total number of rows fetched in a cartesian fashion. Both forms of eager loading always join from the original parent class.

If I introduce another level or two to the hierarchy:

Organization <--1xN--> Project
School <--1xN--> Course
Project <--MxN--> Credentials
Course <--MxN--> Credentials

The query takes 50+ seconds to complete, even though the total number of records in each table is fairly small.
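To make the row multiplication concrete, here is a minimal sketch. It uses SQLite from the standard library as a stand-in for MySQL and plain SQL instead of the ORM, and the collection size of 10 is a made-up number, not a measurement from my data. It compares one joined statement over three independent collections with three separate statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY);
    CREATE TABLE user_to_organization (user_id INTEGER, organization_id INTEGER);
    CREATE TABLE user_to_school (user_id INTEGER, school_id INTEGER);
    CREATE TABLE user_to_credentials (user_id INTEGER, credentials_id INTEGER);
    INSERT INTO user (id) VALUES (1);
""")

# Hypothetical sizes: one user linked to 10 rows in each collection.
N = 10
ASSOCIATIONS = [
    ("user_to_organization", "organization_id"),
    ("user_to_school", "school_id"),
    ("user_to_credentials", "credentials_id"),
]
for table, column in ASSOCIATIONS:
    conn.executemany(
        f"INSERT INTO {table} (user_id, {column}) VALUES (1, ?)",
        [(i,) for i in range(N)],
    )

# One joined statement: every combination of the three collections comes back.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM user u
    LEFT JOIN user_to_organization o ON o.user_id = u.id
    LEFT JOIN user_to_school s ON s.user_id = u.id
    LEFT JOIN user_to_credentials c ON c.user_id = u.id
    WHERE u.id = 1
""").fetchone()[0]

# Separate statements: the row counts add up instead of multiplying.
separate_rows = sum(
    conn.execute(f"SELECT COUNT(*) FROM {table} WHERE user_id = 1").fetchone()[0]
    for table, _ in ASSOCIATIONS
)

print(joined_rows, separate_rows)  # prints "1000 30"
```

With three sibling collections of 10 rows each, the joined form returns 10 × 10 × 10 rows while the separate form returns 10 + 10 + 10, and each extra level of nesting multiplies the joined figure again.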

Using lazy loading, I am required to manually load each relationship, and there are multiple round trips to the server.

e.g. Operations, executed serially as queries:

Get user
Get user's Organizations
Get user's Schools
Get user's credentials
For each Organization, get its Projects
For each School, get its Courses
For each Project, get its Credentials
For each Course, get its Credentials

Still, it all finishes in less than 200ms.

I was wondering if there is any way to keep using lazy loading, but perform the relationship loading queries in parallel. For example, using the concurrent.futures module, asyncio or gevent.

e.g. Step 1 (in parallel):

Get user
Get user's Organizations
Get user's Schools
Get user's credentials

Step 2 (in parallel):

For each Organization, get its Projects
For each School, get its Courses

Step 3 (in parallel):

For each Project, get its Credentials
For each Course, get its Credentials

Actually, at this point, a subquery-style load could also work, that is, returning Organization and OrganizationID/Project/Credentials in two separate queries:

e.g. Step 1 (in parallel):

Get user
Get user's Organizations
Get user's Schools
Get user's credentials

Step 2 (in parallel):

Get Organizations
Get Schools
Get the Organizations' Projects, join with Credentials
Get the Schools' Courses, join with Credentials
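The staged plan above could be sketched roughly as follows. This is only an illustration of the control flow using concurrent.futures: the `get_*` functions and the in-memory dictionaries are hypothetical stand-ins for real queries, and in a real version each worker would need its own Session, since a SQLAlchemy Session is not meant to be shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory data standing in for the real tables.
ORGANIZATIONS = {1: ["org-a", "org-b"]}
SCHOOLS = {1: ["school-a"]}
PROJECTS = {"org-a": ["p1", "p2"], "org-b": ["p3"]}
COURSES = {"school-a": ["c1"]}


def get_organizations(user_id):
    # Placeholder for a query that would run on its own session/connection.
    return ORGANIZATIONS.get(user_id, [])


def get_schools(user_id):
    return SCHOOLS.get(user_id, [])


def get_projects(organization):
    return PROJECTS.get(organization, [])


def get_courses(school):
    return COURSES.get(school, [])


def load_user_graph(user_id):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Step 1: lookups that only need the user id run side by side.
        orgs_future = pool.submit(get_organizations, user_id)
        schools_future = pool.submit(get_schools, user_id)
        orgs, schools = orgs_future.result(), schools_future.result()

        # Step 2: fan out over the results of step 1 (map keeps input order).
        projects = list(pool.map(get_projects, orgs))
        courses = list(pool.map(get_courses, schools))

    return {
        "organizations": dict(zip(orgs, projects)),
        "schools": dict(zip(schools, courses)),
    }


graph = load_user_graph(1)
print(graph)
```

Each step has to wait for the previous one to finish, so the number of sequential round trips equals the depth of the hierarchy rather than the number of relationships.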


asked Jan 24 '17 11:01

advance512

1 Answer

The first thing you're going to want to do is check what queries are actually being executed on the db. I wouldn't assume that SQLAlchemy is doing what you expect unless you're very familiar with it. You can use echo=True on your engine configuration or look at the db logs (I'm not sure how to do that with MySQL).

You've mentioned that you're using different loading strategies, so I guess you've read through the docs on that (http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html). For what you're doing, I'd probably recommend subquery load, but it totally depends on the number of rows / columns you're dealing with. In my experience it's a good general starting point though.
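If the echo output is too noisy to read, one option is to capture it instead. This is a small sketch that relies on SQLAlchemy logging its statements at INFO level under the "sqlalchemy.engine" logger name; the `StatementCollector` class and the `count_selects` helper are names made up for this example.

```python
import logging


class StatementCollector(logging.Handler):
    """Keeps every message logged by the engine so it can be inspected later."""

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


collector = StatementCollector()
engine_logger = logging.getLogger("sqlalchemy.engine")
engine_logger.setLevel(logging.INFO)  # INFO is the level that includes SQL text
engine_logger.addHandler(collector)

# ... run the ORM code you are investigating here ...


def count_selects(messages):
    # Parameter lines are logged as separate messages, so only count SQL text.
    return sum(1 for m in messages if m.lstrip().upper().startswith("SELECT"))


print(count_selects(collector.messages))
```

Comparing the count (and the statements themselves) between loading strategies tells you quickly whether you are looking at one huge joined query or many small ones.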

One thing to note, you might need to do something like:

db.query(Thing).options(subqueryload('A').subqueryload('B')).filter(Thing.id==x).first()

Use filter(...).first() rather than get(), because get() won't re-execute queries according to your loading strategy if the primary object is already in the identity map.
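Here is a fuller sketch of that pattern applied to part of the hierarchy from the question. The table names, column names and the in-memory SQLite database are assumptions for illustration, not your actual schema, and it assumes SQLAlchemy 1.4 or later. The event listener is only there to record which SELECT statements get emitted.

```python
from sqlalchemy import Column, ForeignKey, Integer, Table, create_engine, event
from sqlalchemy.orm import Session, declarative_base, relationship, subqueryload

Base = declarative_base()

# Hypothetical association table for User <--MxN--> Organization.
user_to_organization = Table(
    "user_to_organization",
    Base.metadata,
    Column("user_id", ForeignKey("user.id"), primary_key=True),
    Column("organization_id", ForeignKey("organization.id"), primary_key=True),
)


class Project(Base):
    __tablename__ = "project"
    id = Column(Integer, primary_key=True)
    organization_id = Column(ForeignKey("organization.id"))


class Organization(Base):
    __tablename__ = "organization"
    id = Column(Integer, primary_key=True)
    projects = relationship("Project")


class User(Base):
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    organizations = relationship("Organization", secondary=user_to_organization)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    org = Organization(projects=[Project(), Project()])
    session.add(User(id=1, organizations=[org]))
    session.commit()

# Record the SELECT statements emitted from here on.
selects = []


@event.listens_for(engine, "before_cursor_execute")
def record_select(conn, cursor, statement, parameters, context, executemany):
    if statement.lstrip().upper().startswith("SELECT"):
        selects.append(statement)


with Session(engine) as session:
    user = (
        session.query(User)
        .options(
            subqueryload(User.organizations).subqueryload(Organization.projects)
        )
        .filter(User.id == 1)
        .first()
    )
    # Both collections are already loaded, so this emits no further SQL.
    project_count = sum(len(o.projects) for o in user.organizations)

print(project_count, len(selects))
```

The idea is that each relationship level is fetched by its own statement, so the row counts add up instead of multiplying as they do with a single joined query.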

Finally, I don't know your data - but those numbers sound pretty abysmal for anything short of a huge data set. Check that you have the correct indexes specified on all your tables.
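As a rough illustration of the index check, the sketch below uses SQLite from the standard library as a stand-in, with a hypothetical association table created without any key or index. On MySQL the equivalent checks would be EXPLAIN on the slow statement and SHOW INDEX on the tables involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical association table, deliberately created without a key or index.
conn.execute(
    "CREATE TABLE user_to_organization (user_id INTEGER, organization_id INTEGER)"
)

QUERY = "SELECT organization_id FROM user_to_organization WHERE user_id = ?"


def query_plan(connection):
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (1,)).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn)

conn.execute(
    "CREATE INDEX ix_user_to_organization_user_id "
    "ON user_to_organization (user_id)"
)
plan_with_index = query_plan(conn)

print(plan_without_index)  # a full scan of the table
print(plan_with_index)     # a search that names the new index
```

If the association tables in your schema show up as full scans in the plan, that alone could explain a large part of the slowdown once several of them are joined together.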

You may have already been through all of this, but based on the information you've provided, it sounds like you need to do more work to narrow down your issue. Is it the db schema, or is it the queries SQLA is executing?

Either way, I'd say "no" to running multiple queries on different connections. Any attempt to do that could result in inconsistent data coming back to your app, and if you think you've got issues now... :-)


answered Sep 19 '22 08:09

Aidan Kane
