
SQLAlchemy Many-To-Many performance

I have a database relationship with a Many-To-Many association but the association table itself contains a lot of attributes that need to be accessed, so I made three classes:

class User(Base):
    id = Column(Integer, primary_key=True)
    attempts = relationship("UserAttempt", backref="user", lazy="subquery")

class Challenge(Base):
    id = Column(Integer, primary_key=True)
    attempts = relationship("UserAttempt", backref="challenge", lazy='subquery')

class UserAttempt(Base):
    challenge_id = Column(Integer, ForeignKey('challenge.id'), primary_key=True)
    user_id = Column(Integer, ForeignKey('user.id'), primary_key=True)

This is a simplified case, of course; I left out the other attributes that I need to access. The point is that each User can attempt any number of Challenges, hence the UserAttempt table, which describes one particular user working on one challenge.

The problem now: when I query for all Users and then look at each attempt, everything is fine. But when I look at the challenge for an attempt, it explodes into numerous subqueries, which is bad for performance.

What I actually want from SQLAlchemy is to pull all (or all relevant) Challenges at once and then associate them with the relevant attempts. It is not a big deal whether all challenges are pulled or only those which have an actual association, as the number of challenges is only between 100 and 500.

My solution right now is not very elegant: I pull all relevant attempts, challenges and users separately and then associate them by hand. I loop through all attempts, add each one to its challenge and user, then set the challenge and user on the attempt as well. That seems like a brute-force solution that should not be necessary.

However, every approach I tried (e.g. varying "lazy" parameters, altered queries, etc.) has led to hundreds or even thousands of queries. I have also tried to write plain SQL queries that would yield my desired results and came up with something along the lines of SELECT * FROM challenge WHERE id IN (SELECT challenge_id FROM attempts). That worked well, but I cannot get it translated to SQLAlchemy.

Thank you very much in advance for any guidance you may have to offer.

javex asked Dec 27 '22 08:12


1 Answer

What I actually want from SQLAlchemy is to pull all (or all relevant) Challenges at once and then associate them with the relevant attempts. It is not a big deal whether all challenges are pulled or only those which have an actual association,

First, take that lazy='subquery' directive off relationship(); fixing relationships to always load everything is why you're getting the explosion of queries. Specifically, you're getting the Challenge->attempts eager load once for each lazy load of UserAttempt->challenge, so you've sort of designed the worst possible loading combination here :).

With that fixed, there are two approaches.

One is to keep in mind that a many-to-one association is, in the usual case, first looked up by primary key in the Session's in-memory identity map, and if the object is present, no SQL is emitted. So I think you could get exactly the effect you're describing using a technique I use often:

all_challenges = session.query(Challenge).all()

for user in some_users:    # however you got these
    for attempt in user.attempts:   # however you got these
        do_something_with(attempt.challenge)  # no SQL will be emitted
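
To make the identity-map behaviour concrete, here is a minimal, self-contained sketch. It assumes SQLAlchemy 1.4 or later and an in-memory SQLite database; the table names, the sample rows and the statement-recording listener are illustrative assumptions that are not part of the original question. The relationships deliberately carry no lazy='subquery' setting.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, event
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class User(Base):
    __tablename__ = "user"  # table names are assumed for this sketch
    id = Column(Integer, primary_key=True)
    attempts = relationship("UserAttempt", backref="user")

class Challenge(Base):
    __tablename__ = "challenge"
    id = Column(Integer, primary_key=True)
    attempts = relationship("UserAttempt", backref="challenge")

class UserAttempt(Base):
    __tablename__ = "user_attempt"
    challenge_id = Column(Integer, ForeignKey("challenge.id"), primary_key=True)
    user_id = Column(Integer, ForeignKey("user.id"), primary_key=True)

engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

# Record every SQL statement sent to the database.
statements = []

@event.listens_for(engine, "before_cursor_execute")
def record(conn, cursor, statement, parameters, context, executemany):
    statements.append(statement)

# Sample data: two users who each attempted two challenges.
with Session(engine) as session:
    session.add_all([User(id=1), User(id=2), Challenge(id=10), Challenge(id=11)])
    session.add_all(
        [UserAttempt(user_id=u, challenge_id=c) for u in (1, 2) for c in (10, 11)]
    )
    session.commit()

with Session(engine) as session:
    # Keep a reference so the challenges stay in the Session's identity map.
    all_challenges = session.query(Challenge).all()
    attempts = session.query(UserAttempt).all()

    statements.clear()
    # Each attempt.challenge is resolved from the identity map by primary key.
    challenge_ids = sorted(a.challenge.id for a in attempts)
    lazy_loads = len(statements)

print("statements emitted while reading attempt.challenge:", lazy_loads)
```

The point of holding on to all_challenges is that the Session only references its objects weakly, so the preloaded rows should stay referenced for as long as the loop runs.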

If you wanted to use this preloading approach with exactly the "SELECT * FROM challenge WHERE id IN (SELECT challenge_id FROM attempt)" query:

all_challenges = session.query(Challenge).\
                  filter(Challenge.id.in_(session.query(UserAttempt.challenge_id))).all()

though this is likely more efficient as a JOIN:

all_challenges = session.query(Challenge).\
                  join(Challenge.attempts).all()

or with DISTINCT, since the join yields one challenge row for every matching UserAttempt row:

all_challenges = session.query(Challenge).distinct().\
                  join(Challenge.attempts).all()
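
As a rough check that the IN and JOIN variants select the same challenges, the following sketch runs both against a small in-memory SQLite database. It assumes SQLAlchemy 1.4 or later; the table names and sample rows are made up for illustration, and it passes a select() construct to in_() rather than a Query object, which is a substitution on my part.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class User(Base):
    __tablename__ = "user"  # table names are assumed for this sketch
    id = Column(Integer, primary_key=True)
    attempts = relationship("UserAttempt", backref="user")

class Challenge(Base):
    __tablename__ = "challenge"
    id = Column(Integer, primary_key=True)
    attempts = relationship("UserAttempt", backref="challenge")

class UserAttempt(Base):
    __tablename__ = "user_attempt"
    challenge_id = Column(Integer, ForeignKey("challenge.id"), primary_key=True)
    user_id = Column(Integer, ForeignKey("user.id"), primary_key=True)

engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Challenge 12 is never attempted; challenge 10 is attempted twice.
    session.add_all([User(id=1), User(id=2)])
    session.add_all([Challenge(id=10), Challenge(id=11), Challenge(id=12)])
    session.add_all([
        UserAttempt(user_id=1, challenge_id=10),
        UserAttempt(user_id=1, challenge_id=11),
        UserAttempt(user_id=2, challenge_id=10),
    ])
    session.commit()

    # WHERE id IN (SELECT challenge_id FROM user_attempt)
    attempted = select(UserAttempt.challenge_id)
    via_in = session.query(Challenge).filter(Challenge.id.in_(attempted)).all()

    # SELECT DISTINCT ... FROM challenge JOIN user_attempt
    via_join = session.query(Challenge).join(Challenge.attempts).distinct().all()

    ids_in = sorted(c.id for c in via_in)
    ids_join = sorted(c.id for c in via_join)

print(ids_in, ids_join)
```

Either result can then serve as the all_challenges preload, since only the attempted challenges need to be in the Session.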

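The other way is to use eager loading more specifically. You can load a bunch of users, attempts and challenges with one Query that emits three SELECT statements:

# subqueryload_all() was deprecated in SQLAlchemy 0.9 and later removed; on newer
# versions, chain instead: subqueryload(User.attempts).subqueryload(UserAttempt.challenge)
users = session.query(User).\
              options(subqueryload_all(User.attempts, UserAttempt.challenge)).all()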

or, because UserAttempt->Challenge is many-to-one, a join might be better:

# chain the loaders so that joinedload() applies along the User.attempts path
users = session.query(User).\
                  options(subqueryload(User.attempts).joinedload(UserAttempt.challenge)).all()

or, starting just from UserAttempt:

attempts = session.query(UserAttempt).\
                  options(joinedload(UserAttempt.challenge)).all()
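
The eager-loading route can be sketched end to end as follows. This is a minimal example assuming SQLAlchemy 1.4 or later with an in-memory SQLite database; the table names, sample rows and the statement-recording listener are illustrative assumptions, and the loader options use the chained form.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, event
from sqlalchemy.orm import (
    Session, declarative_base, joinedload, relationship, subqueryload,
)

Base = declarative_base()

class User(Base):
    __tablename__ = "user"  # table names are assumed for this sketch
    id = Column(Integer, primary_key=True)
    attempts = relationship("UserAttempt", backref="user")

class Challenge(Base):
    __tablename__ = "challenge"
    id = Column(Integer, primary_key=True)
    attempts = relationship("UserAttempt", backref="challenge")

class UserAttempt(Base):
    __tablename__ = "user_attempt"
    challenge_id = Column(Integer, ForeignKey("challenge.id"), primary_key=True)
    user_id = Column(Integer, ForeignKey("user.id"), primary_key=True)

engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([User(id=1), User(id=2), Challenge(id=10), Challenge(id=11)])
    session.add_all(
        [UserAttempt(user_id=u, challenge_id=c) for u in (1, 2) for c in (10, 11)]
    )
    session.commit()

# Record every SQL statement sent from here on.
statements = []

@event.listens_for(engine, "before_cursor_execute")
def record(conn, cursor, statement, parameters, context, executemany):
    statements.append(statement)

with Session(engine) as session:
    # Load users, their attempts, and each attempt's challenge up front.
    users = session.query(User).options(
        subqueryload(User.attempts).joinedload(UserAttempt.challenge)
    ).all()
    eager_count = len(statements)

    # Walking the object graph should now need no further SQL.
    pairs = sorted((u.id, a.challenge.id) for u in users for a in u.attempts)
    extra = len(statements) - eager_count

print("statements for the eager load:", eager_count)
print("extra statements while traversing:", extra)
```

Because the options are attached to the query rather than to relationship(), other queries against the same classes keep their default lazy loading.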
zzzeek answered Dec 28 '22 21:12