Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQLAlchemy relationship with secondary table joining behaviour changes between lazy and eager loading

I've been playing with SQL Alchemy for a couple of months now and so far been really impressed with it.

There is one issue I've run into now that seems to be a bug, but I'm not sure that I'm doing the right thing. We use MS SQL here, with table reflection to define the table classes, however I can replicate the problem using an in-memory SQLite database, code for which I have included here.

What I am doing is defining a many to many relationship between two tables using a linking table between them. There is one extra piece of information that the linking table contains which I want to use for filtering the links, requiring the use of a primaryjoin statement on the relationship. This works perfectly for lazy loading, however for performance reasons we need eager loading and thats where it all falls over.

If I define the relationship with lazy loading:

activefunds = relationship('Fund', secondary='fundbenchmarklink',
                           primaryjoin='and_(FundBenchmarkLink.isactive==True,'
                                       'Benchmark.id==FundBenchmarkLink.benchmarkid,'
                                       'Fund.id==FundBenchmarkLink.fundid)')

and query the DB normally:

query = session.query(Benchmark)

The behaviour I need is exactly what I want, though performance is really bad, due to the extra SQL queries when iterating through all of the benchmarks and their respective funds.

If I define the relationship with eager loading:

activefunds = relationship('Fund', secondary='fundbenchmarklink',
                           primaryjoin='and_(FundBenchmarkLink.isactive==True,'
                                       'Benchmark.id==FundBenchmarkLink.benchmarkid,'
                                       'Fund.id==FundBenchmarkLink.fundid)',
                           lazy='joined')

and query the DB normally:

query = session.query(Benchmark)

it blows up in my face:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such column: fund.id 
    [SQL: 'SELECT benchmark.id AS benchmark_id,
                   benchmark.name AS benchmark_name,
                   fund_1.id AS fund_1_id,
                   fund_1.name AS fund_1_name,
                   fund_2.id AS fund_2_id,
                   fund_2.name AS fund_2_name 
            FROM benchmark
            LEFT OUTER JOIN (fundbenchmarklink AS fundbenchmarklink_1
                             JOIN fund AS fund_1 ON fund_1.id = fundbenchmarklink_1.fundid) ON benchmark.id = fundbenchmarklink_1.benchmarkid
            LEFT OUTER JOIN (fundbenchmarklink AS fundbenchmarklink_2
                             JOIN fund AS fund_2 ON fund_2.id = fundbenchmarklink_2.fundid) ON fundbenchmarklink_2.isactive = 1
            AND benchmark.id = fundbenchmarklink_2.benchmarkid
            AND fund.id = fundbenchmarklink_2.fundid']

The SQL above clearly shows the linked table is not being joined before attempting to access columns from it.

If I query the DB, specifically joining the linked table:

query = session.query(Benchmark).join(FundBenchmarkLink, Fund, isouter=True)

It works, however it means I now have to make sure that whenever I query the Benchmark table, I always have to define the join to add both of the extra tables.

Is there something I'm missing, is this a potential bug, or is it simply the way the library works?

Full working sample code to replicate issue:

import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger('sqlalchemy.engine.base').setLevel(logging.INFO)

from sqlalchemy import Column, DateTime, String, Integer, Boolean, ForeignKey, create_engine
from sqlalchemy.orm import relationship, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class FundBenchmarkLink(Base):
    __tablename__ = 'fundbenchmarklink'

    fundid = Column(Integer, ForeignKey('fund.id'), primary_key=True, autoincrement=False)
    benchmarkid = Column(Integer, ForeignKey('benchmark.id'), primary_key=True, autoincrement=False)
    isactive = Column(Boolean, nullable=False, default=True)

    fund = relationship('Fund')
    benchmark = relationship('Benchmark')

    def __repr__(self):
        return "<FundBenchmarkLink(fundid='{}', benchmarkid='{}', isactive='{}')>".format(self.fundid, self.benchmarkid, self.isactive)


class Benchmark(Base):
    __tablename__ = 'benchmark'

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

    funds = relationship('Fund', secondary='fundbenchmarklink', lazy='joined')

    # activefunds has additional filtering on the secondary table, requiring a primaryjoin statement.
    activefunds = relationship('Fund', secondary='fundbenchmarklink',
                               primaryjoin='and_(FundBenchmarkLink.isactive==True,'
                                           'Benchmark.id==FundBenchmarkLink.benchmarkid,'
                                           'Fund.id==FundBenchmarkLink.fundid)',
                               lazy='joined')

    def __repr__(self):
        return "<Benchmark(id='{}', name='{}')>".format(self.id, self.name)


class Fund(Base):
    __tablename__ = 'fund'

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

    def __repr__(self):
        return "<Fund(id='{}', name='{}')>".format(self.id, self.name)


if '__main__' == __name__:
    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    maker = sessionmaker(bind=engine)

    session = maker()

    # Create some data
    for bmkname in ['foo', 'bar', 'baz']:
        bmk = Benchmark(name=bmkname)
        session.add(bmk)

    for fname in ['fund1', 'fund2', 'fund3']:
        fnd = Fund(name=fname)
        session.add(fnd)

    session.add(FundBenchmarkLink(fundid=1, benchmarkid=1))
    session.add(FundBenchmarkLink(fundid=2, benchmarkid=1))
    session.add(FundBenchmarkLink(fundid=1, benchmarkid=2))
    session.add(FundBenchmarkLink(fundid=2, benchmarkid=2, isactive=False))

    session.commit()

    # This code snippet works when activefunds doesn't exist, or doesn't use eager loading
    # query = session.query(Benchmark)
    # print(query)

    # for bmk in query:
    #     print(bmk)
    #     for fund in bmk.funds:
    #         print('\t{}'.format(fund))

    # This code snippet works for activefunds with eager loading
    query = session.query(Benchmark).join(FundBenchmarkLink, Fund, isouter=True)
    print(query)

    for bmk in query:
        print(bmk)
        for fund in bmk.activefunds:
            print('\t{}'.format(fund))
like image 762
Simon Pratt Avatar asked Apr 07 '26 13:04

Simon Pratt


1 Answers

I think you've mixed the primary join and the secondary join a bit. Your primary would seem to contain both at the moment. Remove the predicate for Fund and it should work:

activefunds = relationship(
    'Fund',
    secondary='fundbenchmarklink',
    primaryjoin='and_(FundBenchmarkLink.isactive==True,'
                'Benchmark.id==FundBenchmarkLink.benchmarkid)',
    lazy='joined')

The reason why your explicit join seems to fix the query is that it introduces the table fund before the implicit eager loading joins and so they can refer to it. It's not really a fix, rather than it hides the error. If you really want to use explicit Query.join() with eagerloading, inform the query about it with contains_eager(). Just be careful which relationship you choose as being contained, depending on the query in question; without additional filtering you could fill activefunds with inactive also.

Finally, consider using Query.outerjoin() instead of Query.join(..., isouter=True).

like image 104
Ilja Everilä Avatar answered Apr 09 '26 02:04

Ilja Everilä



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!