Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

New/override SQLAlchemy operator compiler output

The default SQLAlchemy behavior for compiling in_ expressions is pathological for very large lists, and I want to create a custom, faster, compiler for the operator. It doesn't matter to the application if the solution is a new operator (i.e.: in_list_) or if it overrides the default compiler for in_. However, I haven't been able to find any documentation on how to do this specifically.

The subclassing guidelines for compilation extension don't include anything about operators, suggesting that is not the place to start. The documentation on redefining and creating new operators is focused on changing or creating new operator behavior, but the behavior of the operator is not the problem, just the compiler.

Here is a very non-working example of what I'm trying to accomplish:

from sqlalchemy.types import TypeEngine

class in_list_(TypeEngine.Comparator):
  pass

@compiles(in_list_)
def in_list_impl(element, compiler, **kwargs):
  return "IN ('Now', 'I', 'can', 'inline', 'the', 'list')"

And then in an expression:

select([mytable.c.x, mytable.c.y]).where(mytable.c.x.in_list_(long_list))
like image 455
wst Avatar asked Sep 07 '19 00:09

wst


1 Answers

Using IN for very large lists is indeed pathological, and you might be better served using a temporary table and IN against a subquery or a join. But the question was "how to override compiler output for specific operator". In case of binary operators such as IN and NOT IN what you need to override is how SQLAlchemy handles compiling BinaryExpressions:

from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.elements import BinaryExpression
from sqlalchemy.sql.operators import in_op, notin_op

def visit_in_op_binary(compiler, binary, operator, **kw):
    return "%s IN %s" % (
        compiler.process(binary.left, **kw),
        compiler.process(binary.right, **{**kw, "literal_binds": True}))

def visit_notin_op_binary(compiler, binary, operator, **kw):
    return "%s NOT IN %s" % (
        compiler.process(binary.left, **kw),
        compiler.process(binary.right, **{**kw, "literal_binds": True}))

@compiles(BinaryExpression)
def compile_binary(binary, compiler, override_operator=None, **kw):
    operator = override_operator or binary.operator

    if operator is in_op:
        return visit_in_op_binary(
            compiler, binary, operator, override_operator=override_operator,
            **kw)

    if operator is notin_op:
        return visit_notin_op_binary(
            compiler, binary, operator, override_operator=override_operator,
            **kw)

    return compiler.visit_binary(binary, override_operator=override_operator, **kw)

Note that simply producing the binary expression containing the grouping and clause list of bind params takes a surprisingly lot of time for very large lists, not to mention compiling all that even if using literal binds, so you might not observe significant performance gains. On the other hand many implementations have limits on how many placeholders / parameters you can use in a statement, and so inlining the binds allows such queries to run at all.

If on the other hand your list does fit within the limits set by your implementation (Postgresql seems to be only limited by available RAM), you may not need any compiler workarounds with a recent enough SQLAlchemy; use expanding bind parameters instead:

In [15]: %%time
    ...: session.query(Foo).\
    ...:     filter(Foo.data.in_(range(250000))).\
    ...:     all()
    ...: 
CPU times: user 5.09 s, sys: 91.9 ms, total: 5.18 s
Wall time: 5.18 s
Out[15]: []

In [16]: %%time
    ...: session.query(Foo).\
    ...:     filter(Foo.data.in_(bindparam('xs', range(250000), expanding=True))).\
    ...:     all()
    ...: 
CPU times: user 310 ms, sys: 8.05 ms, total: 318 ms
Wall time: 317 ms
Out[16]: []

And as mentioned in comments, in version 1.4 an expanding bindparam will support literal execution out of the box:

In [4]: session.query(Foo).\
   ...:     filter(Foo.data.in_(
   ...:         bindparam('xs', range(10), expanding=True, literal_execute=True))).\
   ...:     all()
2019-09-07 20:35:04,560 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2019-09-07 20:35:04,561 INFO sqlalchemy.engine.base.Engine SELECT foo.id AS foo_id, foo.data AS foo_data 
FROM foo 
WHERE foo.data IN (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
2019-09-07 20:35:04,561 INFO sqlalchemy.engine.base.Engine ()
Out[4]: []
like image 145
Ilja Everilä Avatar answered Nov 11 '22 23:11

Ilja Everilä