I found out that SQLAlchemy translates
db.query(...).filter(A.id.in_(ids))
into
SELECT ...
FROM a
WHERE a.id != a.id
if ids
is empty. That results in sequential scan on a
table, which is obviously disastrous for performance.
The first question is: why? Why not just 1 = 0
or anything that does not require sequential scan?
The second, more important: is is there a commonly used workaround (other than if
near every in_
)?
I guess that in_
cannot be easily reimplemented to cover all cases without causing that problem, but I cannot be the first to face it and there may be some solution covering simple, common use cases of in_
.
SQLAlchemy logs a warning every time it happens:
"The IN-predicate on 'foo.bar' was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance."
(This is now mainly of historical interest, as SQLAlchemy fixed this bug long ago).
To answer the OP's question of "why", here is the FAQ entry (which I always find surprisingly difficult to find):
Why does
.col.in_([])
producecol != col
? Why not1=0
?A little introduction to the issue. The
IN
operator in SQL, given a list of elements to compare against a column, generally does not accept an empty list, that is while it is valid to say:column IN (1, 2, 3)
it’s not valid to say:
column IN ()
SQLAlchemy's
Operators.in_()
operator, when given an empty list, produces this expression:column != column
As of version 0.6, it also produces a warning stating that a less efficient comparison operation will be rendered. This expression is the only one that is both database agnostic and produces correct results.
For example, the naive approach of "just evaluate to false, by comparing 1=0 or 1!=1", does not handle nulls properly. An expression like:
NOT column != column
will not return a row when
column IS NULL
, but an expression which does not take the column into account, such as:NOT 1=0
will return a row.
As shown in this post, you may be able to use the ANY function to avoid this, since it's syntactically valid even for an empty list (but not supported on SQLite apparently). It's probably faster for large lists too, since it does less string mangling to build the query.
The performance issue with the in_
operator has recently been fixed and the fix will probably be in SQLAlchemy 1.2.0.
I'm using:
if len(ids) > 0:
db.query(...).where(A.id.in_(ids))
else:
db.query(...).where(False)
I tried a .limit(0)
instead of the .where(false)
without success. There is some behind-the-scenes difference in the empty querysets that broke other stuff down the pipeline. This workaround, while could be faster, at least avoids your mentioned warning.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With