I have the following query optimization problem in Spanner, and hoping there's a trick I'm missing that will help me bend the query planner to my will.
Here's the simplified schema:
create table T0 (
key0 int64 not null,
value int64,
other int64 not null,
) primary key (key0);
create table T1 {
key1 int64 not null,
other int64 not null
} primary key (key1);
And a query with a subquery in an IN
clause:
select value from T0 t0
where t0.other in (
select t1.other from T1 t1 where t1.key1 in (42, 43, 44) -- note: this subquery is a good deal more complex than this
)
Which produces a 10 element set, via a hash join of T0 against the output of the subquery:
Operator Rows Executions
----------------------- ----- ----------
Serialize Result 10 1
Hash Join 10 1
Distributed union 10000 1
Local distributed union 10000 1
Table Scan: T0 10000 1
Distributed cross apply: 5 1
...lots moar T1 subquery stuff...
Note that, while the subquery is complex, it actually produces a very small set. Unfortunately, it also scans the entirety of T1 to feed to the hash join, which is very slow.
However, if I take the output of the subquery on T1 and manually shove it into the IN
clause:
select value from T0
where other in (5, 6, 7, 8, 9) -- presume this `IN` clause to be the output of the above subquery
It is dramatically faster, presumably because it just hits T0's index once per entry, not using a hash join on the full contents:
Operator Rows Executions
----------------------- ---- ----------
Distributed union 10 1
Local distributed union 10 1
Serialize Result 10 1
Filter 10 1
Index Scan: 10 1
I could simply run two queries, and that's my best plan so far. But I'm hoping I can find some way to cajole Spanner into deciding that this is what it ought to do with the output of the subquery in the first example. I've tried everything I can think of, but this may simply not be expressible in SQL at all.
Also: I haven't quite proven this yet, but in some cases I fear that the 10 element subquery output could blow up to a few thousand elements (T1 will grow more or less without bound, easily to millions). I've manually tested with a few hundred elements in the splatted-out IN
clause and it seems to perform acceptably, but I'm a little concerned it could get out of hand.
Note that I also tried a join on the subquery, like so:
select t0.other from T0 t0
join (
-- Yes, this could be a simple join rather than a subquery, but in practice it's complex
-- enough that it can't be expressed that way.
select t1.other from T1 t1 where t1.key = 42
) sub on sub.other = t0.other
But it did something truly horrifying in the query planner, that I won't even try to explain here.
As described in Query execution plans, Cloud Spanner's SQL compiler transforms a SQL statement into a query execution plan, which is used to obtain the results of the query. This page describes best practices for constructing SQL statements to help Cloud Spanner find efficient execution plans.
When you use SQL to query multiple tables, Cloud Spanner automatically uses a join method that is likely to make the query more efficient. However, Google advises you to test with different join algorithms. Choosing the right join algorithm can improve latency, memory consumption, or both.
Tl;dr: Avoid joins on large tables and evaluate parts of queries beforehand to get 100–10,000x performance gains! As mentioned in a previous post, because of some of our tables growing in size, our queries started performing poorly which resulted in a performance hit to our most used APIs.
If you want to avoid Cloud Spanner from having to fetch any rows from the base table at all, you can optionally store a copy of the FirstName column in the index itself: Using a STORING clause like this costs extra storage but it provides the following advantages for queries and read calls using the index:
Does your actual subquery in the IN
clause use any variables from T0
?
If not, what happens if you try your join query with the tables reordered (and a distinct added for correctness, unless you know that the values will be distinct)?
SELECT t0.other FROM (
-- Yes, this could be a simple join rather than a subquery, but in practice it's complex
-- enough that it can't be expressed that way.
SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key = 42
) sub
JOIN T0 t0
ON sub.other = t0.other
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With