I have the following query optimization problem in Spanner, and hoping there's a trick I'm missing that will help me bend the query planner to my will. Here's the simplified schema: <pre class="prettyprint"><code>create table T0 ( key0 int64 not null, value int64, other int64 not null, ) primary key (key0); create table T1 { key1 int64 not null, other int64 not null } primary key (key1); </code></pre> And a query with a subquery in an <code>IN</code> clause: <pre class="prettyprint"><code>select value from T0 t0 where t0.other in ( select t1.other from T1 t1 where t1.key1 in (42, 43, 44) -- note: this subquery is a good deal more complex than this ) </code></pre> Which produces a 10 element set, via a hash join of T0 against the output of the subquery: <pre class="prettyprint"><code>Operator Rows Executions ----------------------- ----- ---------- Serialize Result 10 1 Hash Join 10 1 Distributed union 10000 1 Local distributed union 10000 1 Table Scan: T0 10000 1 Distributed cross apply: 5 1 ...lots moar T1 subquery stuff... </code></pre> Note that, while the subquery is complex, it actually produces a very small set. Unfortunately, it also scans the entirety of T1 to feed to the hash join, which is very slow. However, if I take the output of the subquery on T1 and manually shove it into the <code>IN</code> clause: <pre class="prettyprint"><code>select value from T0 where other in (5, 6, 7, 8, 9) -- presume this `IN` clause to be the output of the above subquery </code></pre> It is dramatically faster, presumably because it just hits T0's index once per entry, not using a hash join on the full contents: <pre class="prettyprint"><code>Operator Rows Executions ----------------------- ---- ---------- Distributed union 10 1 Local distributed union 10 1 Serialize Result 10 1 Filter 10 1 Index Scan: 10 1 </code></pre> I could simply run two queries, and that's my best plan so far. But I'm hoping I can find some way to cajole Spanner into deciding that this is what it ought to do with the output of the subquery in the first example. I've tried everything I can think of, but this may simply not be expressible in SQL at all. Also: I haven't quite proven this yet, but in some cases I fear that the 10 element subquery output could blow up to a few thousand elements (T1 will grow more or less without bound, easily to millions). I've manually tested with a few hundred elements in the splatted-out <code>IN</code> clause and it seems to perform acceptably, but I'm a little concerned it could get out of hand. Note that I also tried a join on the subquery, like so: <pre class="prettyprint"><code>select t0.other from T0 t0 join ( -- Yes, this could be a simple join rather than a subquery, but in practice it's complex -- enough that it can't be expressed that way. select t1.other from T1 t1 where t1.key = 42 ) sub on sub.other = t0.other </code></pre> But it did something truly horrifying in the query planner, that I won't even try to explain here.

Does your actual subquery in the <code>IN</code> clause use any variables from <code>T0</code>? If not, what happens if you try your join query with the tables reordered (and a distinct added for correctness, unless you know that the values will be distinct)? <pre class="prettyprint"><code>SELECT t0.other FROM ( -- Yes, this could be a simple join rather than a subquery, but in practice it's complex -- enough that it can't be expressed that way. SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key = 42 ) sub JOIN T0 t0 ON sub.other = t0.other </code></pre>

Avoiding hash-join with IN clause and sub-query in Spanner

I have the following query optimization problem in Spanner, and hoping there's a trick I'm missing that will help me bend the query planner to my will.

Here's the simplified schema:

create table T0 (
  key0  int64 not null,
  value int64,
  other int64 not null,
) primary key (key0);

create table T1 {
  key1  int64 not null,
  other int64 not null
} primary key (key1);

And a query with a subquery in an IN clause:

select value from T0 t0
where t0.other in (
  select t1.other from T1 t1 where t1.key1 in (42, 43, 44)  -- note: this subquery is a good deal more complex than this
)

Which produces a 10 element set, via a hash join of T0 against the output of the subquery:

Operator                     Rows  Executions
-----------------------      ----- ----------
Serialize Result               10          1
Hash Join                      10          1
  Distributed union         10000          1
    Local distributed union 10000          1
    Table Scan: T0          10000          1
  Distributed cross apply:      5          1
   ...lots moar T1 subquery stuff...

Note that, while the subquery is complex, it actually produces a very small set. Unfortunately, it also scans the entirety of T1 to feed to the hash join, which is very slow.

However, if I take the output of the subquery on T1 and manually shove it into the IN clause:

select value from T0
where other in (5, 6, 7, 8, 9)  -- presume this `IN` clause to be the output of the above subquery

It is dramatically faster, presumably because it just hits T0's index once per entry, not using a hash join on the full contents:

Operator                Rows Executions
----------------------- ---- ----------
Distributed union         10          1
Local distributed union   10          1
Serialize Result          10          1
Filter                    10          1
Index Scan:               10          1

I could simply run two queries, and that's my best plan so far. But I'm hoping I can find some way to cajole Spanner into deciding that this is what it ought to do with the output of the subquery in the first example. I've tried everything I can think of, but this may simply not be expressible in SQL at all.

Also: I haven't quite proven this yet, but in some cases I fear that the 10 element subquery output could blow up to a few thousand elements (T1 will grow more or less without bound, easily to millions). I've manually tested with a few hundred elements in the splatted-out IN clause and it seems to perform acceptably, but I'm a little concerned it could get out of hand.

Note that I also tried a join on the subquery, like so:

select t0.other from T0 t0
join (
  -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
  -- enough that it can't be expressed that way.
  select t1.other from T1 t1 where t1.key = 42
) sub on sub.other = t0.other

But it did something truly horrifying in the query planner, that I won't even try to explain here.

How does Cloud Spanner handle SQL statements?

As described in Query execution plans, Cloud Spanner's SQL compiler transforms a SQL statement into a query execution plan, which is used to obtain the results of the query. This page describes best practices for constructing SQL statements to help Cloud Spanner find efficient execution plans.

How do I Make my Cloud Spanner queries more efficient?

When you use SQL to query multiple tables, Cloud Spanner automatically uses a join method that is likely to make the query more efficient. However, Google advises you to test with different join algorithms. Choosing the right join algorithm can improve latency, memory consumption, or both.

Should you avoid joins on large tables?

Tl;dr: Avoid joins on large tables and evaluate parts of queries beforehand to get 100–10,000x performance gains! As mentioned in a previous post, because of some of our tables growing in size, our queries started performing poorly which resulted in a performance hit to our most used APIs.

How can I avoid Cloud Spanner from fetching rows from the base table?

If you want to avoid Cloud Spanner from having to fetch any rows from the base table at all, you can optionally store a copy of the FirstName column in the index itself: Using a STORING clause like this costs extra storage but it provides the following advantages for queries and read calls using the index:

Does your actual subquery in the IN clause use any variables from T0? If not, what happens if you try your join query with the tables reordered (and a distinct added for correctness, unless you know that the values will be distinct)?

SELECT t0.other FROM  (
      -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
      -- enough that it can't be expressed that way.
      SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key = 42
    ) sub 
JOIN T0 t0
ON sub.other = t0.other

Avoiding hash-join with IN clause and sub-query in Spanner

Tags:

sql

google-cloud-platform

google-cloud-spanner

Joel Webber

People also ask

1 Answers

Mike Curtiss

Recent Activity

Donate For Us

Avoiding hash-join with IN clause and sub-query in Spanner

Tags:

sql

google-cloud-platform

google-cloud-spanner

Joel Webber

People also ask

1 Answers

Mike Curtiss

Related questions

Recent Activity

Donate For Us