Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Procedurally transform subquery into join

Is there a generalized procedure or algorithm for transforming a SQL subquery into a join, or vice versa? That is, is there a set of typographic operations that can be applied to a syntactically correct SQL query statement containing a subquery that results in a functionally equivalent statement without a subquery? If so, what are they (i.e., what's the algorithm), and in what cases do they not apply?

like image 966
Justin R. Avatar asked Nov 20 '09 19:11

Justin R.


5 Answers

Converting a subquery into a JOIN can be pretty straightforward:

IN clause

 FROM TABLE_X x
WHERE x.col IN (SELECT y.col FROM TABLE_Y y)

...can be converted to:

FROM TABLE_X x
JOIN TABLE_Y y ON y.col = x.col

Your JOIN criteria is where you have direct comparison.

EXISTS clause

But there are complications when you look at the EXISTS clause. EXISTS are typically correllated, where the subquery is filtered by criteria from the table(s) outside the subquery. But the EXISTS is only for returning a boolean based on the criteria.

 FROM TABLE_X x
WHERE EXISTS (SELECT NULL
                FROM TABLE_Y y
               WHERE y.col = x.col)

...converted:

FROM TABLE_X x
JOIN TABLE_Y y ON y.col = x.col

Because of the boolean, there's a risk of more rows turning up in the resultset.

SELECTs in the SELECT clause

These should always be changed, with prejudice:

SELECT x.*,
       (SELECT MAX(y.example_col)
          FROM TABLE_Y y
         WHERE y.col = x.col)
  FROM TABLE_X x

You're probably noticing a patter now, but I made this a little different for an inline view example:

SELECT x.*,
       z.mc
  FROM TABLE_X x
  JOIN (SELECT y.col, --inline view within the brackets
               MAX(y.example_col) 'mc'
          FROM TABLE_Y y
      GROUP BY y.col) z ON z.col = x.col

The key is making sure the inline view resultset includes the column(s) needed to join to, along with the columns.

LEFT JOINs

You might've noticed I didn't have any LEFT JOIN examples - this would only be necessary if columns from the subquery use NULL testing (COALESCE on almost any db these days, Oracle's NVL or NVL2, MySQLs IFNULL, SQL Server's ISNULL, etc...):

SELECT x.*,
       COALESCE((SELECT MAX(y.example_col)
          FROM TABLE_Y y
         WHERE y.col = x.col), 0)
  FROM TABLE_X x

Converted:

   SELECT x.*,
          COALESCE(z.mc, 0)
     FROM TABLE_X x
LEFT JOIN (SELECT y.col,
                  MAX(y.example_col) 'mc'
             FROM TABLE_Y y
         GROUP BY y.col) z ON z.col = x.col

Conclusion

I'm not sure if that will satisfy your typographic needs, but hope I've demonstrated that the key is determining what the JOIN criteria is. Once you know the column(s) involved, you know the table(s) involved.

like image 200
OMG Ponies Avatar answered Oct 18 '22 02:10

OMG Ponies


This question relies on a basic knowledge of Relational Algebra. You need to ask yourself what kind of join is being performed. For example, an LEFT ANTI SEMI JOIN is like a WHERE NOT EXISTS clause.

Some joins do not allow duplicating of data, some do not allow eliminating data. Others allow extra fields to be available. I discuss this in my blog at http://msmvps.com/blogs/robfarley/archive/2008/11/09/join-simplification-in-sql-server.aspx

Also, please don't feel you need to do everything in JOINs. The Query Optimizer should take care of all of this for you, and you can often make your queries much harder to maintain this way. You may find yourself using an extensive GROUP BY clause, and having interesting WHERE .. IS NULL filters, which will only serve to disconnect the business logic from the query design.

A subquery in the SELECT clause (essentially a lookup) only provides an extra field, not duplication or elimination. Therefore, you would need to make sure that you enforce GROUP BY or DISTINCT values in your JOIN, and use an OUTER JOIN to guarantee that behaviour is the same.

A subquery in the WHERE clause can never duplicate data, or provide extra columns to the SELECT clause, so you should use GROUP BY / DISTINCT to check this. WHERE EXISTS is similar. (This the LEFT SEMI JOIN)

WHERE NOT EXISTS (LEFT ANTI SEMI JOIN) doesn't provide data, and doesn't duplicate rows, but can eliminate... for this you need to do LEFT JOINs and look for NULLs.

But the Query Optimizer should handle all this for you. I actually like having occasional subqueries in the SELECT clause, because it makes it very clear that I am not duplicating or eliminating rows. The QO can tidy it for me, but if I'm using a view or inline table-valued function, I want to make it clear to those who come after me that the QO can simplify it down a lot. Have a look at the Execution Plans of your original query, and you'll see that the system is providing the INNER/OUTER/SEMI joins for you.

The thing you really need to be avoiding (at least in SQL Server) are functions that use BEGIN and END (such as Scalar Functions). They may feel like they simplify your code, but they will actually be executed in a separate context, as the system sees them as procedural (not simplifiable).

I did a session on this kind of thing at the recent SQLBits V conference. It was recorded, so you should be able to watch it at some point (if you can put up with my jokes!)

like image 25
Rob Farley Avatar answered Oct 18 '22 02:10

Rob Farley


It often is possible, and what's good is that the query optimizer can do it automatically, so you don't have to care about it.

like image 40
erikkallen Avatar answered Oct 18 '22 01:10

erikkallen


At a really high level. to transform a sub-query to a JOIN:

  1. FROM: Table Names go into FROM
    • JOIN The parts of the WHERE clause with table names on both sides determine (a) the type of JOIN (b) the condition of join
    • WHERE The parts of the where clause without table names on both sides go into the WHERE clause
    • SELECT Column names from Sub-Query go into the SELECT

Transforming a JOIN to Sub-Query entails the reverse of the above logic

like image 34
Raj More Avatar answered Oct 18 '22 01:10

Raj More


In SQL Server, at least, the optimizer can do this at will, but I'm sure that there are constraints on when it does it. I'm sure that it was probably someone's PhD thesis to be able to do it in the computer.

When I do it the old fashioned human way, it's fairly straightforward - particularly if the subquery is already aliased - it can be pulled into a Common Table Expression first.

like image 38
Cade Roux Avatar answered Oct 18 '22 03:10

Cade Roux