SQL JOIN Query to return rows where we did NOT find a match in joined table

Tags: sql, join, mysql

This is more of a theory/logic question, but I have two tables: links and options. Links is a table where I add rows that represent a link between a product ID (in a separate products table) and an option. The options table holds all available options.

What I'm trying to do (but struggling to create the logic for) is to join the two tables, returning only the rows where there is no option link in the links table, therefore representing which options are still available to add to the product.

Is there a feature of SQL that might help me here? I'm not tremendously experienced with SQL yet.

twistedpixel asked Apr 09 '14 23:04


3 Answers

Your table design sounds fine.

If this query returns the id values of the "options" linked to a particular "product"...

SELECT k.option_id
  FROM links k
 WHERE k.product_id = 'foo'

Then this query would get the details of all the options related to the "product":

SELECT o.id
     , o.name
  FROM options o
  JOIN links k
    ON k.option_id = o.id
 WHERE k.product_id = 'foo'

Note that we can actually move the "product_id='foo'" predicate from the WHERE clause to the ON clause of the JOIN, for an equivalent result, e.g.

SELECT o.id
     , o.name
  FROM options o
  JOIN links k
    ON k.option_id = o.id
   AND k.product_id = 'foo'

(It makes no difference here, but it would make a difference if we were using an OUTER JOIN: a predicate on the outer-joined table in the WHERE clause would negate the "outer-ness" of the join and make it equivalent to an INNER JOIN.)
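
To see that difference concretely, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; the join semantics shown are standard SQL). The column names come from the queries above; the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    -- Hypothetical sample data
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('bar', 2);
""")

# Predicate in the ON clause: every option is kept, and options
# without a 'foo' link come back with NULL in the links columns.
on_rows = conn.execute("""
    SELECT o.id, k.product_id
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
       AND k.product_id = 'foo'
     ORDER BY o.id
""").fetchall()

# Predicate in the WHERE clause: the NULL-extended rows fail the
# comparison, so the result matches what an INNER JOIN would return.
where_rows = conn.execute("""
    SELECT o.id, k.product_id
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
     WHERE k.product_id = 'foo'
     ORDER BY o.id
""").fetchall()

print(on_rows)
print(where_rows)
```

With the predicate in the ON clause all three options survive; with it in the WHERE clause only the linked option does.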

But, none of that answers your question, it only sets the stage for answering your question:

How do we get the rows from "options" that are NOT linked to a particular product?

The most efficient approach is (usually) an anti-join pattern.

With that pattern, we get all the rows from "options", along with any matching rows from "links" (for a particular product_id, in your case). That result set will include the rows from "options" that don't have a matching row in "links".

The "trick" is to filter out all the rows that had matching row(s) found in "links". That will leave us with only the rows that didn't have a match.

The way we filter those rows is with a predicate in the WHERE clause that checks whether a match was found. We do that by checking a column that we know for certain will be NOT NULL if a matching row was found, and that we know for certain will be NULL if a matching row was NOT found.

Something like this:

SELECT o.id
     , o.name
  FROM options o
  LEFT
  JOIN links k
    ON k.option_id = o.id
   AND k.product_id = 'foo'
 WHERE k.option_id IS NULL

The "LEFT" keyword specifies an "outer" join operation: we get all the rows from "options" (the table on the "left" side of the JOIN) even if a matching row is not found. (A normal inner join would filter out rows that didn't have a match.)

The "trick" is in the WHERE clause: if we found a matching row from links, we know that the "option_id" column returned from "links" would not be NULL. It can't be NULL if it "equals" something, and we know it had to "equal" something because of the predicate in the ON clause.

So, we know that the rows from options that didn't have a match will have a NULL value for that column.

It takes a bit to get your brain wrapped around it, but the anti-join quickly becomes a familiar pattern.
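
If it helps to watch the pattern run, here is a minimal sketch that wraps the anti-join in a small function. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption); the function name `unlinked_options` and the sample rows are hypothetical, and the product id is passed as a bound parameter instead of the literal 'foo'.

```python
import sqlite3

def unlinked_options(conn, product_id):
    """Return (id, name) for every option NOT linked to product_id."""
    return conn.execute("""
        SELECT o.id, o.name
          FROM options o
          LEFT JOIN links k
            ON k.option_id = o.id
           AND k.product_id = ?
         WHERE k.option_id IS NULL   -- keep only rows with no match
         ORDER BY o.id
    """, (product_id,)).fetchall()

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    -- Hypothetical sample data
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('foo', 3), ('bar', 2);
""")

print(unlinked_options(conn, "foo"))   # options still available for 'foo'
print(unlinked_options(conn, "baz"))   # a product with no links at all
```

A product with no rows in links gets every option back, which is what you want for a "still available to add" list.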


The "anti-join" pattern isn't the only way to get the result set. There are a couple of other approaches.

One option is to use a query with a "NOT EXISTS" predicate with a correlated subquery. This is somewhat easier to understand, but doesn't usually perform as well:

SELECT o.id
     , o.name
  FROM options o
 WHERE NOT EXISTS ( SELECT 1
                      FROM links k
                     WHERE k.option_id = o.id
                       AND k.product_id = 'foo'
                  )

That says: get me all rows from the options table, but for each row, run a query and keep the row only if no matching row "exists" in the links table. (It doesn't matter what is returned in the select list; we're only testing whether it returns at least one row. I use a "1" in the select list to remind me I'm looking for "1 row".)

This usually doesn't perform as well as the anti-join, but sometimes it does run faster, especially if other predicates in the WHERE clause of the outer query filter out nearly every row, and the subquery only has to run for a couple of rows. (That is, when we only have to check a few needles in a haystack. When we need to process the whole stack of hay, the anti-join pattern is usually faster.)
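
Whichever one turns out faster for you, the two forms should return the same rows. A quick sanity check, sketched with Python's sqlite3 module as a stand-in for MySQL (an assumption) and made-up sample rows:

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    -- Hypothetical sample data
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('foo', 3), ('bar', 2);
""")

ANTI_JOIN = """
    SELECT o.id, o.name
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
       AND k.product_id = ?
     WHERE k.option_id IS NULL
     ORDER BY o.id
"""

NOT_EXISTS = """
    SELECT o.id, o.name
      FROM options o
     WHERE NOT EXISTS ( SELECT 1
                          FROM links k
                         WHERE k.option_id = o.id
                           AND k.product_id = ?
                      )
     ORDER BY o.id
"""

anti_join_rows = conn.execute(ANTI_JOIN, ("foo",)).fetchall()
not_exists_rows = conn.execute(NOT_EXISTS, ("foo",)).fetchall()

print(anti_join_rows)
print(anti_join_rows == not_exists_rows)
```

This only checks that the results agree; it says nothing about relative speed, which depends on the database engine, the data, and the indexes.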

And the beginner query you're most likely to see is a NOT IN (subquery). I'm not even going to give an example of that. If you've got a list of literals, then by all means, use a NOT IN. But with a subquery, it's rarely the best performer, though it does seem to be the easiest to understand.

Oh, what the hay, I'll give a demo of that as well (not that I'm encouraging you to do it this way):

SELECT o.id
     , o.name
  FROM options o
 WHERE o.id NOT IN ( SELECT k.option_id
                       FROM links k
                      WHERE k.product_id = 'foo'
                        AND k.option_id IS NOT NULL
                      GROUP BY k.option_id
                   )

That subquery (inside the parens) gets a list of all the option_id values associated with a product.

Now, for each row in options (in the outer query), we can check the id value to see if it's in that list returned by the subquery.

If we have a guarantee that option_id will never be NULL, we can omit the predicate that tests for "option_id IS NOT NULL". In the more general case, when a NULL creeps into the resultset, the outer query can't tell whether o.id is in the list or not, and the query doesn't return any rows; so I usually include that predicate, even when it's not required. The GROUP BY isn't strictly necessary either, especially if there's a unique constraint (guaranteed uniqueness) on the (product_id, option_id) tuple.
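
The NULL behavior is easy to reproduce. Here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL (an assumption; the three-valued logic involved is standard SQL). The sample rows are hypothetical and deliberately include a link whose option_id is NULL.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    -- Hypothetical sample data, including one NULL option_id
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('foo', NULL);
""")

# Without the NULL guard: "o.id NOT IN (1, NULL)" is never true,
# so no rows come back at all.
unguarded_rows = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE o.id NOT IN ( SELECT k.option_id
                           FROM links k
                          WHERE k.product_id = 'foo'
                       )
     ORDER BY o.id
""").fetchall()

# With the guard: the NULL is excluded from the list and the
# unlinked options are returned as expected.
guarded_rows = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE o.id NOT IN ( SELECT k.option_id
                           FROM links k
                          WHERE k.product_id = 'foo'
                            AND k.option_id IS NOT NULL
                       )
     ORDER BY o.id
""").fetchall()

print(unguarded_rows)
print(guarded_rows)
```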

But, again, don't use that NOT IN (subquery), except for testing, unless there's some compelling reason to (for example, it manages to perform better than the anti-join.)

You're unlikely to notice any performance differences with small sets; the overhead of transmitting the statement, parsing it, generating an access plan, and returning results dwarfs the actual "execution" time of the plan. It's with larger sets that the differences in "execution" time become apparent.

EXPLAIN SELECT ... is a really good way to get a handle on the execution plans, to see what MySQL is really doing with your statement.

Appropriate indexes, especially covering indexes, can noticeably improve performance of some statements.
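
As a rough illustration of checking a plan after adding an index, here is a sketch using Python's sqlite3 module. Note the assumptions: SQLite's EXPLAIN QUERY PLAN stands in for MySQL's EXPLAIN (the output format is entirely different), and the index name `idx_links_product_option` is hypothetical. The idea carries over: ask for the plan, then look for the index you expect to be used.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    -- Hypothetical index covering both columns used by the join
    CREATE INDEX idx_links_product_option
        ON links (product_id, option_id);
""")

query = """
    SELECT o.id, o.name
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
       AND k.product_id = 'foo'
     WHERE k.option_id IS NULL
"""

# The human-readable description is the last column of each plan row.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_text = "\n".join(str(row[-1]) for row in plan)

print(plan_text)
```

In MySQL you would run EXPLAIN SELECT ... directly and read the key and Extra columns instead.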

spencer7593 answered Nov 16 '22 08:11


Yes, you can do a LEFT JOIN from options to links (this works in MySQL; other dialects have variations), which will include the rows in options that do NOT have a match in links. Then test whether links.someColumn IS NULL and you will have exactly the rows in options that had no "matching" row in links.

RobP answered Nov 16 '22 07:11


Try something along the lines of this:

To count:

-- Count, per link, the Links rows with no matching row in Options
SELECT Links.linkId, COUNT(*)
    FROM Links
    LEFT JOIN Options ON Links.optionId = Options.optionId
    WHERE Options.optionId IS NULL
    GROUP BY Links.linkId

To see the rows:

-- List the Links rows with no matching row in Options
SELECT Links.linkId
    FROM Links
    LEFT JOIN Options ON Links.optionId = Options.optionId
    WHERE Options.optionId IS NULL

Achilles answered Nov 16 '22 07:11