More of a theory/logic question, but what I have is two tables: links and options. links is a table where I add rows that represent a link between a product ID (in a separate products table) and an option. The options table holds all available options.
What I'm trying to do (but struggling to create the logic for) is to join the two tables, returning only the rows where there is no option link in the links table, therefore representing which options are still available to add to the product.
Is there a feature of SQL that might help me here? I'm not tremendously experienced with SQL yet.
Your table design sounds fine.
If this query returns the id values of the "options" linked to a particular "product"...
SELECT k.option_id
FROM links k
WHERE k.product_id = 'foo'
Then this query would get the details of all the options related to the "product":
SELECT o.id
, o.name
FROM options o
JOIN links k
ON k.option_id = o.id
WHERE k.product_id = 'foo'
Note that we can actually move the "product_id='foo'" predicate from the WHERE clause to the ON clause of the JOIN, for an equivalent result, e.g.
SELECT o.id
, o.name
FROM options o
JOIN links k
ON k.option_id = o.id
AND k.product_id = 'foo'
(Not that it makes any difference here, but it would make a difference if we were using an OUTER JOIN: in the WHERE clause, the predicate would negate the "outer-ness" of the join and make it equivalent to an INNER JOIN.)
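To see that difference concretely, here is a small sketch of the ON-versus-WHERE behaviour with a LEFT JOIN. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and the column types and sample rows are invented for illustration; the join semantics themselves are standard SQL.

```python
import sqlite3

# Throwaway in-memory database; the schema and rows are assumptions for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE links (product_id TEXT NOT NULL, option_id INTEGER NOT NULL);
    INSERT INTO options (id, name) VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links (product_id, option_id) VALUES ('foo', 1), ('bar', 2);
""")

# Predicate in the ON clause: every row from options survives the join,
# with NULL in k.option_id where no link for 'foo' was found.
on_rows = conn.execute("""
    SELECT o.id, k.option_id
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
       AND k.product_id = 'foo'
     ORDER BY o.id
""").fetchall()

# Predicate in the WHERE clause: rows where k.product_id is NULL (or not 'foo')
# are discarded, so the result matches an INNER JOIN.
where_rows = conn.execute("""
    SELECT o.id, k.option_id
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
     WHERE k.product_id = 'foo'
     ORDER BY o.id
""").fetchall()

print(on_rows)     # [(1, 1), (2, None), (3, None)]
print(where_rows)  # [(1, 1)]
```

The unmatched rows (the ones with None) are exactly what the WHERE version throws away.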
But, none of that answers your question, it only sets the stage for answering your question:
How do we get the rows from "options" that are NOT linked to a particular product?
The most efficient approach is (usually) an anti-join pattern.
The idea is that we get all the rows from "options", along with any matching rows from "links" (for a particular product_id, in your case). That result set will include the rows from "options" that don't have a matching row in "links".
The "trick" is to filter out all the rows that had matching row(s) found in "links". That will leave us with only the rows that didn't have a match.
The way we filter those rows is with a predicate in the WHERE clause that checks whether a match was found. We do that by checking a column that we know for certain will be NOT NULL if a matching row was found, and that we know for certain will be NULL if a matching row was NOT found.
Something like this:
SELECT o.id
, o.name
FROM options o
LEFT
JOIN links k
ON k.option_id = o.id
AND k.product_id = 'foo'
WHERE k.option_id IS NULL
The "LEFT" keyword specifies an "outer" join operation: we get all the rows from "options" (the table on the "left" side of the JOIN) even if a matching row is not found. (A normal inner join would filter out rows that didn't have a match.)
The "trick" is in the WHERE clause... if we found a matching row from links, we know that the "option_id" column returned from "links" would not be NULL. It can't be NULL if it "equals" something, and we know it had to "equal" something because of the predicate in the ON clause.
So, we know that the rows from options that didn't have a match will have a NULL value for that column.
It takes a bit to get your brain wrapped around it, but the anti-join quickly becomes a familiar pattern.
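If it helps to watch the pattern run, here is a minimal sketch of the anti-join against toy data. It assumes Python's built-in sqlite3 module as a stand-in for MySQL; the helper name available_options, the column types, and the sample rows are all made up for illustration.

```python
import sqlite3

def available_options(conn, product_id):
    """Return (id, name) for options NOT yet linked to product_id (anti-join)."""
    return conn.execute("""
        SELECT o.id, o.name
          FROM options o
          LEFT JOIN links k
            ON k.option_id = o.id
           AND k.product_id = ?
         WHERE k.option_id IS NULL   -- keep only rows where no link was found
         ORDER BY o.id
    """, (product_id,)).fetchall()

# Throwaway in-memory database; the schema and rows are assumptions for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE links (product_id TEXT NOT NULL, option_id INTEGER NOT NULL);
    INSERT INTO options (id, name) VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links (product_id, option_id) VALUES ('foo', 1), ('bar', 2);
""")

print(available_options(conn, "foo"))  # [(2, 'green'), (3, 'blue')]
print(available_options(conn, "bar"))  # [(1, 'red'), (3, 'blue')]
```

Note that 'green' still shows up as available for 'foo' even though it is linked to 'bar'; that is the job of the product_id predicate in the ON clause.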
The "anti-join" pattern isn't the only way to get the result set. There are a couple of other approaches.
One option is to use a query with a "NOT EXISTS" predicate with a correlated subquery. This is somewhat easier to understand, but doesn't usually perform as well:
SELECT o.id
, o.name
FROM options o
WHERE NOT EXISTS ( SELECT 1
FROM links k
WHERE k.option_id = o.id
AND k.product_id = 'foo'
)
That says: get me all rows from the options table, but for each row, run a query and see if a matching row "exists" in the links table. (It doesn't matter what is returned in the select list; we're only testing whether it returns at least one row. I use a "1" in the select list to remind me I'm looking for "1 row".)
This usually doesn't perform as well as the anti-join, but sometimes it does run faster, especially if other predicates in the WHERE clause of the outer query filter out nearly every row, and the subquery only has to run for a couple of rows. (That is, when we only have to check a few needles in a haystack. When we need to process the whole stack of hay, the anti-join pattern is usually faster.)
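Whichever form turns out faster on your data, the two should return the same rows. Here is a quick sketch that checks this on toy data, assuming Python's built-in sqlite3 module as a stand-in for MySQL; the sample rows are invented, and this says nothing about relative speed.

```python
import sqlite3

# Throwaway in-memory database; the schema and rows are assumptions for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE links (product_id TEXT NOT NULL, option_id INTEGER NOT NULL);
    INSERT INTO options (id, name) VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links (product_id, option_id) VALUES ('foo', 1), ('bar', 2);
""")

# Correlated subquery: for each option, test whether a link to the product exists.
not_exists_rows = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE NOT EXISTS ( SELECT 1
                          FROM links k
                         WHERE k.option_id = o.id
                           AND k.product_id = ?
                      )
     ORDER BY o.id
""", ("foo",)).fetchall()

# Anti-join: outer join, then keep only the rows that found no match.
anti_join_rows = conn.execute("""
    SELECT o.id, o.name
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
       AND k.product_id = ?
     WHERE k.option_id IS NULL
     ORDER BY o.id
""", ("foo",)).fetchall()

print(not_exists_rows)                    # [(2, 'green'), (3, 'blue')]
print(not_exists_rows == anti_join_rows)  # True
```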
And the beginner query you're most likely to see is a NOT IN (subquery). I'm not even going to give an example of that. If you've got a list of literals, then by all means, use a NOT IN. But with a subquery, it's rarely the best performer, though it does seem to be the easiest to understand.
Oh, what the hay, I'll give a demo of that as well (not that I'm encouraging you to do it this way):
SELECT o.id
, o.name
FROM options o
WHERE o.id NOT IN ( SELECT k.option_id
FROM links k
WHERE k.product_id = 'foo'
AND k.option_id IS NOT NULL
GROUP BY k.option_id
)
That subquery (inside the parens) gets a list of all the option_id values associated with a product.
Now, for each row in options (in the outer query), we can check the id value to see if it's in that list returned by the subquery.
If we have a guarantee that option_id will never be NULL, we can omit the predicate that tests for "option_id IS NOT NULL". (In the more general case, when a NULL creeps into the resultset, the outer query can't tell whether o.id is in the list or not, and the query doesn't return any rows; so I usually include that predicate, even when it's not required.) The GROUP BY isn't strictly necessary either, especially if there's a unique constraint (guaranteed uniqueness) on the (product_id,option_id) tuple.
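The NULL behaviour is easy to reproduce. The sketch below assumes Python's built-in sqlite3 module as a stand-in for MySQL, and a links table whose option_id column is nullable (an assumption made purely so a NULL can creep in); the sample rows are invented.

```python
import sqlite3

# Throwaway in-memory database. option_id is deliberately nullable here
# (an assumption for the demo) so the subquery can return a NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE links (product_id TEXT NOT NULL, option_id INTEGER);
    INSERT INTO options (id, name) VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links (product_id, option_id) VALUES ('foo', 1), ('foo', NULL);
""")

# Without the IS NOT NULL guard, "o.id NOT IN (1, NULL)" is never true,
# so no rows come back at all.
unguarded_rows = conn.execute("""
    SELECT o.id
      FROM options o
     WHERE o.id NOT IN ( SELECT k.option_id
                           FROM links k
                          WHERE k.product_id = 'foo'
                       )
     ORDER BY o.id
""").fetchall()

# With the guard, the NULL is removed from the list and the query behaves.
guarded_rows = conn.execute("""
    SELECT o.id
      FROM options o
     WHERE o.id NOT IN ( SELECT k.option_id
                           FROM links k
                          WHERE k.product_id = 'foo'
                            AND k.option_id IS NOT NULL
                       )
     ORDER BY o.id
""").fetchall()

print(unguarded_rows)  # []
print(guarded_rows)    # [(2,), (3,)]
```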
But, again, don't use that NOT IN (subquery), except for testing, unless there's some compelling reason to (for example, it manages to perform better than the anti-join).
You're unlikely to notice any performance differences with small sets; the overhead of transmitting the statement, parsing it, generating an access plan, and returning results dwarfs the actual "execution" time of the plan. It's with larger sets that the differences in "execution" time become apparent.
EXPLAIN SELECT ... is a really good way to get a handle on the execution plans, to see what MySQL is really doing with your statement.
Appropriate indexes, especially covering indexes, can noticeably improve performance of some statements.
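As a rough illustration of checking a plan after adding an index, the sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN. This is an assumption-heavy stand-in: MySQL's EXPLAIN output looks quite different and its optimizer makes its own choices, and the index name idx_links_product_option is hypothetical. The point is only the workflow of adding an index on (product_id, option_id) and asking the database whether it uses it.

```python
import sqlite3

# Throwaway in-memory database; schema and index name are assumptions for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE links (product_id TEXT NOT NULL, option_id INTEGER NOT NULL);
    -- The index holds both columns the anti-join reads from links.
    CREATE INDEX idx_links_product_option ON links (product_id, option_id);
""")

ANTI_JOIN = """
    SELECT o.id, o.name
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
       AND k.product_id = ?
     WHERE k.option_id IS NULL
"""

# EXPLAIN QUERY PLAN returns one row per plan step; the fourth column
# is the human-readable description of that step.
plan_details = [
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + ANTI_JOIN, ("foo",))
]

for detail in plan_details:
    print(detail)
```

In MySQL the equivalent check would be to prefix the statement with EXPLAIN and look at which key, if any, is reported for the links table.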
Yes, you can do a LEFT JOIN (if MySQL; there are variations in other dialects) which will include rows in links which do NOT have a match in options. Then test if options.someColumn IS NULL and you will have exactly the rows in links which had no "matching" row in options.
Try something along the lines of this.
To count:
SELECT Links.linkId, COUNT(*)
FROM Links
LEFT JOIN Options ON Links.optionId = Options.optionId
WHERE Options.optionId IS NULL
GROUP BY Links.linkId
To see the rows:
SELECT Links.linkId
FROM Links
LEFT JOIN Options ON Links.optionId = Options.optionId
WHERE Options.optionId IS NULL