I'm working on exercise 17 in the Teach Yourself SQL program GalaXQL (based on SQLite). I've got three tables:
Stars
that contains starid
;Planets
that contains planetid
and starid
;Moons
that contains moonid
and planetid
.I want to return the starid
associated with the greatest number of planets and moons combined.
I've got a query that will return the starid
, planetid
and total
planets + moons.
How do I change this query so it only returns the single starid
corresponding to the max(total)
and not a table? This is my query so far:
select
stars.starid as sid,
planets.planetid as pid,
(count(moons.moonid)+count(planets.planetid)) as total
from stars, planets, moons
where planets.planetid=moons.planetid and stars.starid=planets.starid
group by stars.starid
Let's visualize a system that might be represented by this database structure, and see if we can't translate your question into working SQL.
To distinguish stars and planets from moons, I've used capital Roman numerals for starid
values and lower-case Roman numerals for moonid
values. And since everyone knows that astronomers have nothing to do on those long nights in the observatory but drink, I put an unexplained gap in the middle of your planetid
values. Gaps like these will occur when using so-called "surrogate" IDs, because their values hold no meaning; they are simply unique identifiers for rows.
If you'd like to follow along, here's the galaxy naively loaded into SQL Fiddle (if you get a popup about switching to WebSQL, you may need to hit "cancel" and stick with SQL.js for this example to work).
I want to return the
starid
associated with the greatest number of planets and moons combined
Awesome. Rephrased, the question is: Which star is associated with the greatest number of orbiting bodies?
All we're doing here is counting the different entities associated with each star. With a total of 5 orbiting bodies, star (II) is the winner! So the final result we expect from a working query is:
| starid |
|--------|
| 2 |
I intentionally drew this awesome galaxy such that the "winning" star doesn't have the most planets and isn't associated with the planet that has the most moons. If those astronomers weren't all three sheets to the wind, I might have gotten an extra moon out of planet (1) as well, so that our winning star isn't tied for most moons total. It'll be convenient for us in this demonstration if star (II) only answers the question we're asking and not any other questions with potentially similar queries, to reduce our chances of arriving at the right answer via the wrong query.
The first thing I want to do is introduce you to the explicit JOIN
syntax. This is going to be your very close friend. You will always JOIN
your tables, no matter what some silly tutorial says. Trust in my far sillier advice instead (and optionally, read Explicit vs implicit SQL joins).
The explicit JOIN
syntax shows how we're requiring our tables to relate to each other and reserves the WHERE
clause for the sole purpose of filtering rows from the result set. There are a few different types, but what we're going to start with is a plain old INNER JOIN
. This is essentially what your original query performed and it implies that all you want to see in your result set is the data that overlaps in all three tables. Check out a skeleton of your original query:
SELECT ... FROM stars, planets, moons
WHERE planets.planetid = moons.planetid
AND planets.starid = stars.starid;
Given those conditions, what happens to an orphaned planet somewhere off in space that isn't associated with a star (i.e., its starid
is NULL
)? Since an orphaned planet has no overlap with the stars
table, an INNER JOIN
wouldn't include it in the result set.
In SQL any equality or inequality comparison with NULL
gives a result of NULL
—even NULL = NULL
isn't true! Now your query has a problem, because the other condition is that planets.planetid = moons.planetid
. If there's a planet for which no corresponding moon exists, that turns into planets.planetid = NULL
and the planet will not appear in your query result. That's no good! Lonely planets must be counted!
OUTER
limitsFortunately there's a JOIN
for you: An OUTER JOIN
, which will ensure that at least one of the tables always shows up in our result set. They come in LEFT
and RIGHT
flavors to indicate which table gets special treatment, relative to the position of the JOIN
keyword. What joins does SQLite support? confirms that the INNER
and OUTER
keywords are optional, so we can use LEFT JOIN
, noting that:
stars
and planets
are linked by a common starid
;planets
and moons
are linked by a common planetid
;stars
and moons
are linked indirectly by the above two links;SELECT
*
FROM
stars
LEFT JOIN
planets ON stars.starid = planets.starid
LEFT JOIN
moons ON planets.planetid = moons.planetid;
Notice that instead of having a big bag o' tables and a WHERE
clause, you now have one ON
clause for each JOIN
. As you find yourself working with more tables, this is going to be far easier to read; and because this is standard syntax, it's relatively portable between SQL databases.
Our new query basically grabs everything in our database. But does this correspond to everything in our galaxy? Actually, there's some redundancy here, because two of our ID fields (starid
and planetid
) exist in more than one table. This is only one of many reasons to avoid the SELECT *
catch-all syntax in actual use cases. We only really need the three ID fields, and I'm going to throw in two more tricks while we're at it:
table_name AS alias
syntax. This can be really convenient when you have to refer to many different columns in a multi-table query and you don't want to type out the full table names each time.starid
from the planets
table and leave stars
out of the JOIN
entirely! Having stars LEFT JOIN planets ON stars.starid = planets.starid
means that the starid
field is going to be the same, no matter which table we get it from—as long as the star has any planets. If we were counting stars, we'd need this table, but we're counting planets and moons; moons by definition orbit planets, so a star with no planets also has no moons and can be ignored. (This is an assumption; check your data to make sure it's justified! Maybe your astronomers are more drunk than usual!)SELECT
p.starid, -- This could be S.starid, if we kept using `stars`
p.planetid,
m.moonid
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid;
Result:
| starid | planetid | moonid |
|--------|----------|--------|
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 1 | 3 |
| 2 | 2 | 6 |
| 2 | 3 | 4 |
| 2 | 3 | 5 |
| 3 | 7 | |
| 3 | 8 | 7 |
| 3 | 9 | |
Now our task is to decide which star is the winner, and for that we have to do some simple calculation. Let's count moons first; since they have no "children" and only one "parent" each, they're easy to aggregate:
SELECT
p.starid,
p.planetid,
COUNT(m.moonid) AS moon_count
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid, p.planetid;
Result:
| starid | planetid | moon_count |
|--------|----------|------------|
| 1 | 1 | 3 |
| 2 | 2 | 1 |
| 2 | 3 | 2 |
| 3 | 7 | 0 |
| 3 | 8 | 1 |
| 3 | 9 | 0 |
(Note: Usually we like to use COUNT(*)
because it's simple to type and to read, but it would get us into trouble here! Since two of our rows have a NULL
value for the moonid
, we have to use COUNT(moonid)
to avoid counting moons that don't exist.)
So far, so good—I see six planets, we know which star each belongs to, and the right number of moons are shown for each planet. Next step, counting the planets. You might think this requires a subquery in order to also add up the moon_count
column for each planet but it's actually simpler than that; if we GROUP BY
the star, our moon_count
will switch from counting "moons per planet, per star" to "moons per star" which is just fine:
SELECT
p.starid,
COUNT(p.planetid) AS planet_count,
COUNT(m.moonid) AS moon_count
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid;
Result:
| starid | planet_count | moon_count |
|--------|--------------|------------|
| 1 | 3 | 3 |
| 2 | 3 | 3 |
| 3 | 3 | 1 |
Now we've run into trouble. The moon_count
is correct, but you should see right away that the planet_count
is wrong. Why is this? Look back at the ungrouped query result and notice that there are nine rows, with three rows for each starid
, and each row has a non-null value for planetid
. That's what we asked the database to count with this query, when what we really meant to ask was how many different planets are there? Planet (1) appears three times with star (I) but it's the same planet each time. The fix is to stick the DISTINCT
keyword inside the COUNT()
function call. At the same time, we can add the two columns together:
SELECT
p.starid,
COUNT(DISTINCT p.planetid)+ COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid;
Result:
| starid | total_bodies |
|--------|--------------|
| 1 | 4 |
| 2 | 5 |
| 3 | 4 |
Counting the orbiting bodies around each star in the drawing, we can see that the total_bodies
column is correct. But you didn't ask for all this information; you just want to know who won. Well, there are a bunch of ways to get there, and depending on the size and makeup of your galaxy (database), some may be more efficient than others. One approach is to ORDER BY
the total_bodies
expression so that the "winner" appears at the top, LIMIT 1
so that we don't see the losers, and select only the starid
column (see it on SQL Fiddle).
The problem with that approach is that it hides ties. What if we gave the losing stars in our galaxy each an extra planet or moon? Now we've got a three way tie—everyone's a winner! But who shows up first when we ORDER BY
a value that's always the same? In the SQL standard, this is undefined; there's no telling who will come out on top. You might run the same query twice on the same data and get two different results!
For this reason, you might prefer to ask which stars have the greatest number of orbital bodies, instead of specifying in your question that you know there is only one value. This is a more typically set-based approach and it's not a bad idea to get used to set-based thinking when working with relational databases. Until you execute the query, you don't know the size of the result set; if you're going to assume there's not a tie for first place, you have to justify that assumption somehow. (Since astronomers regularly find new moons and planets, I'd have a hard time justifying this one!)
The way I'd prefer to write this query is with something called a Common Table Expression (CTE). These are supported in recent versions of SQLite and in many other databases but last I checked GalaXQL was using an older version of the SQLite engine that doesn't include this feature. CTEs let you refer to a subquery multiple times using an alias, rather than having to write it out in full each time. A solution using CTEs could look like this:
WITH body_counts AS
(SELECT
p.starid,
COUNT(DISTINCT p.planetid) + COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
SELECT
starid
FROM
body_counts
WHERE
total_bodies = (SELECT MAX(total_bodies) FROM body_counts);
Result:
| STARID |
|--------|
| 2 |
Check out this query in action on SQLFiddle. To confirm that this query can show more than one row in the case of a tie, try changing MAX()
on the last line to MIN()
.
Doing this without CTEs is ugly but it can be done if the table size is manageable. Looking at the query above, our CTE is aliased as body_counts
and we refer to it twice—in the FROM
clause and in the WHERE
clause. We can replace both of those references with the statement that we used to define body_counts
(removing the id column once in the second subquery, where it's not used):
SELECT
starid
FROM
(SELECT
p.starid,
COUNT(DISTINCT p.planetid) + COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
WHERE
total_bodies = (SELECT MAX(total_bodies) FROM
(SELECT
COUNT(DISTINCT p.planetid)+ COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
);
This is the tie-friendly approach that should work for you in GalaXQL. See it working here in SQLFiddle.
Now that you've seen both, isn't the CTE version easier to understand? MySQL, which didn't support CTEs until the 2018 release of version 8.0, would additionally demand aliases for our subqueries. Fortunately, SQLite does not, because in this case it's just extra verbiage to add to an already over-complicated query.
Well, that was fun—are you sorry you asked? ;)
(P.S., if you were wondering what's up with planet number nine: giant space potato chips tend to have very eccentric orbits.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With