Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using SELECT...GROUP BY...HAVING in SQLite

Tags:

sql

sqlite

I'm working on exercise 17 in the Teach Yourself SQL program GalaXQL (based on SQLite). I've got three tables:

  • Stars that contains starid;
  • Planets that contains planetid and starid;
  • Moons that contains moonid and planetid.

I want to return the starid associated with the greatest number of planets and moons combined.

I've got a query that will return the starid, planetid and total planets + moons.

How do I change this query so it only returns the single starid corresponding to the max(total) and not a table? This is my query so far:

select
    stars.starid as sid,
    planets.planetid as pid,
    (count(moons.moonid)+count(planets.planetid)) as total
from stars, planets, moons
where planets.planetid=moons.planetid and stars.starid=planets.starid
group by stars.starid
like image 512
ant bait Avatar asked Nov 28 '22 03:11

ant bait


1 Answers

Let's visualize a system that might be represented by this database structure, and see if we can't translate your question into working SQL.

I drew you a galaxy:

badly drawn planetary systems

To distinguish stars and planets from moons, I've used capital Roman numerals for starid values and lower-case Roman numerals for moonid values. And since everyone knows that astronomers have nothing to do on those long nights in the observatory but drink, I put an unexplained gap in the middle of your planetid values. Gaps like these will occur when using so-called "surrogate" IDs, because their values hold no meaning; they are simply unique identifiers for rows.

If you'd like to follow along, here's the galaxy naively loaded into SQL Fiddle (if you get a popup about switching to WebSQL, you may need to hit "cancel" and stick with SQL.js for this example to work).

Let's see, what was it you wanted again?

I want to return the starid associated with the greatest number of planets and moons combined

Awesome. Rephrased, the question is: Which star is associated with the greatest number of orbiting bodies?

  • Star (I) has 1 planet with 3 moons;
  • Star (II) has 1 planet with 1 moon and 1 planet with 2 moons;
  • Star (III) has 1 planet with 1 moon and 2 planets with no moons.

All we're doing here is counting the different entities associated with each star. With a total of 5 orbiting bodies, star (II) is the winner! So the final result we expect from a working query is:

| starid |
|--------|
| 2      |

I intentionally drew this awesome galaxy such that the "winning" star doesn't have the most planets and isn't associated with the planet that has the most moons. If those astronomers weren't all three sheets to the wind, I might have gotten an extra moon out of planet (1) as well, so that our winning star isn't tied for most moons total. It'll be convenient for us in this demonstration if star (II) only answers the question we're asking and not any other questions with potentially similar queries, to reduce our chances of arriving at the right answer via the wrong query.

Lost in translation

The first thing I want to do is introduce you to the explicit JOIN syntax. This is going to be your very close friend. You will always JOIN your tables, no matter what some silly tutorial says. Trust in my far sillier advice instead (and optionally, read Explicit vs implicit SQL joins).

The explicit JOIN syntax shows how we're requiring our tables to relate to each other and reserves the WHERE clause for the sole purpose of filtering rows from the result set. There are a few different types, but what we're going to start with is a plain old INNER JOIN. This is essentially what your original query performed and it implies that all you want to see in your result set is the data that overlaps in all three tables. Check out a skeleton of your original query:

SELECT ... FROM stars, planets, moons
WHERE planets.planetid = moons.planetid 
    AND planets.starid = stars.starid;

Given those conditions, what happens to an orphaned planet somewhere off in space that isn't associated with a star (i.e., its starid is NULL)? Since an orphaned planet has no overlap with the stars table, an INNER JOIN wouldn't include it in the result set.

In SQL any equality or inequality comparison with NULL gives a result of NULL—even NULL = NULL isn't true! Now your query has a problem, because the other condition is that planets.planetid = moons.planetid. If there's a planet for which no corresponding moon exists, that turns into planets.planetid = NULL and the planet will not appear in your query result. That's no good! Lonely planets must be counted!

The OUTER limits

Fortunately there's a JOIN for you: An OUTER JOIN, which will ensure that at least one of the tables always shows up in our result set. They come in LEFT and RIGHT flavors to indicate which table gets special treatment, relative to the position of the JOIN keyword. What joins does SQLite support? confirms that the INNER and OUTER keywords are optional, so we can use LEFT JOIN, noting that:

  • stars and planets are linked by a common starid;
  • planets and moons are linked by a common planetid;
  • stars and moons are linked indirectly by the above two links;
  • we always want to count all the planets and all the moons.
SELECT
    *
FROM
    stars
        LEFT JOIN
    planets ON stars.starid = planets.starid
        LEFT JOIN
    moons ON planets.planetid = moons.planetid;

Notice that instead of having a big bag o' tables and a WHERE clause, you now have one ON clause for each JOIN. As you find yourself working with more tables, this is going to be far easier to read; and because this is standard syntax, it's relatively portable between SQL databases.

Lost in space

Our new query basically grabs everything in our database. But does this correspond to everything in our galaxy? Actually, there's some redundancy here, because two of our ID fields (starid and planetid) exist in more than one table. This is only one of many reasons to avoid the SELECT * catch-all syntax in actual use cases. We only really need the three ID fields, and I'm going to throw in two more tricks while we're at it:

  1. Aliases! You can give your tables more convenient names by using the table_name AS alias syntax. This can be really convenient when you have to refer to many different columns in a multi-table query and you don't want to type out the full table names each time.
  2. Grab starid from the planets table and leave stars out of the JOIN entirely! Having stars LEFT JOIN planets ON stars.starid = planets.starid means that the starid field is going to be the same, no matter which table we get it from—as long as the star has any planets. If we were counting stars, we'd need this table, but we're counting planets and moons; moons by definition orbit planets, so a star with no planets also has no moons and can be ignored. (This is an assumption; check your data to make sure it's justified! Maybe your astronomers are more drunk than usual!)
SELECT
    p.starid,    -- This could be S.starid, if we kept using `stars`
    p.planetid,
    m.moonid
FROM
    planets AS p
        LEFT JOIN
    moons AS m ON p.planetid = m.planetid;

Result:

| starid | planetid | moonid |
|--------|----------|--------|
|      1 |        1 |      1 |
|      1 |        1 |      2 |
|      1 |        1 |      3 |
|      2 |        2 |      6 |
|      2 |        3 |      4 |
|      2 |        3 |      5 |
|      3 |        7 |        |
|      3 |        8 |      7 |
|      3 |        9 |        |

Mathematical!

Now our task is to decide which star is the winner, and for that we have to do some simple calculation. Let's count moons first; since they have no "children" and only one "parent" each, they're easy to aggregate:

SELECT
    p.starid,
    p.planetid,
    COUNT(m.moonid) AS moon_count
FROM
    planets AS p
        LEFT JOIN
    moons AS m ON p.planetid = m.planetid
GROUP BY p.starid, p.planetid;

Result:

| starid | planetid | moon_count |
|--------|----------|------------|
|      1 |        1 |          3 |
|      2 |        2 |          1 |
|      2 |        3 |          2 |
|      3 |        7 |          0 |
|      3 |        8 |          1 |
|      3 |        9 |          0 |

(Note: Usually we like to use COUNT(*) because it's simple to type and to read, but it would get us into trouble here! Since two of our rows have a NULL value for the moonid, we have to use COUNT(moonid) to avoid counting moons that don't exist.)

So far, so good—I see six planets, we know which star each belongs to, and the right number of moons are shown for each planet. Next step, counting the planets. You might think this requires a subquery in order to also add up the moon_count column for each planet but it's actually simpler than that; if we GROUP BY the star, our moon_count will switch from counting "moons per planet, per star" to "moons per star" which is just fine:

SELECT
    p.starid,
    COUNT(p.planetid) AS planet_count,
    COUNT(m.moonid) AS moon_count
FROM
    planets AS p
        LEFT JOIN
    moons AS m ON p.planetid = m.planetid
GROUP BY p.starid;

Result:

| starid | planet_count | moon_count |
|--------|--------------|------------|
|      1 |            3 |          3 |
|      2 |            3 |          3 |
|      3 |            3 |          1 |

Now we've run into trouble. The moon_count is correct, but you should see right away that the planet_count is wrong. Why is this? Look back at the ungrouped query result and notice that there are nine rows, with three rows for each starid, and each row has a non-null value for planetid. That's what we asked the database to count with this query, when what we really meant to ask was how many different planets are there? Planet (1) appears three times with star (I) but it's the same planet each time. The fix is to stick the DISTINCT keyword inside the COUNT() function call. At the same time, we can add the two columns together:

SELECT
    p.starid,
    COUNT(DISTINCT p.planetid)+ COUNT(m.moonid) AS total_bodies
FROM
    planets AS p
        LEFT JOIN
    moons AS m ON p.planetid = m.planetid
GROUP BY p.starid;

Result:

| starid | total_bodies |
|--------|--------------|
|      1 |            4 |
|      2 |            5 |
|      3 |            4 |

And the winner is...

Counting the orbiting bodies around each star in the drawing, we can see that the total_bodies column is correct. But you didn't ask for all this information; you just want to know who won. Well, there are a bunch of ways to get there, and depending on the size and makeup of your galaxy (database), some may be more efficient than others. One approach is to ORDER BY the total_bodies expression so that the "winner" appears at the top, LIMIT 1 so that we don't see the losers, and select only the starid column (see it on SQL Fiddle).

The problem with that approach is that it hides ties. What if we gave the losing stars in our galaxy each an extra planet or moon? Now we've got a three way tie—everyone's a winner! But who shows up first when we ORDER BY a value that's always the same? In the SQL standard, this is undefined; there's no telling who will come out on top. You might run the same query twice on the same data and get two different results!

For this reason, you might prefer to ask which stars have the greatest number of orbital bodies, instead of specifying in your question that you know there is only one value. This is a more typically set-based approach and it's not a bad idea to get used to set-based thinking when working with relational databases. Until you execute the query, you don't know the size of the result set; if you're going to assume there's not a tie for first place, you have to justify that assumption somehow. (Since astronomers regularly find new moons and planets, I'd have a hard time justifying this one!)

The way I'd prefer to write this query is with something called a Common Table Expression (CTE). These are supported in recent versions of SQLite and in many other databases but last I checked GalaXQL was using an older version of the SQLite engine that doesn't include this feature. CTEs let you refer to a subquery multiple times using an alias, rather than having to write it out in full each time. A solution using CTEs could look like this:

WITH body_counts AS
    (SELECT
        p.starid,
        COUNT(DISTINCT p.planetid) + COUNT(m.moonid) AS total_bodies
    FROM
        planets AS p
            LEFT JOIN
        moons AS m ON p.planetid = m.planetid
    GROUP BY p.starid)
SELECT
    starid
FROM
    body_counts
WHERE
    total_bodies = (SELECT MAX(total_bodies) FROM body_counts);

Result:

| STARID |
|--------|
|      2 |

Check out this query in action on SQLFiddle. To confirm that this query can show more than one row in the case of a tie, try changing MAX() on the last line to MIN().

Just for you

Doing this without CTEs is ugly but it can be done if the table size is manageable. Looking at the query above, our CTE is aliased as body_counts and we refer to it twice—in the FROM clause and in the WHERE clause. We can replace both of those references with the statement that we used to define body_counts (removing the id column once in the second subquery, where it's not used):

SELECT
    starid
FROM
    (SELECT
        p.starid,
        COUNT(DISTINCT p.planetid) + COUNT(m.moonid) AS total_bodies
    FROM
        planets AS p
            LEFT JOIN
        moons AS m ON p.planetid = m.planetid
    GROUP BY p.starid)
WHERE
    total_bodies = (SELECT MAX(total_bodies) FROM 
        (SELECT
            COUNT(DISTINCT p.planetid)+ COUNT(m.moonid) AS total_bodies
        FROM
            planets AS p
                LEFT JOIN
            moons AS m ON p.planetid = m.planetid
        GROUP BY p.starid)
    );

This is the tie-friendly approach that should work for you in GalaXQL. See it working here in SQLFiddle.

Now that you've seen both, isn't the CTE version easier to understand? MySQL, which didn't support CTEs until the 2018 release of version 8.0, would additionally demand aliases for our subqueries. Fortunately, SQLite does not, because in this case it's just extra verbiage to add to an already over-complicated query.

Well, that was fun—are you sorry you asked? ;)

(P.S., if you were wondering what's up with planet number nine: giant space potato chips tend to have very eccentric orbits.)

like image 71
Air Avatar answered Dec 31 '22 10:12

Air