Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

JOIN on another table after GROUP BY and COUNT

I'm trying to make sense of the right way to use JOIN, COUNT(*), and GROUP BY to do a pretty simple query. I've actually gotten it to work (see below) but from what I've read, I'm using an extra GROUP BY that I shouldn't be.

(Note: The problem below isn't my actual problem (which deals with more complicated tables), but I've tried to come up with an analogous problem)

I have two tables:

Table: Person
-------------
key  name     cityKey
1    Alice    1
2    Bob      2
3    Charles  2
4    David    1

Table: City
-------------
key  name
1    Albany
2    Berkeley
3    Chico

I'd like to do a query on the People (with some WHERE clause) that returns

  • the number of matching people in each city
  • the key for the city
  • the name of the city.

If I do

SELECT COUNT(Person.key) AS count, City.key AS cityKey, City.name AS cityName
FROM Person 
LEFT JOIN City ON Person.cityKey = City.key 
GROUP BY Person.cityKey, City.name

I get the result that I want

count   cityKey   cityName
2       1         Albany
2       2         Berkeley

However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.

So what's the right way to do this? I've been trying to google for an answer, but I feel like there's something fundamental that I'm just not getting.

like image 470
Craig S Avatar asked Jun 20 '10 16:06

Craig S


4 Answers

I don't think that it's "wrong" in this case, because you've got a one-to-one relationship between city name and city key. You could rewrite it such that you join to a sub-select to get the count of persons to cities by key, to the city table again for the name, but it's debatable that that'd be better. It's a matter of style and opinion I guess.

select PC.ct, City.key, City.name
  from City
  join (select count(Person.key) ct, cityKey key from Person group by cityKey) PC
    on City.key = PC.key

if my SQL isn't too rusty :-)

like image 118
Pointy Avatar answered Nov 17 '22 00:11

Pointy


...I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.

You misunderstand, you got it backwards.
Standard SQL requires you to specify in the GROUP BY all the columns mentioned in the SELECT that are not wrapped in aggregate functions. If you don't want certain columns in the GROUP BY, wrap them in aggregate functions. Depending on the database, you could use the analytic/windowing function OVER...

However, MySQL and SQLite provide the "feature" where you can omit these columns from the group by - which leads to no end of "why doesn't this port from MySQL to fill_in_the_blank database?!" Stackoverflow and numerous other sites & forums.

like image 40
OMG Ponies Avatar answered Nov 17 '22 01:11

OMG Ponies


However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.

It's not wrong. You have to understand how the Query Optimizer sees your query. The order in which it is parsed is what requires you to "throw the last part in." The optimizer sees your query in something akin to this order:

  • the required tables are joined
  • the composite dataset is filtered through the WHERE clause
  • the remaining rows are chopped into groups by the GROUP BY clause, and aggregated
  • they are then filtered again, through the HAVING clause
  • finally operated on, by SELECT / ORDER BY, UPDATE or DELETE.

The point here is that it's not that the GROUP BY has to name all the columns in the SELECT, but in fact it is the opposite - the SELECT cannot include any columns not already in the GROUP BY.

like image 43
ahsteele Avatar answered Nov 17 '22 00:11

ahsteele


Your query would only work on MySQL, because you group on Person.cityKey but select city.key. All other databases would require you to use an aggregate like min(city.key), or to add City.key to the group by clause.

Because the combination of city name and city key is unique, the following are equivalent:

select    count(person.key), min(city.key), min(city.name)
...
group by  person.citykey

Or:

select    count(person.key), city.key, city.name
...
group by  person.citykey, city.key, city.name

Or:

select    count(person.key), city.key, max(city.name)
...
group by  city.key

All rows in the group will have the same city name and key, so it doesn't matter if you use the max or min aggregate.

P.S. If you'd like to count only different persons, even if they have multiple rows, try:

count(DISTINCT person.key)

instead of

count(person.key)
like image 1
Andomar Avatar answered Nov 17 '22 01:11

Andomar