Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL Group By and HAVING

I'm a MySQL query noobie so I'm sure this is a question with an obvious answer.

But, I was looking at these two queries. Will they return different result sets? I understand that the sorting process would commence differently, but I believe they will return the same results with the first query being slightly more efficient?

Query 1: HAVING, then AND

SELECT user_id   
FROM forum_posts  
GROUP BY user_id 
    HAVING COUNT(id) >= 100   
    AND user_id NOT IN (SELECT user_id FROM banned_users)

Query 2: WHERE, then HAVING

SELECT user_id   
FROM forum_posts 
WHERE user_id NOT IN(SELECT user_id FROM banned_users) 
GROUP BY user_id 
    HAVING COUNT(id) >= 100   
like image 732
kimmothy Avatar asked Jun 23 '11 16:06

kimmothy


2 Answers

Actually the first query will be less efficient (HAVING applied after WHERE).
UPDATE

Some pseudo code to illustrate how your queries are executed ([very] simplified version).
First query:
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_user
3. Group, count, etc.
4. Exclude records from the first result set if they are presented in the second

Second query
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_user
3. Exclude records from the first result set if they are presented in the second
4. Group, count, etc.

The order of steps 1,2 is not important, mysql can choose whatever it thinks is better. The important difference is in steps 3,4. Having is applied after GROUP BY. Grouping is usually more expensive than joining (excluding records can be considering as join operation in this case), so the fewer records it has to group, the better performance.

like image 66
a1ex07 Avatar answered Sep 23 '22 13:09

a1ex07


You have already answers that the two queries will show same results and various opinions for which one is more efficient.

My opininion is that there will be a difference in efficiency (speed), only if the optimizer yields with different plans for the 2 queries. I think that for the latest MySQL versions the optimizers are smart enough to find the same plan for either query so there will be no difference at all but off course one can test and see either the excution plans with EXPLAIN or running the 2 queries against some test tables.

I would use the second version in any case, just to play safe.


Let me add that:

  • COUNT(*) is usually more efficient than COUNT(notNullableField) in MySQL. Until that is fixed in future MySQL versions, use COUNT(*) where applicable.

Therefore, you can also use:

SELECT user_id   
FROM forum_posts 
WHERE user_id NOT IN
  ( SELECT user_id FROM banned_users ) 
GROUP BY user_id 
HAVING COUNT(*) >= 100   
  • There are also other ways to achieve same (to NOT IN) sub-results before applying GROUP BY.

Using LEFT JOIN / NULL :

SELECT fp.user_id   
FROM forum_posts AS fp
  LEFT JOIN banned_users AS bu
    ON bu.user_id = fp.user_id
WHERE bu.user_id IS NULL 
GROUP BY fp.user_id 
HAVING COUNT(*) >= 100  

Using NOT EXISTS :

SELECT fp.user_id   
FROM forum_posts AS fp 
WHERE NOT EXISTS
  ( SELECT * 
    FROM banned_users AS bu
    WHERE bu.user_id = fp.user_id
  ) 
GROUP BY fp.user_id 
HAVING COUNT(*) >= 100   

Which of the 3 methods is faster depends on your table sizes and a lot of other factors, so best is to test with your data.

like image 23
ypercubeᵀᴹ Avatar answered Sep 21 '22 13:09

ypercubeᵀᴹ