Speed of MIN/MAX vs GROUP BY for aggregating constant-in-group values

In the table T, it is guaranteed that each value of column A is associated with exactly one value of column B (i.e. there is a functional dependency A → B). Because of this, both of the queries below return the same results. Which one will generally run faster?

Using GROUP BY on A and B

select
   A
   ,B
   ,sum(C) 
from
   T
group by
   A
   ,B

or using MAX/MIN on B?

select
   A
   ,MAX(B)
   ,sum(C) 
from
   T
group by
   A

I do know that the GROUP BY A and B version is better at not concealing data issues where an A arrives that is associated with more than one B. I'm just curious about whether one of the queries is generally more work for a DBMS to execute. If the answer depends entirely on the choice of DBMS and you still have interesting information to share, then choose your favourite DBMS and answer only for it.
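To make the equivalence (and the data-issue caveat) concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite, the sample rows, and the names GROUP_BY_BOTH, GROUP_BY_MAX and FD_VIOLATIONS are assumptions made for illustration only; they are not part of the original question. It shows that both queries agree while A → B holds, and that only the GROUP BY A, B version exposes a row that breaks the dependency.

```python
import sqlite3

# Hypothetical sample data; SQLite is only a stand-in for whatever DBMS holds T.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A INTEGER, B TEXT, C INTEGER)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [(1, "x", 10), (1, "x", 5), (2, "y", 7), (3, "z", 1), (3, "z", 2)],
)

# The two query shapes from the question, with ORDER BY added so results compare reliably.
GROUP_BY_BOTH = "SELECT A, B, SUM(C) FROM T GROUP BY A, B ORDER BY A, B"
GROUP_BY_MAX = "SELECT A, MAX(B), SUM(C) FROM T GROUP BY A ORDER BY A"

# Lists every A that maps to more than one B, i.e. every violation of A -> B.
FD_VIOLATIONS = (
    "SELECT A, COUNT(DISTINCT B) FROM T "
    "GROUP BY A HAVING COUNT(DISTINCT B) > 1"
)


def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# While the dependency holds, both queries return the same rows.
clean_both = run(GROUP_BY_BOTH)
clean_max = run(GROUP_BY_MAX)
clean_violations = run(FD_VIOLATIONS)

# Break the dependency: A = 1 now has two different B values.
conn.execute("INSERT INTO T VALUES (1, 'w', 100)")
dirty_both = run(GROUP_BY_BOTH)    # A = 1 now appears on two rows
dirty_max = run(GROUP_BY_MAX)      # A = 1 still appears once, hiding the problem
dirty_violations = run(FD_VIOLATIONS)

print(clean_both == clean_max)     # the two shapes agree on clean data
print(dirty_both)
print(dirty_max)
print(dirty_violations)
```

If the MAX/MIN form is used anyway, running something like the FD_VIOLATIONS check separately is one way to keep the safety net that GROUP BY A, B gives for free.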

Dzamo Norton asked Jan 30 '17 15:01



1 Answer

Well, I went ahead and ran a test on SQL Server 2016, even though I was interested in uncovering more general, theory-based information. I used four columns in the role of B above to accentuate any differences in run time, and submitted a batch containing both types of query above. The execution plans generated by SQL Server were almost identical, but the cost reported for the GROUP BY query was 53% of the batch while that of the MAX/MIN query was 47%.

The initial index seek step is identical for both queries. It is followed by a hash table building step, in which the GROUP BY version incurs a higher cost than the MAX/MIN version. The steps after that have negligible cost for both versions.

Counter-intuitively, in spite of the GROUP BY version having a slightly higher cost, it runs in slightly less time. I guess a query can still consume more CPU cycles overall yet finish sooner if it runs with greater parallelism. At this point I've reached the end of my ability (and appetite) to scry DBMS execution plans, so I'll leave it there.
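For anyone who wants to repeat this kind of comparison, here is a rough sketch of a timing harness. It is not the test described above: it uses Python's built-in sqlite3 module as a stand-in rather than SQL Server 2016, and the column names B1 to B4, the row counts, and the helper timed() are all assumptions for illustration. Timings from SQLite say nothing reliable about SQL Server, so treat it only as a template to adapt to your own DBMS.

```python
import random
import sqlite3
import time

random.seed(0)  # fixed seed so the generated data is reproducible

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (A INTEGER, B1 TEXT, B2 TEXT, B3 TEXT, B4 TEXT, C INTEGER)"
)

# Hypothetical data: B1..B4 are derived from A, so A -> B1, B2, B3, B4 holds.
rows = []
for _ in range(50_000):
    a = random.randrange(1_000)
    rows.append((a, f"b1-{a}", f"b2-{a}", f"b3-{a}", f"b4-{a}", random.randrange(100)))
conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?, ?, ?)", rows)

# Four columns play the role of B, as in the test described above.
GROUP_BY_ALL = (
    "SELECT A, B1, B2, B3, B4, SUM(C) FROM T "
    "GROUP BY A, B1, B2, B3, B4 ORDER BY A"
)
GROUP_BY_MAX = (
    "SELECT A, MAX(B1), MAX(B2), MAX(B3), MAX(B4), SUM(C) FROM T "
    "GROUP BY A ORDER BY A"
)


def timed(sql, repeats=5):
    """Run a query several times; return its rows and the best wall-clock time."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return result, best


result_all, seconds_all = timed(GROUP_BY_ALL)
result_max, seconds_max = timed(GROUP_BY_MAX)

print(f"GROUP BY all columns: {seconds_all:.4f}s")
print(f"MAX on B columns:     {seconds_max:.4f}s")
print("same results:", result_all == result_max)
```

Taking the best of several runs is just one way to reduce noise; on a real server you would more likely compare the actual execution plans and elapsed/CPU times the DBMS reports.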

Dzamo Norton answered Sep 22 '22 10:09