In the the table T, it is guaranteed that each value of column A is associated with exactly one value of column B (i.e. that there is a functional dependency A → B).  Because of this both of the queries below return the same results.  Which one will generally run faster?
Using GROUP BY on A and B
select
   A
   ,B
   ,sum(C) 
from
   T
group by
   A
   ,B
or using MAX/MIN on B?
select
   A
   ,MAX(B)
   ,sum(C) 
from
   T
group by
   A
I do know that the GROUP BY A and B version is better at not concealing data issues where an A arrives that is associated with more than one B, I'm just curious about whether one of the queries is generally more work for a DBMS to execute.  If the answer depends entirely on the choice of DBMS and you still have interesting information to share then choose your favourite DBMS and answer only for it.
1.17 GROUP BY Optimization. The most general way to satisfy a GROUP BY clause is to scan the whole table and create a new temporary table where all rows from each group are consecutive, and then use this temporary table to discover groups and apply aggregate functions (if any).
The main problem with GROUP BY is that queries involving it are usually slow, especially when compared with WHERE -only queries.
The GROUP BY statement is often used with aggregate functions ( COUNT() , MAX() , MIN() , SUM() , AVG() ) to group the result-set by one or more columns.
Well I went ahead and ran a test on SQL Server 2016 even though I was interested in uncovering more general, theory-based information.  I used four columns in the role of B above to accentuate any differences in run time and submitted a batch containing both types of query above.  The execution plans generated by SQL Server were almost identical but the cost reported for the GROUP BY query was 53% of the batch while that of the MAX/MIN query was 47%.
The initial index seek step is identical for both queries. It is followed by hash table building step in which the GROUP BY version incurs a higher cost than the MAX/MIN version. The steps after that have negligible cost for both versions.
Counter-intuitively, in spite of the GROUP BY version having a slightly higher cost, it runs in slightly less time. I guess it's still possible to consume more CPU cycles while running if parallelism is greater. At this point I've reached the end of my ability (and appetite) to scry DBMS execution plans so I'll leave it there.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With