Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL - Join Aggregated query or Aggregate/Sum after join?

I have a hard time figuring out what is best, or if there is difference at all, however i have not found any material to help my understanding of this, so i will ask this question, if not for me, then for others who might end up in the same situation.

Aggregating a sub-query before or after a join, in my specific situation the sub-query is rather slow due to fragmented data and bad normalization procedure,

I got a main query that is highly complex and a sub-query that is built from 3 small queries that is combined using union (will remove duplicate records) i only need a single value from this sub-query (for each line), so at some point i will end up summing this value, (together with grouping the necessary control data with it so i can join)

what will have the greatest impact?

  • To sum sub-query before the join and then join with the aggregated version
  • To leave the data raw, and then sum the value together with the rest of the main query

remember there are thousands of records that will be summed for each line, and the data is not native but built, and therefore may reside in memory, (that is just a guess from the query optimizers perspective)

like image 331
Christopher Bonitz Avatar asked Sep 19 '25 19:09

Christopher Bonitz


2 Answers

Usually I keep the group-by inside the subquery (referred as "inline view" in Oracle lingo). This way the query is much more simple and clear. Also I believe the execution plan is more efficient, because the data set to be aggregated is smaller and the resulting set of join keys is also smaller.

This is not a definitive answer though. If the row source that you are joining to the inline view has few matching rows, you may find that a early join reduces the aggregation effort.

The right anwer is: benchmark the queries for your particular data set.

like image 52
colemar Avatar answered Sep 22 '25 08:09

colemar


I think in such a general way there is no right or wrong way to do it. The performance from a query like the one that you describe depends on many different factors:

  • what kind of join are you actually doing (and what algorithm is used in the background)
  • is the data to be joined small enough to fit into the memory of the machine joining it?
  • what query optimizations are you using, i.e. what DBMS (Oracle, MsSQL, MySQL, ...)
  • ...

For your case I simply suggest benchmarking. I'm sorry if that does not seem like a satisfactory answer, but it is the way to go in many performance questions...

So set up a simple test using both your approaches and some test data, then pick whatever is faster.

like image 26
fgysin Avatar answered Sep 22 '25 09:09

fgysin