
SQL - sum data for all time, 30 days and 90 days for multiple columns individually

BACKGROUND:

I have data that looks like this

date        src    subsrc   subsubsrc   param1  param2
2020-02-01  src1    ksjd    dfd8        47      31    
2020-02-02  src1    djsk    zmnc        44      95    
2020-02-03  src2    skdj    awes        92      100   
2020-02-04  src2    mxsf    kajs        80      2     
2020-02-05  src3    skdj    asio        46      53    
2020-02-06  src3    dekl    jdqo        19      18    
2020-02-07  src3    dskl    dqqq        69      18    
2020-02-08  src4    sqip    riow        64      46    
2020-02-09  src5    ss01    qwep        34      34    

I am trying to aggregate for all time, last 30 days and last 90 days (no rolling sum)

So my final data would look like this:

src     subsrc  subsubsrc   p1_all  p1_30   p1_90   p2_all  p2_30   p2_90
src1    ksjd    dfd8        7       1       7       98      7       98
src1    djsk    zmnc        0       0       0       0       0       0
src2    skdj    awes        12      12      12      4       4       4
src2    mxsf    kajs        6       6       6       31      31      31
src3    skdj    asio        0       0       0       0       0       0
src3    dekl    jdqo        20      20      20      17      17      17
src3    dskl    dqqq        3       3       3       4       4       4
src4    sqip    qwep        0       0       0       0       0       0
src5    ss01    qwes        15      15      15      2       2       2

ABOUT DATA:

  • This is only dummy data, so the numbers in the expected output are not actual sums of the input.
  • There are tens of thousands of rows in my data.
  • There are about a dozen src columns that make up the key for the table.
  • There are about a dozen param columns that I have to sum for 30 days, 90 days and all time.
  • There are null values in the param columns.
  • There might be multiple rows for the same day and src column.
  • New data is added every day, and the query will probably be run every day to get the latest 30-day, 90-day and all-time sums.

WHAT I HAVE TRIED:

This is what I have come up with:

SELECT src, subsrc, subsubsrc,
SUM(param1) as param1_all,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 30 THEN param1 END) as param1_30,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 90 THEN param1 END) as param1_90,
SUM(param2) as param2_all,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 30 THEN param2 END) as param2_30,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 90 THEN param2 END) as param2_90,
FROM `MY_TABLE`
GROUP BY src
ORDER BY src

This actually works but I can anticipate how long this query is going to become for multiple sources and even more param columns.

I have been trying something called "Filtered aggregate functions (or manual pivot)" explained HERE. But I am unable to understand/implement it for my case.

I have also looked at dozens of answers, and most of them compute running sums for each day or deal with more complicated variants of this basic calculation. Maybe I am not searching for it correctly.

As you can see, I am a newbie in SQL and would really appreciate any help.

Urvah Shabbir asked Feb 18 '20 14:02


1 Answer

Your query looks quite good; conditional aggregation is the canonical method to pivot a dataset.

One way to possibly improve performance would be to change the date filter in the conditional expressions: wrapping the date column in a date function precludes the use of an index on that column.

Instead, you could phrase this as:

select 
    src, 
    subsrc, 
    subsubsrc,
    sum(param1) as param1_all,
    sum(case when date >= current_date - interval 30 day then param1 end) as param1_30,
    sum(case when date >= current_date - interval 90 day then param1 end) as param1_90,
    sum(param2) as param2_all,
    sum(case when date >= current_date - interval 30 day then param2 end) as param2_30,
    sum(case when date >= current_date - interval 90 day then param2 end) as param2_90
from my_table
group by src, subsrc, subsubsrc
order by src, subsrc, subsubsrc
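To check how this behaves with null values and with several rows for the same day and key, here is a minimal sketch that runs the same conditional aggregation against an in-memory SQLite database from Python. The assumptions are mine: SQLite's date(:today, '-30 days') stands in for current_date - interval 30 day, the reference date is fixed so the result is reproducible, and every row except the first is made up for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    "date TEXT, src TEXT, subsrc TEXT, subsubsrc TEXT, "
    "param1 INTEGER, param2 INTEGER)"
)

# Illustrative rows only: duplicates per day, nulls, and dates in each window.
rows = [
    ("2020-02-01", "src1", "ksjd", "dfd8", 47, 31),
    ("2020-02-01", "src1", "ksjd", "dfd8", 3, None),    # same day and key, null param2
    ("2019-12-15", "src1", "ksjd", "dfd8", 10, 5),      # inside 90 days, outside 30
    ("2019-01-01", "src1", "ksjd", "dfd8", 100, None),  # older than 90 days
    ("2020-02-03", "src2", "skdj", "awes", None, 100),  # param1 is always null here
]
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?, ?)", rows)

# :today replaces current_date so the windows do not move between runs.
query = """
SELECT src, subsrc, subsubsrc,
       SUM(param1) AS param1_all,
       SUM(CASE WHEN date >= date(:today, '-30 days') THEN param1 END) AS param1_30,
       SUM(CASE WHEN date >= date(:today, '-90 days') THEN param1 END) AS param1_90,
       SUM(param2) AS param2_all,
       SUM(CASE WHEN date >= date(:today, '-30 days') THEN param2 END) AS param2_30,
       SUM(CASE WHEN date >= date(:today, '-90 days') THEN param2 END) AS param2_90
FROM my_table
GROUP BY src, subsrc, subsubsrc
ORDER BY src, subsrc, subsubsrc
"""
result = conn.execute(query, {"today": "2020-02-18"}).fetchall()

for row in result:
    print(row)
# ('src1', 'ksjd', 'dfd8', 160, 50, 60, 36, 31, 36)
# ('src2', 'skdj', 'awes', None, None, None, 100, 100, 100)
```

SUM skips nulls, so a null only shows up in the output when every value in the window is null; wrap the sums in COALESCE(..., 0) if you would rather get zeros.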

For performance, the following index may be helpful: (src, subsrc, subsubsrc, date).

Note that I included all three non-aggregated columns (src, subsrc, subsubsrc) in the group by clause: starting with MySQL 5.7, this is mandatory by default (although you can change the sql mode to alter that behavior), and most other databases enforce the same constraint.
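As for the query getting long with a dozen param columns: the SQL itself stays repetitive, but nothing stops you from generating it. Below is a rough sketch in Python, assuming you run the query from a script. The function build_query and its arguments are hypothetical names of mine, and the column names are pasted into the string as-is, so only feed it names you control.

```python
def build_query(table, key_cols, param_cols, windows=(30, 90)):
    """Build the conditional-aggregation query as a string (hypothetical helper)."""
    select_items = list(key_cols)
    for param in param_cols:
        # All-time sum first, then one filtered sum per window.
        select_items.append(f"sum({param}) as {param}_all")
        for days in windows:
            select_items.append(
                f"sum(case when date >= current_date - interval {days} day "
                f"then {param} end) as {param}_{days}"
            )
    keys = ", ".join(key_cols)
    select_list = ",\n    ".join(select_items)
    return (
        f"select\n    {select_list}\n"
        f"from {table}\n"
        f"group by {keys}\n"
        f"order by {keys}"
    )


sql = build_query(
    "my_table",
    ["src", "subsrc", "subsubsrc"],
    ["param1", "param2"],
)
print(sql)
```

With the two param columns above this prints the same query as in this answer; adding a column or a window means changing one list rather than editing several lines by hand.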

GMB answered Sep 28 '22 08:09