Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to generate a date range + count earlier dates from another table in PostgreSQL?

I have the following table:

links:

created_at           active 
2017-08-12 15:46:01  false
2017-08-13 15:46:01  true
2017-08-14 15:46:01  true
2017-08-15 15:46:01  false

When given a date range, I have to extract time series which tells me how many active links were created on a date equal or smaller than current (rolling) date.

Output (for date range 2017-08-12 - 2017-08-17):

day          count
2017-08-12   0 (there are 0 active links created on 2017-08-12 and earlier)
2017-08-13   1 (there is 1 active link created on 2017-08-13 and earlier)
2017-08-14   2 (there are 2 active links created on 2017-08-14 and earlier)
2017-08-15   2 ...
2017-08-16   2
2017-08-17   2

I came up with the following query for generating dates:

SELECT date_trunc('day', dd):: date
FROM generate_series
    ( '2017-08-12'::timestamp 
    , '2017-08-17'::timestamp
    , '1 day'::interval) dd

But the rolling counts confuse me and am unsure how to continue. Can this be solved with a window function?

like image 936
Matic Jurglič Avatar asked Sep 13 '17 21:09

Matic Jurglič


2 Answers

This should be fastest:

SELECT day::date
     , sum(ct) OVER (ORDER BY day) AS count
FROM   generate_series (timestamp '2017-08-12'
                      , timestamp '2017-08-17'
                      , interval  '1 day') day
LEFT   JOIN  (
   SELECT date_trunc('day', created_at) AS day, count(*) AS ct
   FROM   tbl
   WHERE  active -- fastest
   GROUP  BY 1
   ) t USING (day)
ORDER  BY 1;

dbfiddle here

count() only counts non-null rows, so you could use count(active OR NULL). But the fastest option for counting is to exclude irrelevant rows with a WHERE clause to begin with. Since we are adding all days with generate_series() anyway, this is the best option.

Compare:

  • For absolute performance, is SUM faster or COUNT?

Since generate_series() returns timestamp (not date) I use date_trunc() to get matching timestamps (very slightly faster).

like image 162
Erwin Brandstetter Avatar answered Sep 27 '22 18:09

Erwin Brandstetter


I would just use aggregation and cumulative sums -- assuming you have at least one per day:

select date_trunc('day', created_at)::date as created_date,
       sum(active::int) as actives,
       sum(sum(active::int)) over (date_trunc('day', created_at)) as running_actives
from t
group by created_date;

You only need to generate the dates if you have holes in the data. If you do, though, I would recommend including where active -- you can include it now, I just want to be sure there are no holes.

like image 26
Gordon Linoff Avatar answered Sep 27 '22 19:09

Gordon Linoff