
How to add a running count to rows in a 'streak' of consecutive days

Thanks to Mike for the suggestion to add the create/insert statements.

create table test (
  pid integer not null,
  date date not null,
  primary key (pid, date)
);

insert into test values
  (1,'2014-10-1')
, (1,'2014-10-2')
, (1,'2014-10-3')
, (1,'2014-10-5')
, (1,'2014-10-7')
, (2,'2014-10-1')
, (2,'2014-10-2')
, (2,'2014-10-3')
, (2,'2014-10-5')
, (2,'2014-10-7');

I want to add a new column that is 'days in current streak' so the result would look like:

pid    | date      | in_streak
-------|-----------|----------
1      | 2014-10-1 | 1
1      | 2014-10-2 | 2
1      | 2014-10-3 | 3
1      | 2014-10-5 | 1
1      | 2014-10-7 | 1
2      | 2014-10-1 | 1
2      | 2014-10-2 | 2
2      | 2014-10-3 | 3
2      | 2014-10-5 | 1
2      | 2014-10-7 | 1

I've been trying to use the answers from

  • PostgreSQL: find number of consecutive days up until now
  • Return rows of the latest 'streak' of data

but I can't work out how to use the dense_rank() trick with other window functions to get the right result.

Ben asked Jan 30 '15 00:01


2 Answers

Building on this table (avoiding the SQL keyword "date" as a column name):

CREATE TABLE tbl(
  pid int
, the_date date
, PRIMARY KEY (pid, the_date)
);

Query:

SELECT pid, the_date
     , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
FROM  (
   SELECT *
        , the_date - '2000-01-01'::date
        - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
   FROM   tbl
) sub
ORDER  BY pid, the_date;

Subtracting a date from another date yields an integer (the number of days between them). Since you are looking for consecutive days, that integer grows by exactly one from each row to the next within a streak. row_number() also grows by one per row, so if we subtract it, every row of the same streak ends up with the same value (grp) per pid. Then it's simple to deal out a running number per group.
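To see why this works, it can help to look at the intermediate values. The following is only a sketch: the inline "sample" CTE and the column names day_no and rn are my own additions for illustration, and the rows are the pid 1 dates from the question.

```sql
-- Illustrative only: "sample", "day_no" and "rn" are made-up names,
-- not part of the query above.
WITH sample AS (
   SELECT pid, d::date AS the_date
   FROM  (VALUES (1, '2014-10-01'), (1, '2014-10-02'), (1, '2014-10-03')
               , (1, '2014-10-05'), (1, '2014-10-07')) v(pid, d)
)
SELECT pid, the_date
     , the_date - '2000-01-01'::date AS day_no   -- days since an arbitrary fixed date
     , row_number() OVER (PARTITION BY pid ORDER BY the_date) AS rn
     , the_date - '2000-01-01'::date
     - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
FROM   sample
ORDER  BY pid, the_date;
```

For the first three rows day_no and rn both grow by one, so grp stays the same. At 2014-10-05 and again at 2014-10-07, day_no jumps by two while rn grows by one, so grp increases by one each time and a new streak starts.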

grp is calculated with two subtractions, which should be fastest. An equally fast alternative could be:

the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp

One multiplication, one subtraction. String concatenation and casting are more expensive. Test with EXPLAIN ANALYZE.
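A rough way to compare the variants, assuming tbl is filled with a realistic number of rows (a handful of rows will not show a meaningful difference): run the full query once per grp expression under EXPLAIN ANALYZE and compare the reported execution times over several runs. Shown here with the interval variant.

```sql
-- Sketch: timing the interval variant. Swap in the other grp expression
-- and repeat; assumes the table tbl defined above is populated.
EXPLAIN ANALYZE
SELECT pid, the_date
     , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
FROM  (
   SELECT *
        , the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp
   FROM   tbl
) sub
ORDER  BY pid, the_date;
```

Keep in mind that EXPLAIN ANALYZE actually executes the statement, which is harmless for a SELECT like this one.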

Don't forget to partition by pid additionally in both steps, or you'll inadvertently mix groups that should be separated.
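A small demonstration of what goes wrong without pid in the outer partition. This is a sketch using an inline "sample" CTE with the rows from the question; the column "mixed" is a deliberately wrong variant added only for comparison.

```sql
-- Illustrative only: "sample" and "mixed" are made-up names.
WITH sample AS (
   SELECT pid, d::date AS the_date
   FROM  (VALUES (1, '2014-10-01'), (1, '2014-10-02'), (1, '2014-10-03')
               , (1, '2014-10-05'), (1, '2014-10-07')
               , (2, '2014-10-01'), (2, '2014-10-02'), (2, '2014-10-03')
               , (2, '2014-10-05'), (2, '2014-10-07')) v(pid, d)
)
SELECT pid, the_date
     , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
     , row_number() OVER (PARTITION BY grp ORDER BY the_date, pid) AS mixed  -- wrong: pid missing
FROM  (
   SELECT *
        , the_date - '2000-01-01'::date
        - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
   FROM   sample
) sub
ORDER  BY pid, the_date;
```

Both pids have the same dates here, so they produce identical grp values. in_streak comes out as 1, 2, 3, 1, 1 for each pid, while mixed counts across both pids: 1, 3, 5, 1, 1 for pid 1 and 2, 4, 6, 2, 2 for pid 2.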

I use a subquery because that is typically faster than a CTE, and there is nothing here that a plain subquery couldn't do.

And since you mentioned it: dense_rank() is obviously not necessary here. Basic row_number() does the job.
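If you want to convince yourself, the two functions can be compared side by side. A sketch with a few inline rows (the "sample" CTE and the aliases rn and dr are assumptions for illustration); because (pid, the_date) is unique there are no ties within a partition, so both columns agree on every row.

```sql
-- Illustrative only: "sample", "rn" and "dr" are made-up names.
WITH sample AS (
   SELECT pid, d::date AS the_date
   FROM  (VALUES (1, '2014-10-01'), (1, '2014-10-02')
               , (1, '2014-10-03'), (1, '2014-10-05')) v(pid, d)
)
SELECT pid, the_date
     , row_number() OVER w AS rn
     , dense_rank() OVER w AS dr   -- same as rn when there are no ties
FROM  (
   SELECT *
        , the_date - '2000-01-01'::date
        - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
   FROM   sample
) sub
WINDOW w AS (PARTITION BY pid, grp ORDER BY the_date)
ORDER  BY pid, the_date;
```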

Erwin Brandstetter answered Nov 16 '22 16:11


You'll get more attention if you include CREATE TABLE statements and INSERT statements in your question.

create table test (
  pid integer not null,
  date date not null,
  primary key (pid, date)
);

insert into test values
(1,'2014-10-1'), (1,'2014-10-2'), (1,'2014-10-3'), (1,'2014-10-5'),
(1,'2014-10-7'), (2,'2014-10-1'), (2,'2014-10-2'), (2,'2014-10-3'),
(2,'2014-10-5'), (2,'2014-10-7');

The principle is simple. A streak of distinct, consecutive dates minus row_number() is a constant. You can group by the constant, and take the dense_rank() over that result.

with grouped_dates as (
  select pid, date,
         (date - (row_number() over (partition by pid order by date) || ' days')::interval)::date as grouping_date
  from test
)
-- partition by pid as well, so equal grouping_date values from different pids are not mixed
select *, dense_rank() over (partition by pid, grouping_date order by date) as in_streak
from grouped_dates
order by pid, date;

pid  date         grouping_date  in_streak
------------------------------------------
1    2014-10-01   2014-09-30     1
1    2014-10-02   2014-09-30     2
1    2014-10-03   2014-09-30     3
1    2014-10-05   2014-10-01     1
1    2014-10-07   2014-10-02     1
2    2014-10-01   2014-09-30     1
2    2014-10-02   2014-09-30     2
2    2014-10-03   2014-09-30     3
2    2014-10-05   2014-10-01     1
2    2014-10-07   2014-10-02     1
Mike Sherrill 'Cat Recall' answered Nov 16 '22 16:11