Trying to count cumulative distinct entities using Redshift SQL

Question

I'm trying to get a cumulative count of distinct objects in Redshift over a time series. The straightforward thing would be to use COUNT(DISTINCT myfield) OVER (ORDER BY timefield DESC ROWS UNBOUNDED PRECEDING), but Redshift gives a "Window definition is not supported" error.

For example, the query below tries to count the cumulative distinct users for every week, from the first week to the present, but it fails with that error.

SELECT user_time.weeks_ago, 
       COUNT(distinct user_time.user_id) OVER
            (ORDER BY weeks_ago desc ROWS UNBOUNDED PRECEDING) as count
FROM   (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7) AS weeks_ago,
               ev.user_id as user_id
        FROM events as ev
        WHERE ev.action='some_user_action') as user_time

The goal is to build a cumulative time series of unique users who have performed an action. Any ideas on how to do this?

albielin · Accepted Answer

Here's how to get a distinct count with window functions, using an example cited elsewhere. I've added another row duplicating 'table' for '2015-01-01' to demonstrate that duplicates are counted only once.

The author of the example is wrong about the solution, but I'm just using his example.

create table public.test
(
  "date" date,
  item varchar(8),
  measure int
);

insert into public.test
    values
      ('2015-01-01', 'table',   12),
      ('2015-01-01', 'table',   120),
      ('2015-01-01', 'chair',   51),
      ('2015-01-01', 'lamp',    8),
      ('2015-01-02', 'table',   17),
      ('2015-01-02', 'chair',   72),
      ('2015-01-02', 'lamp',    23),
      ('2015-01-02', 'bed',     1),
      ('2015-01-02', 'dresser', 2),
      ('2015-01-03', 'bed',     1);

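WITH x AS (
    SELECT
      *,
      -- Rank items within each date; duplicate items share a rank
      DENSE_RANK()
      OVER (PARTITION BY "date"
        ORDER BY item) AS item_rank
    FROM public.test
)
SELECT
  "date",
  item,
  measure,
  -- The highest dense rank per date equals the number of distinct items
  max(item_rank)
  OVER (PARTITION BY "date") AS distinct_items
FROM x
ORDER BY 1;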

The CTE gets you the dense rank of each item per date, then the main query gets you the max of that dense rank per date, i.e., the distinct count of items per date.

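You need the dense rank rather than straight rank to count distincts: RANK skips numbers after ties, so its maximum can overstate the distinct count, while DENSE_RANK never leaves gaps.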
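
As a sanity check on the query above, a plain GROUP BY should give the same per-date numbers, since COUNT(DISTINCT ...) is accepted as an ordinary aggregate and only the window form is rejected. This is just a sketch against the same public.test table; distinct_items is an illustrative alias.

```sql
-- Per-date distinct counts without a window function.
-- Returns one row per date instead of repeating the count on every row.
SELECT
  "date",
  COUNT(DISTINCT item) AS distinct_items
FROM public.test
GROUP BY "date"
ORDER BY "date";
```

With the sample rows this gives 3 for 2015-01-01 (the two 'table' rows count once), 5 for 2015-01-02 and 1 for 2015-01-03. Note that these are per-date counts, not cumulative ones.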

Aneil Mallavarapu · Answer

Figured out the answer. The trick turned out to be a set of nested subqueries. The inner one calculates the week of each user's first action, the middle one counts the first-time engagements per week, and the final outer query performs the cumulative sum over the time series:

SELECT engaged_per_week.week AS week,
       SUM(engaged_per_week.total) OVER (ORDER BY engaged_per_week.week DESC ROWS UNBOUNDED PRECEDING) AS total
FROM
    -- COUNT OF FIRST TIME ENGAGEMENTS PER WEEK
    (SELECT engaged.first_week AS week,
            COUNT(engaged.first_week) AS total
     FROM
        -- WEEK OF FIRST ENGAGEMENT FOR EACH USER
        -- (the largest weeks-ago value is the earliest week the user acted)
        (SELECT   MAX(FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7)) AS first_week
         FROM     events ev
         WHERE    ev.name='some_user_action'
         GROUP BY ev.user_id) AS engaged
     GROUP BY week) AS engaged_per_week
ORDER BY week DESC;
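
The same first-appearance idea can be tried on the small public.test table from the other answer, which makes the result easy to check by hand. This is only a sketch: the names first_seen, new_per_date, first_date, new_items and cumulative_items are illustrative and do not come from either answer.

```sql
-- Cumulative distinct items per date, via each item's first appearance.
SELECT first_date AS "date",
       SUM(new_items) OVER (ORDER BY first_date ROWS UNBOUNDED PRECEDING) AS cumulative_items
FROM
    -- Number of items seen for the first time on each date
    (SELECT first_date,
            COUNT(*) AS new_items
     FROM
        -- First date on which each item appears (one row per item)
        (SELECT   item,
                  MIN("date") AS first_date
         FROM     public.test
         GROUP BY item) AS first_seen
     GROUP BY first_date) AS new_per_date
ORDER BY first_date;
```

With the sample rows, 2015-01-01 introduces three items and 2015-01-02 two more, so the running totals are 3 and 5. There is no row for 2015-01-03, because no item first appears on that date. If every period has to show up in the output, the result would need to be joined back to a complete list of periods.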

Tags: sql, amazon-redshift
