I'm trying to get a cumulative count of distinct objects in Redshift over a time series. The straightforward approach would be COUNT(DISTINCT myfield) OVER (ORDER BY timefield DESC ROWS UNBOUNDED PRECEDING), but Redshift gives a "Window definition is not supported" error.
For example, the query below tries to find the cumulative number of distinct users for each week, from the first week to the present. However, it fails with the "Window function not supported" error.
SELECT user_time.weeks_ago,
       COUNT(DISTINCT user_time.user_id) OVER
           (ORDER BY weeks_ago DESC ROWS UNBOUNDED PRECEDING) AS count
FROM (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7) AS weeks_ago,
             ev.user_id AS user_id
      FROM events AS ev
      WHERE ev.action = 'some_user_action') AS user_time
The goal is to build a cumulative time series of unique users who have performed an action. Any ideas on how to do this?
You can get a distinct count with DENSE_RANK. Here's how to apply it to a previously cited example; I've added another row duplicating 'table' for '2015-01-01' to demonstrate how this counts distinct values.
The author of that example is wrong about the solution, but I'm just reusing his data.
create table public.test
(
    "date"  date,
    item    varchar(8),
    measure int
);
insert into public.test
values
    ('2015-01-01', 'table', 12),
    ('2015-01-01', 'table', 120),
    ('2015-01-01', 'chair', 51),
    ('2015-01-01', 'lamp', 8),
    ('2015-01-02', 'table', 17),
    ('2015-01-02', 'chair', 72),
    ('2015-01-02', 'lamp', 23),
    ('2015-01-02', 'bed', 1),
    ('2015-01-02', 'dresser', 2),
    ('2015-01-03', 'bed', 1);
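WITH x AS (
    SELECT
        *,
        -- Rank of each item within its date; duplicate items share a rank
        DENSE_RANK()
            OVER (PARTITION BY "date"
                  ORDER BY item) AS dense_rank
    FROM public.test
)
SELECT
    "date",
    item,
    measure,
    -- Highest dense rank per date = number of distinct items on that date
    max(dense_rank)
        OVER (PARTITION BY "date") AS distinct_items
FROM x
ORDER BY 1;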
The CTE gets you the dense rank of each item per date, then the main query gets you the max of that dense rank per date, i.e., the distinct count of items per date.
You need the dense rank rather than the straight rank to count distinct values: RANK skips numbers after ties, so its maximum can exceed the number of distinct values.
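To check the dense-rank idea without a Redshift cluster, here is a minimal sketch that runs the same shape of query in SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in here (an assumption; it needs a build with window-function support, 3.25 or later), and the function name `distinct_items_per_date` and the column alias `item_rank` are made up for illustration. It collapses the result to one row per date with SELECT DISTINCT so it is easy to read.

```python
import sqlite3

# Same sample rows as the example above.
ROWS = [
    ("2015-01-01", "table", 12),
    ("2015-01-01", "table", 120),
    ("2015-01-01", "chair", 51),
    ("2015-01-01", "lamp", 8),
    ("2015-01-02", "table", 17),
    ("2015-01-02", "chair", 72),
    ("2015-01-02", "lamp", 23),
    ("2015-01-02", "bed", 1),
    ("2015-01-02", "dresser", 2),
    ("2015-01-03", "bed", 1),
]


def distinct_items_per_date(rows):
    """Return (date, distinct item count) pairs using MAX over DENSE_RANK."""
    # SQLite is a local stand-in for Redshift (assumption for this sketch).
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE test ("date" TEXT, item TEXT, measure INTEGER)')
    conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
    query = """
        WITH x AS (
            SELECT "date",
                   DENSE_RANK() OVER (PARTITION BY "date" ORDER BY item) AS item_rank
            FROM test
        )
        SELECT DISTINCT "date",
               MAX(item_rank) OVER (PARTITION BY "date") AS distinct_items
        FROM x
        ORDER BY "date"
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, n_items in distinct_items_per_date(ROWS):
        print(day, n_items)
```

The duplicated 'table' row on '2015-01-01' shares a dense rank with the first one, so that date reports 3 distinct items rather than 4.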
Figured out the answer. The trick turned out to be a set of nested subqueries. The inner one calculates the week of each user's first action, the middle one counts how many users had their first action in each week, and the outer query computes the cumulative sum of those counts over the time series:
SELECT engaged_per_week.week AS week,
       SUM(engaged_per_week.total) OVER (ORDER BY engaged_per_week.week DESC ROWS UNBOUNDED PRECEDING) AS total
FROM
    -- Count of first-time engagements per week
    (SELECT engaged.first_week AS week,
            COUNT(engaged.first_week) AS total
     FROM
         -- Week of first engagement for each user (largest weeks-ago value)
         (SELECT MAX(FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7)) AS first_week
          FROM events ev
          WHERE ev.name = 'some_user_action'
          GROUP BY ev.user_id) AS engaged
     GROUP BY week) AS engaged_per_week
ORDER BY week DESC;
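The same first-occurrence-then-running-sum pattern can be tried locally. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in for Redshift (an assumption; it needs SQLite 3.25 or later for window functions). The `events` rows, the simplified `week` column (counting forward in time, so the first week is the MIN rather than the MAX of weeks ago), and the function name `cumulative_distinct_users` are all hypothetical.

```python
import sqlite3

# Hypothetical events as (user_id, week); week numbers increase over time.
EVENTS = [
    ("u1", 1), ("u2", 1),
    ("u1", 2), ("u3", 2),
    ("u1", 3), ("u2", 3),
    ("u4", 4),
]


def cumulative_distinct_users(events):
    """Return (week, cumulative distinct users) for weeks that gain new users."""
    # SQLite is a local stand-in for Redshift (assumption for this sketch).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, week INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    query = """
        SELECT first_week AS week,
               -- Running total of first-time users up to and including this week
               SUM(new_users) OVER (ORDER BY first_week
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                             AND CURRENT ROW) AS total
        FROM (
            -- Number of users whose first event falls in each week
            SELECT first_week, COUNT(*) AS new_users
            FROM (
                -- First week in which each user appears
                SELECT user_id, MIN(week) AS first_week
                FROM events
                GROUP BY user_id
            ) AS engaged
            GROUP BY first_week
        ) AS engaged_per_week
        ORDER BY first_week
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for week, total in cumulative_distinct_users(EVENTS):
        print(week, total)
```

Because each user is counted only in the week of their first event, the running sum equals the cumulative distinct count. One thing to keep in mind is that a week with no first-time users (week 3 in this sample data) produces no row at all; if every week must appear, the result would need to be joined against a complete list of weeks.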