I am using count
and group by
to get the number of subscribers registered each day:
SELECT created_at, COUNT(email) FROM subscriptions GROUP BY created at;
Result:
created_at count ----------------- 04-04-2011 100 05-04-2011 50 06-04-2011 50 07-04-2011 300
I want to get the cumulative total of subscribers every day instead. How do I get this?
created_at count ----------------- 04-04-2011 100 05-04-2011 150 06-04-2011 200 07-04-2011 500
A Cumulative total or running total refers to the sum of values in all cells of a column that precedes the next cell in that particular column. As you can see the below screenshot which displays a cumulative total in column RUNNING TOTAL for column Value .
COUNT(*) The COUNT(*) function returns the number of rows returned by a SELECT statement, including NULL and duplicates. When you apply the COUNT(*) function to the entire table, PostgreSQL has to scan the whole table sequentially. If you use the COUNT(*) function on a big table, the query will be slow.
The cumulative count function is the sum of all the counts generated so far. Mathematically it is represented as. Sk = ∑Ci for i = 1 to k. When k = 3, i.e., 3rd cumulative count is calculated by adding up the first 3 counts C1, C2, C3.
With larger datasets, window functions are the most efficient way to perform these kinds of queries -- the table will be scanned only once, instead of once for each date, like a self-join would do. It also looks a lot simpler. :) PostgreSQL 8.4 and up have support for window functions.
This is what it looks like:
SELECT created_at, sum(count(email)) OVER (ORDER BY created_at) FROM subscriptions GROUP BY created_at;
Here OVER
creates the window; ORDER BY created_at
means that it has to sum up the counts in created_at
order.
Edit: If you want to remove duplicate emails within a single day, you can use sum(count(distinct email))
. Unfortunately this won't remove duplicates that cross different dates.
If you want to remove all duplicates, I think the easiest is to use a subquery and DISTINCT ON
. This will attribute emails to their earliest date (because I'm sorting by created_at in ascending order, it'll choose the earliest one):
SELECT created_at, sum(count(email)) OVER (ORDER BY created_at) FROM ( SELECT DISTINCT ON (email) created_at, email FROM subscriptions ORDER BY email, created_at ) AS subq GROUP BY created_at;
If you create an index on (email, created_at)
, this query shouldn't be too slow either.
(If you want to test, this is how I created the sample dataset)
create table subscriptions as select date '2000-04-04' + (i/10000)::int as created_at, '[email protected]' || (i%700000)::text as email from generate_series(1,1000000) i; create index on subscriptions (email, created_at);
Use:
SELECT a.created_at, (SELECT COUNT(b.email) FROM SUBSCRIPTIONS b WHERE b.created_at <= a.created_at) AS count FROM SUBSCRIPTIONS a
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With