Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count cumulative total in Postgresql

I am using count and group by to get the number of subscribers registered each day:

  SELECT created_at, COUNT(email)       FROM subscriptions  GROUP BY created at; 

Result:

created_at  count ----------------- 04-04-2011  100 05-04-2011   50 06-04-2011   50 07-04-2011  300 

I want to get the cumulative total of subscribers every day instead. How do I get this?

created_at  count ----------------- 04-04-2011  100 05-04-2011  150 06-04-2011  200 07-04-2011  500 
like image 243
khairul Avatar asked Apr 18 '11 04:04

khairul


People also ask

How do you calculate cumulative in SQL?

A Cumulative total or running total refers to the sum of values in all cells of a column that precedes the next cell in that particular column. As you can see the below screenshot which displays a cumulative total in column RUNNING TOTAL for column Value .

How do I count values in PostgreSQL?

COUNT(*) The COUNT(*) function returns the number of rows returned by a SELECT statement, including NULL and duplicates. When you apply the COUNT(*) function to the entire table, PostgreSQL has to scan the whole table sequentially. If you use the COUNT(*) function on a big table, the query will be slow.

What is cumulative count?

The cumulative count function is the sum of all the counts generated so far. Mathematically it is represented as. Sk = ∑Ci for i = 1 to k. When k = 3, i.e., 3rd cumulative count is calculated by adding up the first 3 counts C1, C2, C3.


2 Answers

With larger datasets, window functions are the most efficient way to perform these kinds of queries -- the table will be scanned only once, instead of once for each date, like a self-join would do. It also looks a lot simpler. :) PostgreSQL 8.4 and up have support for window functions.

This is what it looks like:

SELECT created_at, sum(count(email)) OVER (ORDER BY created_at) FROM subscriptions GROUP BY created_at; 

Here OVER creates the window; ORDER BY created_at means that it has to sum up the counts in created_at order.


Edit: If you want to remove duplicate emails within a single day, you can use sum(count(distinct email)). Unfortunately this won't remove duplicates that cross different dates.

If you want to remove all duplicates, I think the easiest is to use a subquery and DISTINCT ON. This will attribute emails to their earliest date (because I'm sorting by created_at in ascending order, it'll choose the earliest one):

SELECT created_at, sum(count(email)) OVER (ORDER BY created_at) FROM (     SELECT DISTINCT ON (email) created_at, email     FROM subscriptions ORDER BY email, created_at ) AS subq GROUP BY created_at; 

If you create an index on (email, created_at), this query shouldn't be too slow either.


(If you want to test, this is how I created the sample dataset)

create table subscriptions as    select date '2000-04-04' + (i/10000)::int as created_at,           '[email protected]' || (i%700000)::text as email    from generate_series(1,1000000) i; create index on subscriptions (email, created_at); 
like image 183
intgr Avatar answered Oct 01 '22 03:10

intgr


Use:

SELECT a.created_at,        (SELECT COUNT(b.email)           FROM SUBSCRIPTIONS b          WHERE b.created_at <= a.created_at) AS count   FROM SUBSCRIPTIONS a 
like image 38
OMG Ponies Avatar answered Oct 01 '22 03:10

OMG Ponies