I have a data set of email addresses and dates that those email addresses were added to a table. There can be multiple entries of an email address for various different dates. For example, if I have the data set below. I would be looking to get the date and count of distinct emails that we have between said date and 3 days ago.
Date | email
-------+----------------
1/1/12 | [email protected]
1/1/12 | [email protected]
1/1/12 | [email protected]
1/2/12 | [email protected]
1/2/12 | [email protected]
1/3/12 | [email protected]
1/4/12 | [email protected]
1/5/12 | [email protected]
1/5/12 | [email protected]
1/6/12 | [email protected]
1/6/12 | [email protected]
1/6/12 | [email protected]
Result set would look something like this if we use a date period of 3
date | count(distinct email)
-------+------
1/1/12 | 3
1/2/12 | 3
1/3/12 | 3
1/4/12 | 3
1/5/12 | 1
1/6/12 | 2
I can get a distinct count of a date range using the query below, but looking to get a count of a range by day so I do not have to manually update the range for hundreds of dates.
select test.date, count(distinct test.email)
from test_table as test
where test.date between '2012-01-01' and '2012-05-08'
group by test.date;
To count the number of different values that are stored in a given column, you simply need to designate the column you pass in to the COUNT function as DISTINCT . When given a column, COUNT returns the number of values in that column. Combining this with DISTINCT returns only the number of unique (and non-NULL) values.
Yes, you can use COUNT() and DISTINCT together to display the count of only distinct rows.
Right-click on any value in column B. Go to Value Field Settings. In the Summarize Values By tab, go to Summarize Value field by and set it to Distinct Count.
A simple COUNT(*) just has to count number of rows - no sorting involved, so it will always be faster than COUNT(DISTINCT) .
Test case:
CREATE TABLE tbl (date date, email text);
INSERT INTO tbl VALUES
('2012-01-01', '[email protected]')
, ('2012-01-01', '[email protected]')
, ('2012-01-01', '[email protected]')
, ('2012-01-02', '[email protected]')
, ('2012-01-02', '[email protected]')
, ('2012-01-03', '[email protected]')
, ('2012-01-04', '[email protected]')
, ('2012-01-05', '[email protected]')
, ('2012-01-05', '[email protected]')
, ('2012-01-06', '[email protected]')
, ('2012-01-06', '[email protected]')
, ('2012-01-06', '[email protected]`')
;
Query - returns only days where an entry exists in tbl
:
SELECT date
,(SELECT count(DISTINCT email)
FROM tbl
WHERE date BETWEEN t.date - 2 AND t.date -- period of 3 days
) AS dist_emails
FROM tbl t
WHERE date BETWEEN '2012-01-01' AND '2012-01-06'
GROUP BY 1
ORDER BY 1;
Or - return all days in the specified range, even if there are no rows for the day:
SELECT date
,(SELECT count(DISTINCT email)
FROM tbl
WHERE date BETWEEN g.date - 2 AND g.date
) AS dist_emails
FROM (SELECT generate_series(timestamp '2012-01-01'
, timestamp '2012-01-06'
, interval '1 day')::date) AS g(date);
db<>fiddle here
Result:
day | dist_emails
-----------+------------
2012-01-01 | 3
2012-01-02 | 3
2012-01-03 | 3
2012-01-04 | 3
2012-01-05 | 1
2012-01-06 | 2
This sounded like a job for window functions at first, but I did not find a way to define the suitable window frame. Also, per documentation:
Aggregate window functions, unlike normal aggregate functions, do not allow
DISTINCT
orORDER BY
to be used within the function argument list.
So I solved it with correlated subqueries instead. I guess that's the smartest way.
BTW, "between said date and 3 days ago" would be a period of 4 days. Your definition is contradictory there.
Slightly shorter, but slower for few days:
SELECT g.date, count(DISTINCT email) AS dist_emails
FROM (SELECT generate_series(timestamp '2012-01-01'
, timestamp '2012-01-06'
, interval '1 day')::date) AS g(date)
LEFT JOIN tbl t ON t.date BETWEEN g.date - 2 AND g.date
GROUP BY 1
ORDER BY 1;
Related:
A lateral join
is useful for such "sliding window" needs, like this:
SELECT
t.day
, ljl.dist_emails
FROM tbl t
LEFT JOIN LATERAL (
SELECT
count(DISTINCT email) as dist_emails
FROM tbl
WHERE day BETWEEN t.day - 2 AND t.day -- period of 3 days
) AS ljl ON TRUE
WHERE t.day BETWEEN '2012-01-01' AND '2012-01-06'
Note this is a variant to a previous query by Erwin Brandstetter, and it surprises me he hadn't suggested it, but these lateral joins excellent for this type of need.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With