
Cumulative distinct count with Spark SQL

Using Spark 1.6.2.

Here is the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

I want to count the distinct visitors per day, accumulated with all the days before, i.e. a running distinct count (I don't know the exact term for that, sorry).

This should give:

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3

  • I tried a self-join, but it is really too slow.
  • I am sure a window function is what I am looking for, but I didn't manage to find the right one.
Thomas Decaux asked Jun 27 '17 13:06


2 Answers

You should be able to do:

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;

Actually, I think a better approach is to record each visitor only on the first day they appear, then take a running sum of those first appearances:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;
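
To make the first-appearance idea concrete, here is a minimal plain-Python sketch of the same logic on the sample data from the question. It is an illustration only, not Spark code: the names `rows` and `cumulative_distinct` are made up for this example. One assumed difference from the query above: that query groups by startday, so it returns only the days on which some visitor first appears (days 1 and 2 for the sample), whereas this sketch also carries the total forward to days with no new visitor.

```python
from itertools import accumulate

# Sample (day, visitorID) rows from the question.
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]


def cumulative_distinct(rows):
    """Map each day to the number of distinct visitors seen up to that day."""
    # Step 1: first day each visitor appears (min(day) ... group by visitorId).
    first_day = {}
    for day, visitor in rows:
        first_day[visitor] = min(day, first_day.get(visitor, day))

    # Step 2: number of new visitors per day; days without one stay at 0.
    days = sorted({day for day, _ in rows})
    new_per_day = {day: 0 for day in days}
    for day in first_day.values():
        new_per_day[day] += 1

    # Step 3: running sum over the days in order (sum(...) over (order by day)).
    totals = accumulate(new_per_day[day] for day in days)
    return dict(zip(days, totals))


result = cumulative_distinct(rows)
print(result)  # {1: 2, 2: 3, 3: 3, 4: 3}
```

Each visitor contributes exactly once, on their first day, so the running sum of new visitors equals the cumulative distinct count.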
Gordon Linoff answered Sep 20 '22 02:09


In SQL, you could do this.

select t1.day, sum(max(f.cnt)) over (order by t1.day) as visitors
from tbl t1
left join (select minday, count(*) as cnt
           from (select visitorID, min(day) as minday
                 from tbl
                 group by visitorID
                ) t
           group by minday
          ) f
on t1.day = f.minday
group by t1.day

  • Get the first day each visitorID appears using min.
  • Count the visitors per minday found above.
  • Left join this to your original table and take the cumulative sum, so days with no new visitors keep the previous total.
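
The three steps above can be tried out on the question's sample data with the following sketch. It uses Python's built-in sqlite3 module purely as a convenient stand-in for Spark SQL, which is an assumption of this example: it needs SQLite 3.25 or later for window functions, and behaviour on Spark 1.6.2 is not verified here. The aliases `first_seen` and `f` are chosen for readability.

```python
import sqlite3

# Sample (day, visitorID) rows from the question.
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (day integer, visitorID text)")
conn.executemany("insert into tbl values (?, ?)", rows)

query = """
select t1.day, sum(max(f.cnt)) over (order by t1.day) as visitors
from tbl t1
left join (select minday, count(*) as cnt         -- step 2: new visitors per first day
           from (select visitorID, min(day) as minday  -- step 1: first day per visitor
                 from tbl
                 group by visitorID
                ) first_seen
           group by minday
          ) f
on t1.day = f.minday                              -- step 3: keep every day of tbl
group by t1.day
order by t1.day
"""

result = conn.execute(query).fetchall()
for day, visitors in result:
    print(day, visitors)
conn.close()
```

Days without a new visitor get a NULL count from the left join; the windowed sum skips NULLs, so the previous total is carried forward for those days.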

Another approach would be

select t1.day, sum(count(t.visitorID)) over (order by t1.day) as cnt
from tbl t1
left join (select visitorID, min(day) as minday
           from tbl
           group by visitorID
          ) t
on t1.day = t.minday and t.visitorID = t1.visitorID
group by t1.day
Vamsi Prabhala answered Sep 21 '22 02:09