
Cumulative distinct count with Spark SQL

Tags: sql, apache-spark, apache-spark-sql

I am using Spark 1.6.2.

Here is the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

I want to count the distinct visitors per day, accumulated with all the days before it (I don't know the exact term for that, sorry).

This should give:

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3

I tried a self-join, but it is really too slow.
I am sure a window function is what I am looking for, but I didn't manage to find the right one :/

asked Jun 27 '17 13:06

Thomas Decaux

2 Answers

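In standard SQL, you should be able to do the following (note that Spark SQL does not accept DISTINCT inside a window function, so this form may fail there):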

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;
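
For reference, the running distinct count that this window expresses can be sketched in plain Python. This is only an illustration of the intended semantics on the question's sample rows, not Spark code, and the function name `cumulative_distinct` is made up for this example.

```python
from itertools import groupby

# Sample rows from the question: (day, visitorID)
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]


def cumulative_distinct(rows):
    """Return [(day, number of distinct visitors seen on or before that day)]."""
    seen = set()
    result = []
    # Walk the days in ascending order, like "over (order by day)".
    for day, group in groupby(sorted(rows), key=lambda r: r[0]):
        seen.update(visitor for _, visitor in group)
        result.append((day, len(seen)))
    return result


print(cumulative_distinct(rows))
# [(1, 2), (2, 3), (3, 3), (4, 3)]
```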

Actually, I think a better approach is to record a visitor only on the first day s/he appears:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;
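
A plain-Python sketch of the same idea, run on the question's sample rows, may help to see what this query returns. It is an illustration only, and the name `cumulative_by_start_day` is hypothetical. Note that it produces rows only for days on which at least one new visitor appears, so days 3 and 4 of the sample are absent from the result.

```python
from collections import Counter
from itertools import accumulate

# Sample rows from the question: (day, visitorID)
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]


def cumulative_by_start_day(rows):
    # Inner query: first day on which each visitor appears.
    start_day = {}
    for day, visitor in rows:
        start_day[visitor] = min(day, start_day.get(visitor, day))
    # "group by startday": number of visitors first seen on each day.
    new_per_day = Counter(start_day.values())
    days = sorted(new_per_day)
    # "sum(count(*)) over (order by startday)": running total of new visitors.
    totals = accumulate(new_per_day[d] for d in days)
    return list(zip(days, totals))


print(cumulative_by_start_day(rows))
# [(1, 2), (2, 3)]
```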

answered Sep 20 '22 02:09

Gordon Linoff

In SQL, you could do this:

select t1.day, sum(max(t.cnt)) over (order by t1.day) as visitors
from tbl t1
left join (select minday, count(*) as cnt
           from (select visitorID, min(day) as minday
                 from tbl
                 group by visitorID
                ) f
           group by minday
          ) t
on t1.day = t.minday
group by t1.day

Get the first day each visitorID appears, using min.
Count the visitors per minday found above.
Left join this to your original table, so that days without new visitors are kept, and take the cumulative sum.
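
The three steps above can be sketched in plain Python on the question's sample rows. This is an illustration of the logic rather than Spark code, and the function name `cumulative_visitors` is made up for the example. The left join is what keeps days with no new visitors (days 3 and 4 here) in the output; they simply add 0 to the running sum.

```python
# Sample rows from the question: (day, visitorID)
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]


def cumulative_visitors(rows):
    # Step 1: first day each visitorID appears (min(day) ... group by visitorID).
    first_day = {}
    for day, visitor in rows:
        first_day[visitor] = min(day, first_day.get(visitor, day))
    # Step 2: count visitors per first day (count(*) ... group by minday).
    cnt = {}
    for minday in first_day.values():
        cnt[minday] = cnt.get(minday, 0) + 1
    # Step 3: left join onto every day of the original table, then running sum.
    result, running = [], 0
    for day in sorted({day for day, _ in rows}):
        running += cnt.get(day, 0)  # no match in the join contributes 0
        result.append((day, running))
    return result


print(cumulative_visitors(rows))
# [(1, 2), (2, 3), (3, 3), (4, 3)]
```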

Another approach would be:

select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt 
from tbl t1
left join (select visitorID,min(day) as minday 
           from tbl 
           group by visitorID
          ) t 
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day

answered Sep 21 '22 02:09

Vamsi Prabhala
