Using Spark 1.6.2.
Here the data:
day | visitorID
-------------
1 | A
1 | B
2 | A
2 | C
3 | A
4 | A
I want to count how many distinct visitors by day + cumul with the day before (I dont know the exact term for that, sorry).
This should give:
day | visitors
--------------
1 | 2 (A+B)
2 | 3 (A+B+C)
3 | 3
4 | 3
In Pyspark, there are two ways to get the count of distinct values. We can use distinct() and count() functions of DataFrame to get the count distinct of PySpark DataFrame. Another way is to use SQL countDistinct() function which will provide the distinct value count of all the selected columns.
Cumulative Sum in SQL Server : In SQL server also you can calculate cumulative sum by using sum function. We can use same table as sample table. select dept_no Department_no, count(empno) Employee_Per_Dept, sum(count(*)) over (order by deptno) Cumulative_Total from [DBO].
Yes, you can use COUNT() and DISTINCT together to display the count of only distinct rows. SELECT COUNT(DISTINCT yourColumnName) AS anyVariableName FROM yourTableName; To understand the above syntax, let us create a table.
The COUNT DISTINCT function returns the number of unique values in the column or expression, as the following example shows. SELECT COUNT (DISTINCT item_num) FROM items; If the COUNT DISTINCT function encounters NULL values, it ignores them unless every value in the specified column is NULL.
You should be able to do:
select day, max(visitors) as visitors
from (select day,
count(distinct visitorId) over (order by day) as visitors
from t
) d
group by day;
Actually, I think a better approach is to record a visitor only on the first day s/he appears:
select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
from t
group by visitorId
) t
group by startday
order by startday;
In SQL, you could do this.
select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt
from (select visitorID,min(day) as minday
from tbl
group by visitorID
) t
group by minday
) t
on t1.day=t.minday
group by t1.day
min
. Another approach would be
select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt
from tbl t1
left join (select visitorID,min(day) as minday
from tbl
group by visitorID
) t
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With