I want to count how many records are true in a column of a grouped Spark DataFrame, but I don't know how to do that in Python. For example, I have data with Region, Salary and IsUnemployed columns, with IsUnemployed as a Boolean. I want to see how many unemployed people there are in each region. I know we could do a filter and then groupby, but I want to generate two aggregations at the same time, as below:
from pyspark.sql import functions as F
data.groupby("Region").agg(F.avg("Salary"), F.count("IsUnemployed"))
When we perform groupBy() on a PySpark DataFrame, it returns a GroupedData object that exposes aggregate functions such as count(), which returns the number of rows for each group; mean(), which returns the mean of the values for each group; and max(), which returns the maximum of the values for each group.
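For instance, a minimal sketch; the SparkSession setup and the sample rows below are made up here to mirror the question's Region, Salary and IsUnemployed columns:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the question's column layout
data = spark.createDataFrame(
    [("North", 52000.0, False),
     ("North", 0.0, True),
     ("South", 61000.0, False),
     ("South", 0.0, True)],
    ["Region", "Salary", "IsUnemployed"])

data.groupBy("Region").count().show()         # number of rows per region
data.groupBy("Region").mean("Salary").show()  # mean salary per region
data.groupBy("Region").max("Salary").show()   # maximum salary per region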
In PySpark, there are two ways to get the count of distinct values. We can chain the distinct() and count() methods of a DataFrame, or we can use the countDistinct() SQL function, which returns the distinct count over all of the selected columns.
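Both variants, sketched against the same made-up data DataFrame from the snippet above:
from pyspark.sql import functions as F

# Distinct regions via distinct() + count()
n_regions = data.select("Region").distinct().count()

# The same figure via the countDistinct() aggregate function
data.select(F.countDistinct("Region").alias("distinct_regions")).show()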
On the Scala side, the groupBy method is defined in the Dataset class and returns a RelationalGroupedDataset object, where the agg() method is defined; in PySpark the equivalent grouped object is GroupedData. Spark makes good use of object-oriented programming here. The grouped object also defines a sum() method, so a simple sum can be expressed with less code than a full agg() call.
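In PySpark that shortcut looks like this; a sketch, again using the made-up data from above:
# Shortcut: sum a numeric column per group without spelling out agg()
data.groupBy("Region").sum("Salary").show()

# Equivalent agg() form
data.groupBy("Region").agg(F.sum("Salary")).show()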
Probably the simplest solution is a plain CAST (C style, where TRUE -> 1 and FALSE -> 0) combined with SUM:
(data
    .groupby("Region")
    .agg(F.avg("Salary"), F.sum(F.col("IsUnemployed").cast("long"))))
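For readability the two aggregated columns can also be given explicit names; the aliases below are my own choice, not part of the original answer:
(data
    .groupby("Region")
    .agg(
        F.avg("Salary").alias("avg_salary"),
        F.sum(F.col("IsUnemployed").cast("long")).alias("unemployed_count"))
    .show())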
A slightly more universal and idiomatic solution is CASE WHEN combined with COUNT. COUNT ignores NULLs, and F.when() without an otherwise() yields NULL when the condition is false, so only the rows where IsUnemployed is true are counted:
(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))
but here it is clearly overkill.