I want to count how many records are true in a column of a grouped Spark DataFrame, but I don't know how to do that in Python. For example, I have data with Region, Salary and IsUnemployed columns, with IsUnemployed as a Boolean. I want to see how many unemployed people there are in each region. I know we could do a filter and then groupby, but I want to generate two aggregations at the same time, as below:
from pyspark.sql import functions as F
data.groupby("Region").agg(F.avg("Salary"), F.count("IsUnemployed"))
When we perform groupBy() on a PySpark DataFrame, it returns a GroupedData object that exposes aggregate functions such as count(), which returns the number of rows for each group; mean(), which returns the mean of the values for each group; and max(), which returns the maximum of the values for each group.
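For instance, a minimal sketch; the SparkSession setup and the sample rows below are made up here to mirror the question's Region, Salary and IsUnemployed columns:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the question's column layout
data = spark.createDataFrame(
    [("North", 52000.0, False),
     ("North", 0.0, True),
     ("South", 61000.0, False),
     ("South", 0.0, True)],
    ["Region", "Salary", "IsUnemployed"])

data.groupBy("Region").count().show()         # number of rows per region
data.groupBy("Region").mean("Salary").show()  # mean salary per region
data.groupBy("Region").max("Salary").show()   # maximum salary per region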
In PySpark, there are two ways to get the count of distinct values. We can chain the distinct() and count() methods of a DataFrame, or we can use the countDistinct() SQL function, which returns the distinct count over all of the selected columns.
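Both variants, sketched against the same made-up data DataFrame from the snippet above:
from pyspark.sql import functions as F

# Distinct regions via distinct() + count()
n_regions = data.select("Region").distinct().count()

# The same figure via the countDistinct() aggregate function
data.select(F.countDistinct("Region").alias("distinct_regions")).show()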
On the Scala side, the groupBy method is defined in the Dataset class and returns a RelationalGroupedDataset object, where the agg() method is defined; in PySpark the equivalent grouped object is GroupedData. Spark makes good use of object-oriented programming here. The grouped object also defines a sum() method, so a simple sum can be expressed with less code than a full agg() call.
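In PySpark that shortcut looks like this; a sketch, again using the made-up data from above:
# Shortcut: sum a numeric column per group without spelling out agg()
data.groupBy("Region").sum("Salary").show()

# Equivalent agg() form
data.groupBy("Region").agg(F.sum("Salary")).show()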
Probably the simplest solution is a plain CAST (C style, where TRUE -> 1 and FALSE -> 0) combined with SUM:
(data
    .groupby("Region")
    .agg(F.avg("Salary"), F.sum(F.col("IsUnemployed").cast("long"))))
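For readability the two aggregated columns can also be given explicit names; the aliases below are my own choice, not part of the original answer:
(data
    .groupby("Region")
    .agg(
        F.avg("Salary").alias("avg_salary"),
        F.sum(F.col("IsUnemployed").cast("long")).alias("unemployed_count"))
    .show())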
A slightly more universal and idiomatic solution is CASE WHEN combined with COUNT. COUNT ignores NULLs, and F.when() without an otherwise() yields NULL when the condition is false, so only the rows where IsUnemployed is true are counted:
(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))
but here it is clearly overkill.