
PySpark: How to calculate avg and count in a single groupBy? [duplicate]

I would like to calculate avg and count in a single groupBy statement in PySpark. How can I do that?

df = spark.createDataFrame([(1, 'John', 1.79, 28,'M', 'Doctor'),
                        (2, 'Steve', 1.78, 45,'M', None),
                        (3, 'Emma', 1.75, None, None, None),
                        (4, 'Ashley',1.6, 33,'F', 'Analyst'),
                        (5, 'Olivia', 1.8, 54,'F', 'Teacher'),
                        (6, 'Hannah', 1.82, None, 'F', None),
                        (7, 'William', 1.7, 42,'M', 'Engineer'),
                        (None,None,None,None,None,None),
                        (8,'Ethan',1.55,38,'M','Doctor'),
                        (9,'Hannah',1.65,None,'F','Doctor')]
                       , ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession'])

# This only shows avg, but I also need count right next to it. How can I do that?

df.groupBy("Profession").agg({"Age":"avg"}).show()
df.show()

Thank you.

asked Aug 01 '18 by melik

People also ask

How do you use groupBy and count in PySpark?

PySpark's groupBy count is used to get the number of records in each group. First call groupBy() on the DataFrame to group the records by one or more column values, then call count() to get the number of records per group.

How do you count duplicates in PySpark?

In PySpark, you can use distinct().count() on a DataFrame, or the countDistinct() SQL function, to get a distinct count. distinct() removes duplicate records (rows matching on all columns) from the DataFrame, and count() then returns the number of remaining records.

How do you calculate average in PySpark?

Using the select() method: to return the average value of multiple columns, call the avg() method inside select(), passing the column names separated by commas, where df is the input PySpark DataFrame and column_name is the column whose average you want.

How do you get the sum of count in PySpark?

sumDistinct() in PySpark returns the total (sum) of the distinct values in a column. It adds each unique value once and ignores duplicates.


1 Answer

For the same column:

from pyspark.sql import functions as F
df.groupBy("Profession").agg(F.mean('Age'), F.count('Age')).show()

If you're able to use different columns:

df.groupBy("Profession").agg({'Age':'avg', 'Gender':'count'}).show()
answered Sep 27 '22 by Pierre Gourseaud