I would like to group a PySpark dataframe by a column and compute the variance of another column. For the average this is quite easy and can be done like this:
from pyspark.sql import functions as func
AVERAGES=df.groupby('country').agg(func.avg('clicks').alias('avg_clicks')).collect()
However, for the variance there does not seem to be any aggregation function in the functions sub-module (I am also wondering why, since this is quite a common operation).
How to get the variance of a PySpark dataframe column? Pass the column name as a parameter to the variance() function. You can similarly use the var_samp() function to get the sample variance and the var_pop() function to get the population variance. All of these functions are available in the pyspark.sql.functions module.
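As a sketch, assuming the same df with 'country' and 'clicks' columns as in the question (the alias names here are just illustrative), the grouped variance could be computed like this:

from pyspark.sql import functions as func

# variance() is an alias for var_samp(); var_pop() gives the population variance
variance_df = df.groupby('country').agg(
    func.variance('clicks').alias('clicks_variance'),
    func.var_samp('clicks').alias('clicks_var_samp'),
    func.var_pop('clicks').alias('clicks_var_pop'),
)
variance_df.show()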
By using the stddev_samp() function, we can get the sample standard deviation of a column, and finally we can use the collect() method to bring the result back to the driver. Here, df is the input PySpark DataFrame and column_name is the column to compute the sample standard deviation of.
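A minimal sketch of that pattern, again assuming a df with 'country' and 'clicks' columns:

from pyspark.sql import functions as func

# Sample standard deviation of clicks per country, collected back to the driver
stddev_rows = df.groupby('country').agg(
    func.stddev_samp('clicks').alias('clicks_stddev')
).collect()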
1. PySpark's GroupBy with agg() is used to combine multiple aggregate functions and analyze the result together.
2. It computes all of the requested aggregations over the grouped data in a single pass.
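For example, several aggregations can be combined in one agg() call (the column aliases below are just illustrative):

from pyspark.sql import functions as func

# Average, variance and count of clicks computed together per country
summary_df = df.groupby('country').agg(
    func.avg('clicks').alias('avg_clicks'),
    func.variance('clicks').alias('var_clicks'),
    func.count('clicks').alias('n_clicks'),
)
summary_df.show()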
What you can do is convert the dataframe to an RDD object and then use the provided variance function for RDDs.
df1 = df.groupby('country').agg(func.avg('clicks').alias('avg_clicks'))
# RDD.variance() expects an RDD of numbers, so extract the numeric column first
rdd = df1.rdd.map(lambda row: row['avg_clicks'])
rdd.variance()
Since the standard deviation is the square root of the variance, a pure PySpark DataFrame solution is:
from pyspark.sql.functions import stddev

df = sc.parallelize(((.1, 2.0), (.3, .2))).toDF()
df.show()
# variance is the standard deviation squared
varianceDF = df.select(stddev('_1') * stddev('_1'))
varianceDF.show()