I would like to group a PySpark dataframe by a column and compute the variance of another column. For the average this is quite easy and can be done like this:
from pyspark.sql import functions as func
AVERAGES=df.groupby('country').agg(func.avg('clicks').alias('avg_clicks')).collect()
However, for the variance there does not seem to be any aggregation function in the functions sub-module (I am also wondering why, since this is quite a common operation).
How to get the variance of a PySpark dataframe column? Pass the column name as a parameter to the variance() function. You can similarly use the var_samp() function to get the sample variance and the var_pop() function to get the population variance. All of these functions are available in the pyspark.sql.functions module.
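As a sketch, assuming the same df with 'country' and 'clicks' columns as in the question (the alias names here are just illustrative), the grouped variance could be computed like this:

from pyspark.sql import functions as func

# variance() is an alias for var_samp(); var_pop() gives the population variance
variance_df = df.groupby('country').agg(
    func.variance('clicks').alias('clicks_variance'),
    func.var_samp('clicks').alias('clicks_var_samp'),
    func.var_pop('clicks').alias('clicks_var_pop'),
)
variance_df.show()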
By using the stddev_samp() function, we can get the sample standard deviation of a column, and finally we can use the collect() method to bring the result back to the driver. Here, df is the input PySpark DataFrame and column_name is the column to compute the sample standard deviation of.
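A minimal sketch of that pattern, again assuming a df with 'country' and 'clicks' columns:

from pyspark.sql import functions as func

# Sample standard deviation of clicks per country, collected back to the driver
stddev_rows = df.groupby('country').agg(
    func.stddev_samp('clicks').alias('clicks_stddev')
).collect()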
1. PySpark's GroupBy with agg() is used to combine multiple aggregate functions and analyze the result together.
2. It computes all of the requested aggregations over the grouped data in a single pass.
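For example, several aggregations can be combined in one agg() call (the column aliases below are just illustrative):

from pyspark.sql import functions as func

# Average, variance and count of clicks computed together per country
summary_df = df.groupby('country').agg(
    func.avg('clicks').alias('avg_clicks'),
    func.variance('clicks').alias('var_clicks'),
    func.count('clicks').alias('n_clicks'),
)
summary_df.show()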
What you can do is convert the dataframe to an RDD object and then use the provided variance function for RDDs.
df1 = df.groupby('country').agg(func.avg('clicks').alias('avg_clicks'))
# RDD.variance() expects an RDD of numbers, so extract the numeric column first
rdd = df1.rdd.map(lambda row: row['avg_clicks'])
rdd.variance()
Since the standard deviation is the square root of the variance, a pure PySpark DataFrame solution is:
from pyspark.sql.functions import stddev

df = sc.parallelize(((.1, 2.0), (.3, .2))).toDF()
df.show()
# variance is the standard deviation squared
varianceDF = df.select(stddev('_1') * stddev('_1'))
varianceDF.show()