 

pyspark dataframe, groupby and compute variance of a column

I would like to groupby a pyspark dataframe and compute the variance of a specific column. For the average, this is quite easy and can be done like this:

from pyspark.sql import functions as func
AVERAGES=df.groupby('country').agg(func.avg('clicks').alias('avg_clicks')).collect()

However, for the variance there does not seem to be any aggregation function in the functions sub-module (I am also wondering why, since this is quite a common operation).

asked Aug 12 '15 by Luca Fiaschi

People also ask

How do you find the variance in PySpark?

How to get variance for a PySpark dataframe column? Pass the column name as a parameter to the variance() function. You can similarly use the var_samp() function to get the sample variance and the var_pop() function to get the population variance. All of these functions are available in the pyspark.sql.functions module.
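For example, a minimal sketch assuming a df with the 'country' and 'clicks' columns from the question:

from pyspark.sql import functions as func

# sample and population variance per group
VARIANCES = df.groupby('country').agg(
    func.var_samp('clicks').alias('var_clicks'),    # sample variance
    func.var_pop('clicks').alias('var_pop_clicks')  # population variance
).collect()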

How do you find the standard deviation of a column in PySpark?

By using the stddev_samp() function, we can get the sample standard deviation of a column, and then use the collect() method to retrieve the result. Here, df is the input PySpark DataFrame and column_name is the column to compute the sample standard deviation of.
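A minimal sketch, assuming the question's df with a numeric 'clicks' column:

from pyspark.sql import functions as func

# sample standard deviation of the 'clicks' column
result = df.select(func.stddev_samp('clicks').alias('std_clicks')).collect()
print(result[0]['std_clicks'])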

What does .agg() do in PySpark?

PySpark's GroupBy agg() is used to combine multiple aggregate functions and evaluate them together, so several statistics over the grouped data can be computed in a single pass.
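A minimal sketch, again assuming the question's df, that combines several aggregates in a single agg() call:

from pyspark.sql import functions as func

# one pass over the grouped data, several aggregates at once
summary = df.groupby('country').agg(
    func.avg('clicks').alias('avg_clicks'),
    func.var_samp('clicks').alias('var_clicks'),
    func.count('clicks').alias('n_rows')
)
summary.show()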


2 Answers

What you can do is convert the dataframe to an RDD and then use the variance() function provided for RDDs of numbers.

df1 = df.groupby('country').agg(func.avg('clicks').alias('avg_clicks'))
# extract the numeric column first: df1.rdd is an RDD of Rows,
# and RDD.variance() needs an RDD of numbers
rdd = df1.rdd.map(lambda row: row['avg_clicks'])
rdd.variance()
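Note that RDD.variance() returns the population variance; rdd.sampleVariance() gives the sample variance if that is what you need.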
answered Oct 18 '22 by Jared

As standard deviation is the square root of variance, a pure PySpark dataframe solution is:

from pyspark.sql.functions import stddev

df = sc.parallelize(((.1, 2.0), (.3, .2))).toDF()
df.show()

# stddev() is the sample standard deviation, so its square is the sample variance
varianceDF = df.select(stddev('_1') * stddev('_1'))
varianceDF.show()
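Since stddev is an alias for the sample standard deviation (stddev_samp), this gives the sample variance. A minimal equivalent, assuming the same df, is to ask for the variance directly:

from pyspark.sql.functions import variance

# variance() is an alias for var_samp(), i.e. the sample variance
df.select(variance('_1')).show()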
answered Oct 18 '22 by blue-sky