
Calculate the standard deviation of grouped data in a Spark DataFrame

I have user logs that I have taken from a CSV file and converted into a DataFrame in order to leverage the SparkSQL querying features. A single user will create numerous entries per hour, and I would like to gather some basic statistical information for each user: really just the count of user instances, the average, and the standard deviation of several columns. I was able to quickly get the mean and count information by using groupBy($"user") and the aggregator with SparkSQL functions for count and avg:

val meanData = selectedData.groupBy($"user").agg(
  count($"logOn"),
  avg($"transaction"),
  avg($"submit"),
  avg($"submitsPerHour"),
  avg($"replies"),
  avg($"repliesPerHour"),
  avg($"duration"))

However, I cannot seem to find an equally elegant way to calculate the standard deviation. So far I can only calculate it by mapping to a (String, Double) pair and using the StatCounter().stdev utility:

val stdevduration = duration.groupByKey().mapValues(value => org.apache.spark.util.StatCounter(value).stdev) 

This returns an RDD however, and I would like to try and keep it all in a DataFrame for further queries to be possible on the returned data.
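(For reference, a minimal sketch of one way to bring such an RDD back into a DataFrame so it can be joined with meanData, assuming duration is an RDD[(String, Double)] and sqlContext.implicits._ is in scope; the column names here are illustrative:)

import sqlContext.implicits._  // enables .toDF on RDDs of tuples

// Turn the (user, stdev) pairs back into a DataFrame and join it
// onto the aggregated means computed above.
val stdevDF = stdevduration.toDF("user", "duration_sd")
val combined = meanData.join(stdevDF, "user")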

asked Aug 03 '15 by the3rdNotch



1 Answer

Spark 1.6+

You can use stddev_pop to compute population standard deviation and stddev / stddev_samp to compute unbiased sample standard deviation:

import org.apache.spark.sql.functions.{stddev_samp, stddev_pop}

selectedData.groupBy($"user").agg(stddev_pop($"duration"))
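If you also want the count and means from the original query in the same pass, the aggregations can be combined (a sketch; column names are taken from the question):

import org.apache.spark.sql.functions.{count, avg, stddev_samp, stddev_pop}

// Count, mean, and both flavors of standard deviation in one aggregation.
val stats = selectedData.groupBy($"user").agg(
  count($"logOn").alias("n"),
  avg($"duration").alias("duration_mean"),
  stddev_samp($"duration").alias("duration_sd_samp"),
  stddev_pop($"duration").alias("duration_sd_pop"))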

Spark 1.5 and below (the original answer):

Not so pretty and biased (same as the value returned from describe), but using the formula

σ = √( E[X²] − (E[X])² )

you can do something like this:

import org.apache.spark.sql.functions.sqrt

selectedData
  .groupBy($"user")
  .agg(sqrt(
      avg($"duration" * $"duration") -
      avg($"duration") * avg($"duration")
    ).alias("duration_sd"))

You can of course create a function to reduce the clutter:

import org.apache.spark.sql.Column

def mySd(col: Column): Column = {
  sqrt(avg(col * col) - avg(col) * avg(col))
}

df.groupBy($"user").agg(mySd($"duration").alias("duration_sd"))

It is also possible to use a Hive UDF:

df.registerTempTable("df")

sqlContext.sql("""SELECT user, stddev(duration)
                  FROM df
                  GROUP BY user""")
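The same Hive UDAF can also be called through the DataFrame API with callUDF, which avoids the SQL string entirely (a sketch; assumes sqlContext is a HiveContext so the Hive stddev function is visible):

import org.apache.spark.sql.functions.callUDF

// Invoke the Hive stddev UDAF by name from the DataFrame API.
df.groupBy($"user").agg(callUDF("stddev", $"duration").alias("duration_sd"))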

Source of the formula: https://en.wikipedia.org/wiki/Standard_deviation

answered Sep 19 '22 by zero323