 

How can I calculate the exact median with Apache Spark?

This page contains some statistics functions (mean, stdev, variance, etc.), but it does not include the median. How can I calculate the exact median?

asked Jan 26 '15 by pckmn

People also ask

How do you find the median in Spark?

We can define our own UDF in PySpark and use the Python library NumPy, which has a method (np.median) that calculates the median. The DataFrame is first grouped by a column value, and after grouping, the column whose median needs to be calculated is collected as a list so the UDF can be applied to it.
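
A minimal sketch of that approach (the DataFrame, the column names group and value, and the sample rows are all made up for illustration):

  import numpy as np
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F
  from pyspark.sql.types import DoubleType

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical sample data: (group, value)
  df = spark.createDataFrame(
      [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 4.0), ("b", 5.0)],
      ["group", "value"])

  # UDF that applies numpy's median to a collected list of values
  median_udf = F.udf(lambda xs: float(np.median(xs)), DoubleType())

  # Group, collect the column as a list, then apply the UDF
  (df.groupBy("group")
     .agg(F.collect_list("value").alias("values"))
     .withColumn("median", median_udf("values"))
     .show())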

How do you find the median in SQL?

To get the median, use PERCENTILE_CONT(0.5). If you want the median over a specific group of rows, add the OVER (PARTITION BY) clause. Here PARTITION BY is used on the column OrderID to find the median of unit prices for each order id.
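
Note that PERCENTILE_CONT as described above is ANSI SQL (e.g. SQL Server or PostgreSQL). In Spark SQL you can get an exact median per group with Spark's own percentile aggregate instead; a sketch, with a made-up orders table (assumes an active SparkSession spark, as in the pyspark shell):

  # Hypothetical orders data: (OrderID, UnitPrice)
  orders = spark.createDataFrame(
      [(1, 10.0), (1, 20.0), (2, 5.0), (2, 7.0), (2, 9.0)],
      ["OrderID", "UnitPrice"])
  orders.createOrReplaceTempView("orders")

  # percentile(col, 0.5) is Spark SQL's exact percentile aggregate
  spark.sql("""
      SELECT OrderID, percentile(UnitPrice, 0.5) AS median_price
      FROM orders
      GROUP BY OrderID
  """).show()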

How do you get Quantiles in PySpark?

In order to calculate the quantile rank, decile rank and ntile rank in PySpark we use the ntile() function. Passing 4 to ntile() computes the quantile (quartile) rank of the column; passing 10 computes the decile rank.
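
A sketch of ntile() in PySpark, on a made-up value column (assumes an active SparkSession spark):

  from pyspark.sql.functions import ntile
  from pyspark.sql.window import Window

  df = spark.range(1, 13).toDF("value")

  # ntile(4) assigns each row to one of 4 buckets -> quantile (quartile) rank;
  # ntile(10) would give the decile rank instead
  w = Window.orderBy("value")
  df.withColumn("quantile_rank", ntile(4).over(w)).show()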

How does Spark SQL work?

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
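
A minimal illustration (names are made up; assumes an active SparkSession spark): the same data can be queried through the DataFrame API or through SQL after registering a temporary view.

  df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
  df.createOrReplaceTempView("people")

  # DataFrames and SQL are two front ends to the same engine
  spark.sql("SELECT name FROM people WHERE age > 30").show()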


2 Answers

You need to sort the RDD and take the element in the middle, or the average of the two middle elements. Here is an example with RDD[Int]:

  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  val rdd: RDD[Int] = ???

  // Sort the values and key each element by its position in sorted order
  val sorted = rdd.sortBy(identity).zipWithIndex().map {
    case (v, idx) => (idx, v)
  }

  val count = sorted.count()

  // Even count: average the two middle elements; odd count: take the middle one
  val median: Double = if (count % 2 == 0) {
    val l = count / 2 - 1
    val r = l + 1
    (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
  } else sorted.lookup(count / 2).head.toDouble
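
For comparison, here is a sketch of the same sort-and-index approach in PySpark (the sample numbers are made up; sortBy, zipWithIndex and lookup are all standard RDD methods):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical sample data
  rdd = spark.sparkContext.parallelize([3, 1, 4, 1, 5, 9, 2, 6])

  # Key each value by its position in sorted order
  indexed = rdd.sortBy(lambda x: x).zipWithIndex().map(lambda vi: (vi[1], vi[0]))

  n = indexed.count()

  if n % 2 == 0:
      lo = n // 2 - 1
      median = (indexed.lookup(lo)[0] + indexed.lookup(lo + 1)[0]) / 2.0
  else:
      median = indexed.lookup(n // 2)[0]

  print(median)  # 3.5 for this sample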
answered Sep 29 '22 by Eugene Zhulenev


Using Spark 2.0+ and the DataFrame API you can use the approxQuantile method:

def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double)

Since Spark version 2.2 it also works on multiple columns at the same time. By setting probabilities to Array(0.5) and relativeError to 0, it will compute the exact median. From the documentation:

The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive.

Despite this, there seem to be some issues with the precision when setting relativeError to 0; see the question here. In some instances a low error close to 0 will work better (this will depend on the Spark version).


A small working example which calculates the median of the numbers from 1 to 99 (both inclusive) and uses a low relativeError:

val df = (1 to 99).toDF("num")
val median = df.stat.approxQuantile("num", Array(0.5), 0.001)(0)
println(median)

The median returned is 50.0.
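
The same method is available on PySpark DataFrames; a sketch of the equivalent call (approxQuantile takes a list of probabilities and returns a list of results; assumes an active SparkSession spark):

  df = spark.range(1, 100).toDF("num")

  # Probability 0.5 with a low relative error gives the (near-)exact median
  median = df.approxQuantile("num", [0.5], 0.001)[0]
  print(median)  # 50.0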

answered Sep 29 '22 by Shaido