How can I calculate exact median with Apache Spark?

2 Answers

You need to sort the RDD and take the middle element, or the average of the two middle elements when the count is even. Here is an example with RDD[Int]:

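  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  val rdd: RDD[Int] = ???

  // Sort, then key each value by its position so it can be looked up by index.
  val sorted = rdd.sortBy(identity).zipWithIndex().map {
    case (v, idx) => (idx, v)
  }.cache() // reused by count() and by every lookup()

  val count = sorted.count()

  val median: Double = if (count % 2 == 0) {
    // Even count: average the two middle elements (convert first to avoid Int overflow).
    val l = count / 2 - 1
    val r = l + 1
    (sorted.lookup(l).head.toDouble + sorted.lookup(r).head.toDouble) / 2
  } else sorted.lookup(count / 2).head.toDouble // odd count: the middle element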
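The index arithmetic above can be checked without a cluster. The following is a minimal sketch that applies the same even/odd logic to an in-memory collection; `LocalMedian` and `median` are hypothetical names used only for illustration, and the input is assumed to be non-empty.

```scala
// Spark-free sketch of the index arithmetic used in the RDD version.
object LocalMedian {
  // Hypothetical helper: median of a non-empty in-memory sequence.
  def median(values: Seq[Int]): Double = {
    require(values.nonEmpty, "median of an empty collection is undefined")
    val sorted = values.sorted.toIndexedSeq
    val count = sorted.length
    if (count % 2 == 0) {
      // Even count: average the two middle elements.
      val l = count / 2 - 1
      val r = l + 1
      (sorted(l).toDouble + sorted(r).toDouble) / 2
    } else sorted(count / 2).toDouble // odd count: the middle element
  }

  def main(args: Array[String]): Unit = {
    println(median(Seq(5, 1, 3)))    // prints 3.0
    println(median(Seq(4, 1, 3, 2))) // prints 2.5
  }
}
```

The RDD version follows the same steps, with `lookup` on the index-keyed RDD taking the place of direct indexing.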


answered Sep 29 '22 14:09

Eugene Zhulenev

Using Spark 2.0+ and the DataFrame API, you can use the approxQuantile method:

def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double)

Since Spark 2.2, it also works on multiple columns at the same time. Setting probabilities to Array(0.5) and relativeError to 0 computes the exact median. From the documentation:

The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive.

Despite this, there seem to be some precision issues when relativeError is set to 0; see the question here. Depending on the Spark version, a small error close to 0 will in some instances work better.

A small working example which calculates the median of the numbers from 1 to 99 (both inclusive) and uses a low relativeError:

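import spark.implicits._ // needed for toDF; assumes a SparkSession named spark (as in spark-shell)

val df = (1 to 99).toDF("num")
val median = df.stat.approxQuantile("num", Array(0.5), 0.001)(0) // first (and only) requested quantile
println(median)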

The median returned is 50.0.
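The multi-column form available since Spark 2.2 could be used roughly as follows. This is a sketch, not a definitive implementation: `MultiColumnMedian`, `medians`, and the `scaled` column are hypothetical names, it assumes Spark 2.2+ on the classpath with a local master, and it assumes every listed column is numeric and contains at least one non-null value.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiColumnMedian {
  // Hypothetical helper: one approximate median per requested column.
  def medians(df: DataFrame, cols: Array[String], relativeError: Double): Map[String, Double] = {
    // Multi-column overload: result(i)(j) is the j-th requested probability
    // for the i-th column. Only one probability (0.5) is requested here.
    val result: Array[Array[Double]] =
      df.stat.approxQuantile(cols, Array(0.5), relativeError)
    // Assumption: each inner array is non-empty, i.e. no column is empty or all null.
    cols.zip(result.map(_.head)).toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("median-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two numeric columns built from the numbers 1 to 99.
    val df = (1 to 99).map(i => (i, i * 10.0)).toDF("num", "scaled")
    medians(df, Array("num", "scaled"), 0.001).foreach(println)

    spark.stop()
  }
}
```

Passing all columns in one call lets Spark compute the quantile summaries in a single pass over the data instead of one pass per column.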

answered Sep 29 '22 15:09

Shaido
