approxQuantile gives incorrect median in Spark (Scala)?

Question

I have this test data:

      val data = List(
        List(47.5335D),
        List(67.5335D),
        List(69.5335D),
        List(444.1235D),
        List(677.5335D)
      )

I expect the median to be 69.5335, but when I try to compute the exact median with this code:

df.stat.approxQuantile(column, Array(0.5), 0)

It gives me: 444.1235

Why is this so, and how can it be fixed?

I'm doing it like this:

      val data = List(
        List(47.5335D),
        List(67.5335D),
        List(69.5335D),
        List(444.1235D),
        List(677.5335D)
      )

      val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
      val schema = StructType(Array(
        StructField("value", DataTypes.DoubleType, false)
      ))

      val df = sqlContext.createDataFrame(rdd, schema)
      df.createOrReplaceTempView(tableName)
      // Query the temp view through the same sqlContext that created the DataFrame
      val df2 = sqlContext.sql(s"SELECT value FROM $tableName")
      val median = df2.stat.approxQuantile("value", Array(0.5), 0)

So I'm creating a temp table, then querying it and calculating the result. It's just for testing.

Amir · Accepted Answer

Note that this is an approximate quantile computation. It is not guaranteed to give you the exact answer every time. See here for a more thorough explanation.
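
To make "approximate" concrete: the Spark documentation for approxQuantile states a deterministic bound, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N), for a DataFrame of N rows, probability p, and relative error err. The following is only a sketch of that arithmetic in plain Scala, not Spark's implementation. The names RankWindowSketch and rankWindow are hypothetical, and clamping the window to [0, N] is an assumption added here for readability.

```scala
object RankWindowSketch {
  // Inclusive window of ranks permitted by the documented bound
  //   floor((p - err) * n) <= rank(x) <= ceil((p + err) * n)
  // Clamping to [0, n] is an assumption of this sketch, not part of the bound.
  def rankWindow(n: Long, p: Double, err: Double): (Long, Long) = {
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    (math.max(lo, 0L), math.min(hi, n))
  }

  def main(args: Array[String]): Unit = {
    println(rankWindow(5, 0.5, 0.0))  // prints (2,3)
    println(rankWindow(5, 0.5, 0.25)) // prints (1,4)
  }
}
```

A larger err widens the window, so more of the five test values would count as an acceptable "median".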

The reason is that for very large datasets, an approximate answer is sometimes acceptable, as long as you get it significantly faster than the exact computation.
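
If an exact median is what you need and the data is small enough to afford it, one option is Spark SQL's built-in percentile aggregate, which computes an exact percentile rather than an approximation. The sketch below assumes Spark 2.x or later with a SparkSession available. The names ExactMedianSketch and exactMedian are hypothetical, and the local master setting is only there for testing.

```scala
import org.apache.spark.sql.SparkSession

object ExactMedianSketch {
  // Exact median via the SQL aggregate percentile(col, 0.5).
  // Assumes the values fit comfortably in the cluster; exact percentiles
  // can be expensive on very large datasets.
  def exactMedian(spark: SparkSession, values: Seq[Double]): Double = {
    import spark.implicits._
    val df = values.toDF("value")
    df.selectExpr("percentile(value, 0.5)").head().getDouble(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for testing only
      .appName("exact-median-sketch")
      .getOrCreate()

    // Same five values as in the question
    val data = Seq(47.5335, 67.5335, 69.5335, 444.1235, 677.5335)
    println(exactMedian(spark, data))

    spark.stop()
  }
}
```

For an odd number of rows this returns the middle value of the sorted column. The trade-off is the one described above: exactness costs more time than the approximate computation.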

Tags:

scala

apache-spark

sergeda