Spark - How to calculate percentiles in Spark?

I was trying to get the 0.8 percentile of a single-column DataFrame. I tried it this way:

val limit80 = 0.8
val dfSize = df.count()
val percentileIndex = (dfSize * limit80).toInt

val dfSorted = df.sort()
val percentile80 = dfSorted.take(percentileIndex).last

But I think this will fail for big dataframes, since they may be distributed across different nodes.

Is there a better way to calculate the percentile? Or how could I get all the rows of the DataFrame onto the same machine (even though that is an anti-pattern), so that df.take(index) really operates on the whole dataset and not just on one partition of a node?
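For reference, the take-based approach above only works if the rank index accounts for rows in all partitions. A sketch that does this with a global index via `zipWithIndex`, assuming a hypothetical numeric column named `num`, could look like:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("percentile").getOrCreate()
import spark.implicits._

// Toy data: 1..100, so the 0.8 percentile should land on 80
val df = (1 to 100).toDF("num")
val n = df.count()
val idx = math.ceil(n * 0.8).toLong - 1  // 0-based rank of the 0.8 percentile

val p80 = df.sort("num").rdd
  .zipWithIndex()                         // assigns a global index across partitions
  .filter { case (_, i) => i == idx }
  .map { case (row, _) => row.getInt(0) }
  .first()
// p80 == 80 for this data
```

Note that the global sort plus `zipWithIndex` still shuffles the whole dataset, so this is exact but not cheap.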

asked Nov 30 '22 by Ignacio Alorre

1 Answer

For Spark 2.x, you can use approxQuantile, as in the following example:

val df = Seq(
  10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
  20, 21, 22, 23, 24, 25, 26, 27, 28, 29
).toDF("num")

df.stat.approxQuantile("num", Array(0.8), 0.1)
// res4: Array[Double] = Array(26.0)

Note that the smaller the 3rd parameter relativeError, the more expensive the calculation. Here's the relevant note in the API doc:

relativeError: The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive.
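If you do need the exact value, a short sketch (assuming Spark 2.x or later) is to pass a relativeError of 0; the same statistic is also reachable through the SQL function percentile_approx:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("exactQuantile").getOrCreate()
import spark.implicits._

val df = (10 to 29).toDF("num")

// relativeError = 0 forces exact computation (potentially expensive on large data)
val exact = df.stat.approxQuantile("num", Array(0.8), 0.0)

// Equivalent via Spark SQL, if percentile_approx is available in your version
val viaSql = df.selectExpr("percentile_approx(num, 0.8) as p80")
```

The trade-off is the same as the doc note describes: relativeError = 0 gives exactness at the cost of a much heavier computation.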

answered Dec 05 '22 by Leo C