I was trying to get the 80th percentile (the 0.8 quantile) of a single-column DataFrame. I tried it this way:
val limit80 = 0.8
val dfSize = df.count()
// take() needs an Int, so cast the computed rank
val percentileIndex = (dfSize * limit80).toInt
// sort by the single column (df.columns.head avoids hard-coding its name)
val dfSorted = df.sort(df.columns.head)
val percentile80 = dfSorted.take(percentileIndex).last
But I think this will fail for big DataFrames, since they may be distributed across different nodes.
Is there a better way to calculate the percentile? Or how could I get all the rows of the DataFrame onto the same machine (even if that is very much an anti-pattern), so that df.take(index) really takes the whole dataset into account and not just one partition on a node?
For Spark 2.x, you can use approxQuantile, as in the following example:
val df = Seq(
10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29
).toDF("num")
df.stat.approxQuantile("num", Array(0.8), 0.1)
// res4: Array[Double] = Array(26.0)
Note that the smaller the 3rd parameter relativeError, the more expensive the calculation becomes. Here's the relevant note from the API doc:
relativeError: The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive.
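To see that trade-off in practice, here is a small sketch reusing the toy df above. It goes beyond the original answer: it relies on relativeError = 0.0 producing an exact result (per the doc note just quoted), and it assumes Spark 2.1+, where the percentile_approx SQL aggregate is built in.

// relativeError = 0.0 requests the exact quantile; cheap on this toy
// dataset, but potentially very expensive on a large, distributed one
df.stat.approxQuantile("num", Array(0.8), 0.0)

// The same statistic via the SQL aggregate (Spark 2.1+); its optional
// third argument is an integer accuracy knob, not a relative error
df.selectExpr("percentile_approx(num, 0.8)").show()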