I'm trying to use the `percentile` function in Spark SQL.
Data:
col1
----
198
15.8
198
198
198
198
198
198
198
198
198
If I use the query below, the percentile value I get seems incorrect:
select percentile(col1, 0.05) from tblname
output: 106.9
If I use the query below, the value is also not what I expect:
select percentile(col1, 0.05, 2) from tblname
output: 24.91000000000001
But if I use the query below, I get the value I expect (though I don't understand why or how):
select percentile(col1, 0.05, 100) from tblname
output: 15.8
Can anyone help me understand how the last argument changes the result? Is there any documentation? I checked the docstrings in the Spark source code (I don't know Scala) but had no luck, and found nothing on the official website either. All I have is this description:
> percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive integral.
Link
The frequency argument specifies how many times an element should be counted, so when you specify frequency 100, each element is counted 100 times.
This allows each distinct percentile value to have a specific item it can map to, which removes the need for interpolation.
Note that you can always find a percentile that still falls between two distinct values and therefore gets interpolated, giving you an unexpected result. For example, in your case, try percentile 0.0901, i.e. the 9.01th percentile.
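The behavior above can be sketched in plain Python. This is an assumption about Spark's implementation, but it matches all three of your outputs: the exact percentile is taken at rank p * (n - 1) over the (frequency-expanded) sorted values, with linear interpolation between the two neighboring values when the rank is fractional.

```python
def percentile(values, p, frequency=1):
    """Exact percentile with linear interpolation; each value is counted
    `frequency` times, mirroring percentile(col, p, frequency)."""
    expanded = sorted(v for v in values for _ in range(frequency))
    rank = p * (len(expanded) - 1)   # target rank in the expanded list
    lo = int(rank)                   # index just below the target rank
    frac = rank - lo                 # fractional part -> interpolation weight
    if frac == 0:
        return expanded[lo]
    return expanded[lo] + frac * (expanded[lo + 1] - expanded[lo])

data = [198.0] * 10 + [15.8]

print(percentile(data, 0.05))         # ~106.9: rank 0.5 sits between 15.8 and 198
print(percentile(data, 0.05, 2))      # ~24.91: rank 1.05 still straddles the gap
print(percentile(data, 0.05, 100))    # 15.8: rank 54.95 lands inside the 15.8 block
print(percentile(data, 0.0901, 100))  # interpolated again, despite frequency 100
```

With frequency 100 there are 100 copies of 15.8, so rank 54.95 falls entirely within them and no interpolation happens; with frequency 1 or 2 the rank falls in the gap between 15.8 and 198, so the result is a weighted mix of the two.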