
use of frequency argument in percentile function in spark sql

I'm trying to use the percentile function in spark-SQL.

Data:

col1
----
198
15.8
198
198
198
198
198
198
198
198
198

If I use the query below, the percentile value I get is incorrect:

select percentile(col1, 0.05) from tblname

output: 106.9

With a frequency of 2, the value is still incorrect:

select percentile(col1, 0.05, 2) from tblname

output: 24.91000000000001

But with the query below I get the expected result (though I don't understand why or how):

select percentile(col1, 0.05, 100) from tblname

Output: 15.8

Can anyone help me understand how the last argument changes things? Is there any documentation? I checked the docstrings in the Spark source code (I'm not familiar with Scala) but had no luck, and found nothing on the official website either. All I could find is this:

> percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive integral

Link

asked Dec 04 '25 16:12 by Vijay Jangir

1 Answer

The frequency argument specifies how many times an element should be counted, so when you specify frequency 100, each element is counted 100 times.

This allows each distinct percentile value to have a specific item it can map to, which removes the need for interpolation.

Note that you can always find a percentile that will still result in interpolation, giving you an unexpected value. For example, in your case, try percentile 0.0901, i.e. the 9.01th percentile.
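The mechanics above can be sketched in plain Python. This is a hedged reconstruction of Spark's exact-percentile behavior (Hive-style linear interpolation over the sorted data, with each row counted `frequency` times), not Spark's actual implementation; the function name `percentile_exact` is my own. It reproduces all three outputs from the question:

```python
def percentile_exact(values, percentage, frequency=1):
    """Exact percentile with each value counted `frequency` times.

    Sketch of the assumed algorithm: expand the data so each element
    appears `frequency` times, sort it, then linearly interpolate at
    position percentage * (n - 1).
    """
    expanded = sorted(v for v in values for _ in range(frequency))
    pos = percentage * (len(expanded) - 1)
    lo = int(pos)            # index at or below the target position
    frac = pos - lo          # fractional part -> interpolation weight
    if frac == 0:
        return expanded[lo]
    return expanded[lo] + frac * (expanded[lo + 1] - expanded[lo])

# The question's column: ten 198s and one 15.8.
data = [198.0] * 10 + [15.8]

print(percentile_exact(data, 0.05))        # 106.9  (interpolates 15.8 -> 198)
print(percentile_exact(data, 0.05, 2))     # ~24.91 (still interpolates)
print(percentile_exact(data, 0.05, 100))   # 15.8   (position lands inside the run of 15.8s)
```

With frequency 1 there are 11 values, so the 5th percentile falls at position 0.5, halfway between 15.8 and the first 198. With frequency 100 there are 1100 values and the target position (54.95) falls entirely inside the block of one hundred 15.8s, so no interpolation is needed.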

answered Dec 06 '25 11:12 by bluesmoon


