I'm trying to use the `percentile` function in Spark SQL.
Data:
col1
----
198
15.8
198
198
198
198
198
198
198
198
198
If I use the query below, the percentile value I get seems incorrect:
select percentile(col1, 0.05) from tblname
output: 106.9
If I use the query below, the value is also not what I expect:
select percentile(col1, 0.05, 2) from tblname
output: 24.91000000000001
But if I use the query below, I get the value I expect (though I don't understand why or how):
select percentile(col1, 0.05, 100) from tblname
output: 15.8
Can anyone help me understand how the last argument changes the result? Is there any documentation? I checked the docstrings in the Spark source code (I don't know Scala) but had no luck, and found nothing on the official website either. All I have is this description:
> percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive integral.
Link
The frequency argument specifies how many times an element should be counted, so when you specify frequency 100, each element is counted 100 times.
This allows each distinct percentile value to have a specific item it can map to, which removes the need for interpolation.
Note that you can always find a percentile that still falls between two distinct values and therefore gets interpolated, giving you an unexpected result. For example, in your case, try percentile 0.0901, i.e. the 9.01th percentile.
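The behavior above can be sketched in plain Python. This is an assumption about Spark's implementation, but it matches all three of your outputs: the exact percentile is taken at rank p * (n - 1) over the (frequency-expanded) sorted values, with linear interpolation between the two neighboring values when the rank is fractional.

```python
def percentile(values, p, frequency=1):
    """Exact percentile with linear interpolation; each value is counted
    `frequency` times, mirroring percentile(col, p, frequency)."""
    expanded = sorted(v for v in values for _ in range(frequency))
    rank = p * (len(expanded) - 1)   # target rank in the expanded list
    lo = int(rank)                   # index just below the target rank
    frac = rank - lo                 # fractional part -> interpolation weight
    if frac == 0:
        return expanded[lo]
    return expanded[lo] + frac * (expanded[lo + 1] - expanded[lo])

data = [198.0] * 10 + [15.8]

print(percentile(data, 0.05))         # ~106.9: rank 0.5 sits between 15.8 and 198
print(percentile(data, 0.05, 2))      # ~24.91: rank 1.05 still straddles the gap
print(percentile(data, 0.05, 100))    # 15.8: rank 54.95 lands inside the 15.8 block
print(percentile(data, 0.0901, 100))  # interpolated again, despite frequency 100
```

With frequency 100 there are 100 copies of 15.8, so rank 54.95 falls entirely within them and no interpolation happens; with frequency 1 or 2 the rank falls in the gap between 15.8 and 198, so the result is a weighted mix of the two.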