In trying to understand what 'percent_rank' does,
I took a peek at the code and found the expression length(!is.na(x))
. I can't think of any situation in which length(x) != length(!x)
so I'm wondering if this is a typo (perhaps it should be sum(!is.na(x))
?) or if here really is such a situation??
This is a bug. This code was introduced in this commit, which is a stated response to issue 774. Hadley writes in issue #774:
Oh oops, I'd say that's a bug in my R implementation. The denominator should be the number of non-NAs, not the length. (emphasis added)
We don't need options to control behave, just ensure that NAs in input are NA in output
But as you note, it should have been sum(!is.na(x))
not length
to implement the intended fix.
I think it is a matter of preference. Even the help says:
x a vector of values to rank. Missing values are left as is. If you want to treat them as the smallest or largest values, replace with Inf or -Inf before ranking.
If you decompose percent_rank
to individual elements and apply to a sample vector you get:
> x <- c(1, 1, 2, 3, NA)
> left <- rank(x, ties.method = "min", na.last = "keep") - 1
> right <- length(!is.na(x)) - 1
> out <- left/right
> out
[1] 0.00 0.00 0.50 0.75 NA
> x[is.na(x)] <- Inf
> left <- rank(x, ties.method = "min", na.last = "keep") - 1
> right <- length(!is.na(x)) - 1
> out <- left/right
> out
[1] 0.00 0.00 0.50 0.75 1.00
I am fine with how the function works right now. You just need to make sure that NA
is changed to Inf/-Inf
if you would like to get always 0-1 range. What I am not sure about, if this matches SQL2003 standard.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With