Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Typo in `dplyr::percent_rank`?

Tags:

r

dplyr

In trying to understand what 'percent_rank' does, I took a peek at the code and found the expression length(!is.na(x)). I can't think of any situation in which length(x) != length(!x) so I'm wondering if this is a typo (perhaps it should be sum(!is.na(x))?) or if here really is such a situation??

like image 978
Jthorpe Avatar asked May 12 '15 22:05

Jthorpe


2 Answers

This is a bug. This code was introduced in this commit, which is a stated response to issue 774. Hadley writes in issue #774:

Oh oops, I'd say that's a bug in my R implementation. The denominator should be the number of non-NAs, not the length. (emphasis added)

We don't need options to control behave, just ensure that NAs in input are NA in output

But as you note, it should have been sum(!is.na(x)) not length to implement the intended fix.

like image 187
Sam Firke Avatar answered Sep 21 '22 06:09

Sam Firke


I think it is a matter of preference. Even the help says:

x a vector of values to rank. Missing values are left as is. If you want to treat them as the smallest or largest values, replace with Inf or -Inf before ranking.

If you decompose percent_rank to individual elements and apply to a sample vector you get:

> x <- c(1, 1, 2, 3, NA)
> left  <- rank(x, ties.method = "min", na.last = "keep") - 1
> right <- length(!is.na(x)) - 1
> out   <- left/right
> out
[1] 0.00 0.00 0.50 0.75   NA
> x[is.na(x)] <- Inf
> left  <- rank(x, ties.method = "min", na.last = "keep") - 1
> right <- length(!is.na(x)) - 1
> out   <- left/right
> out
[1] 0.00 0.00 0.50 0.75 1.00

I am fine with how the function works right now. You just need to make sure that NA is changed to Inf/-Inf if you would like to get always 0-1 range. What I am not sure about, if this matches SQL2003 standard.

like image 37
Tomas Greif Avatar answered Sep 21 '22 06:09

Tomas Greif