Typo in `dplyr::percent_rank`?

Question

In trying to understand what 'percent_rank' does, I took a peek at the code and found the expression length(!is.na(x)). I can't think of any situation in which length(x) != length(!x) so I'm wondering if this is a typo (perhaps it should be sum(!is.na(x))?) or if here really is such a situation??

Sam Firke · Accepted Answer

This is a bug. This code was introduced in this commit, which is a stated response to issue 774. Hadley writes in issue #774:

Oh oops, I'd say that's a bug in my R implementation. The denominator should be the number of non-NAs, not the length. (emphasis added)

We don't need options to control behave, just ensure that NAs in input are NA in output

But as you note, it should have been sum(!is.na(x)) not length to implement the intended fix.

Tomas Greif · Answer

I think it is a matter of preference. Even the help says:

x a vector of values to rank. Missing values are left as is. If you want to treat them as the smallest or largest values, replace with Inf or -Inf before ranking.

If you decompose percent_rank to individual elements and apply to a sample vector you get:

> x <- c(1, 1, 2, 3, NA)
> left  <- rank(x, ties.method = "min", na.last = "keep") - 1
> right <- length(!is.na(x)) - 1
> out   <- left/right
> out
[1] 0.00 0.00 0.50 0.75   NA
> x[is.na(x)] <- Inf
> left  <- rank(x, ties.method = "min", na.last = "keep") - 1
> right <- length(!is.na(x)) - 1
> out   <- left/right
> out
[1] 0.00 0.00 0.50 0.75 1.00

I am fine with how the function works right now. You just need to make sure that NA is changed to Inf/-Inf if you would like to get always 0-1 range. What I am not sure about, if this matches SQL2003 standard.

Typo in `dplyr::percent_rank`?

Tags:

r

dplyr

Jthorpe

2 Answers

Sam Firke

Tomas Greif

Recent Activity

Donate For Us

Typo in `dplyr::percent_rank`?

Tags:

r

dplyr

Jthorpe

2 Answers

Sam Firke

Tomas Greif

Related questions

Recent Activity

Donate For Us