Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the number of values above a value and less than a value in a df column without using a loop

Tags:

r

Consider the following:

df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))

count_above <- function(vector)
{
  counts <- vector()
  counts[1] <- 0
  for (i in 2:length(vector))
  {
    temp <- vector[1:i]
    counts <- c(counts, sum(temp < vector[i]))
  }
  return(counts)
}

This gives me the correct output:

count_above(df$X)
[1] 0 1 1 0 2

For instance, the (column) vector here is

5000
6000
5500
5000
5300

At the very top 5000, there are no values above it. So this gives value 0.

At the 6000, there is one value which is above it and is less than 6000: the 5000. So this gives value 1.

At the 5500, there are two values above it, one of which is less than 5500, so this gives value 1, and so forth.

Is there any way I can write this out without using a loop?

like image 244
Clarinetist Avatar asked Sep 18 '16 14:09

Clarinetist


4 Answers

Another approach, quite similar to aichao's solution (but a bit shorter)

X <- c(5000, 6000, 5500, 5000, 5300)
indices <- 1:length(X)
count_above <- colSums(outer(X, X, "<") & outer(indices, indices, "<"))
## [1] 0 1 1 0 2

Edit (Performance): Perhaps my idea was selected as the accepted answer because it is short and self-explaining code - but be careful to use it on large vectors! It's the slowest approach of all the solutions suggested here! Similar to that what dracodoc did, I also did a microbenchmark. But I used a random generated vector of 3000 values to get more significant run times:

count_above_loop <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}

count_above_outer <- function(X) {
  indices <- 1:length(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}

count_above_apply <- function(X) {
  sapply(seq_len(length(X)), function(i) sum(X[i:1] < X[i]))
}

X <- runif(3000)

microbenchmark::microbenchmark(count_above_loop(X), 
                               count_above_apply(X),
                               count_above_outer(X), times = 10)

Unit: milliseconds
                 expr       min        lq      mean    median        uq       max neval cld
  count_above_loop(X)  56.27923  58.17195  62.07571  60.08123  63.92010  77.31658    10  a 
 count_above_apply(X)  54.41776  55.07511  57.12006  57.22372  58.61982  59.95037    10  a 
 count_above_outer(X) 121.12352 125.56072 132.45728 130.08141 137.08873 154.28419    10   b

We see that the apply approach on a large vector and without the overhead of a data frame is slightly faster than the for-loop.

My outer product approach takes more than double the time.

So I would recommend to use the for-loop - it's also readable and faster. My approach might be considered if you want to have provable correct code (as this one-liner is quite near to a specification of the problem)

like image 79
Patrick Roocks Avatar answered Nov 08 '22 09:11

Patrick Roocks


Consider a running conditional count with sapply(). Though this is still a loop, it is a vectorized method:

count_above <- sapply(seq_len(nrow(df)), 
                      function(i) sum(df[i:1, c("X")] < df$X[i]))
count_above
# [1] 0 1 1 0 2
like image 25
Parfait Avatar answered Nov 08 '22 10:11

Parfait


EDIT: I should use bigger dataset for benchmark, the tiny dataset make the benchmark results a little bit misleading. See PatrickRoocks's update.

I just commented that for loop is not necessarily bad than apply family then I saw this.

I did a microbenchmark comparing a optimized for loop and the sapply method. for loop is 6 times faster. The sapply method is not a proper function, modifying it into a function taking a vector instead of assuming data frame columns could improve a little bit.

df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))

count_above <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}
count_above(df$X)

microbenchmark::microbenchmark(count_above(df$X), sapply(seq_len(nrow(df)), function(i) sum(df[i:1, c("X")] < df$X[i])), times = 10)

Unit: microseconds
                                                                  expr
                                                     count_above(df$X)
 sapply(seq_len(nrow(df)), function(i) sum(df[i:1, c("X")] < df$X[i]))
     min      lq     mean   median      uq     max neval cld
  38.623  41.068  65.0722  55.0010  65.512 142.757    10  a 
 262.045 269.379 368.6231 339.2905 415.067 640.934    10   b

Update:

# modify Parfait's answer into a function, taking vector instead of data frame
count_above_2 <- function(v){
  counts <- sapply(seq_len(length(v)), 
    function(i) sum(v[i:1] < v[i]))
  return(counts)
}

X <- df$X

microbenchmark::microbenchmark(count_above(X), count_above_2(X), {indices <- 1:length(X); colSums(outer(X, X, "<") & outer(indices, indices, "<"))}, times = 100)

Unit: microseconds
                                                                                        expr
                                                                              count_above(X)
                                                                            count_above_2(X)
 {     indices <- 1:length(X)     colSums(outer(X, X, "<") & outer(indices, indices, "<")) }
    min      lq     mean  median      uq     max neval cld
 21.023 23.4680 39.02878 26.1565 35.4450 144.224   100  a 
 41.067 49.3785 67.06162 53.2900 70.1565 166.712   100   b
 37.646 40.0900 66.45059 53.0450 72.8455 258.623   100   b

For loop still wins. Transfer a vector instead of df$X save time for all, so I give 3 solutions same vector to be comparable. Parfait's answer is comparable with PatrickRoocks's.

Besides performance, there is a subtle point of correctness.

OP's function and Parfait's sum(v[i:1] < v[i]) give correct answer only because v[i] < v[i] is FALSE. By definition it should use v[1:(i-1)] < v[i].

My function can be written in a more concise version like this:

count_above <- function(v)
{
  counts <- integer(length = length(v))
  for (i in 1:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}

It looks better and give correct result. This also depend on v[1] < v[1] is FALSE. It is not necessarily wrong since it is only about the first row, though I would still prefer the longer but more obvious version.

like image 20
dracodoc Avatar answered Nov 08 '22 08:11

dracodoc


Another approach (still a loop because of colSums):

xg <- expand.grid(df$X,df$X)
o <- matrix(xg$Var1 < xg$Var2, nrow=length(x))
o[lower.tri(o)] <- FALSE
count_above <- colSums(o)
##[1] 0 1 1 0 2

This will most likely not be as efficient as Parfait's answer, but it is an alternative.

like image 24
aichao Avatar answered Nov 08 '22 09:11

aichao