Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get first value that matches condition (loop too slow)

Tags:

r

I've got a lot of matrices similar to this but with thousands of rows :

r <- 10
c <- 2
set.seed(333)

m1 <- matrix(runif(r*c)+1, r, c)

> m1
          [,1]     [,2]
 [1,] 1.467001 1.393902
 [2,] 1.084598 1.474218
 [3,] 1.973485 1.891222
 [4,] 1.571306 1.665011
 [5,] 1.020119 1.736832
 [6,] 1.723557 1.911469
 [7,] 1.609394 1.637850
 [8,] 1.306719 1.864651
 [9,] 1.063510 1.287575
[10,] 1.305353 1.129959

I've got a loop that tells me, for each value of the first column, what is the index of the first value in the second column that is 10% higher like so :

result <- 1:nrow(m1)

for (i in 1:nrow(m1)){
    result[i] <- which(m1[,2]>(1.1*m1[,1][i]))[1]
}
> result
 [1]  3  1 NA  3  1  6  3  2  1  2

I've got so much matrices that it's taking hours, and after profiling my code, the biggest time consuming task by far is this loop. What is, according to you, the fastest way to do it ?

For example, with r = 30000 :

start_time <- Sys.time()

for (i in 1:nrow(m1)){
    result[i] <- which(m1[,2]>(1.1*m1[,1][i]))[1]
}

end_time <- Sys.time()
a <- end_time - start_time

> a
Time difference of 11.25815 secs

Thanks for you help !

like image 847
Boog Avatar asked Mar 21 '19 06:03

Boog


1 Answers

There are some shortcuts you can take here. You are looking for the first value in column 2 that is higher than some other value. This means that it is never worth looking at values that are lower than what we have previously seen in column 2.

In your example with 10 rows, that would be as follows:

> cummax(m1[, 2])
 [1] 1.393902 1.474218 1.891222 1.891222 1.891222 1.911469 1.911469 1.911469 1.911469 1.911469
> which(cummax(m1[, 2]) == m1[, 2])
[1] 1 2 3 6

And as you can see, these are the only values in your result vector.

A second optimisation that could be made is to order the first column. If you start looking for the lowest value first, and work your way up, you don't have to look through the second column each time. You only have to step to the next row there if there are no matches with the left row anymore.

This does bear the cost of sorting the matrix, but afterwards the result can be found using a single pass through both columns.

dostuff <- function(m1){
  orderColumn1 <- order(m1[, 1])

  plus.10 <- m1[, 1] * 1.1

  results <- rep(NA, length(plus.10))

  IndexColumn1 <- 1
  IndexColumn2 <- 1
  row2CurrentMax <- 0
  while(IndexColumn2 <= nrow(m1)){
    row2Current <- m1[IndexColumn2, 2]
    if(row2Current > row2CurrentMax){
      row2CurrentMax <- row2Current
      while(TRUE){
        row1Current <- plus.10[orderColumn1[IndexColumn1]]
        if(row1Current <= row2CurrentMax){
          results[orderColumn1[IndexColumn1]] <- IndexColumn2
          IndexColumn1 <- IndexColumn1 + 1
        } else {
          break
        }
      }
    }
    IndexColumn2 <- IndexColumn2 + 1
  }
  results
}

With 30000 rows:

> result <- dostuff(m1)
> end_time <- Sys.time()
> a <- end_time - start_time
> a
Time difference of 0.0600059 secs
like image 55
JAD Avatar answered Sep 27 '22 19:09

JAD