Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorising a for loop containing a which statement and a function

A reproducible example of the code I'm trying vectorise.

cutOffs <- seq(1,10,0.2)

plotOutput <- matrix(nrow=length(cutOffs), ncol=2)
colnames(plotOutput) <- c("x","y")
plotOutput[,"y"] <- cutOffs

for(plotPoint in 1:length(cutOffs))
{
  plotOutput[plotPoint, "x"] <-
    nrow(iris[ which(iris$Sepal.Length > cutOffs[plotPoint] &
                   iris$Sepal.Width > cutOffs[plotPoint]), ])
}

plotOutput

Specifically what I'm looking to find out is, if there's a way to vectorise this part.

nrow(iris[ which(iris$Sepal.Length > cutOffs[plotPoint] &
                   iris$Sepal.Width > cutOffs[plotPoint]), ])

Let's say I was to use the plyr library or some form of apply, there's probably not much speed up, which is really what I'm looking for. Fundamentally I'm trying to see if there's some technique for vectorising that I've overlooked or managed to miss while searching.

UPDATE:

Unit: milliseconds
  expr         min          lq        mean      median          uq         max neval
  op() 33663.39700 33663.39700 33663.39700 33663.39700 33663.39700 33663.39700     1
  jr()  3976.53088  3976.53088  3976.53088  3976.53088  3976.53088  3976.53088     1
  dd()  4253.21050  4253.21050  4253.21050  4253.21050  4253.21050  4253.21050     1
 exp()  5085.45331  5085.45331  5085.45331  5085.45331  5085.45331  5085.45331     1
 nic()  8719.82043  8719.82043  8719.82043  8719.82043  8719.82043  8719.82043     1
  sg()    16.66177    16.66177    16.66177    16.66177    16.66177    16.66177     1

A more realistic approximation of what I'm actually doing is this

# generate data
numObs <- 1e5
iris <- data.frame( Sepal.Length = sample(1:numObs), Sepal.Width = sample(1:numObs) )

cutOffs <- 1:(numObs*0.01)

plotOutput <- matrix(nrow=length(cutOffs), ncol=2)
colnames(plotOutput) <- c("x","y")
plotOutput[,"y"] <- cutOffs

followed by whichever particular method one prefers.

Generally speaking it'll be used on data sets with 50,000 - 200,000 points.

There was a big jump from using

sum(Sepal.Length > cutOffs[plotPoint] & Sepal.Width > cutOffs[plotPoint])

which is what I was missing as a more optimal approach in the first place.

By far however, the best answer is sgibb's sg(). The key is realising it's only lowest of the two values in each row that matters. Once that mental leap has been made there's only a single vector left to deal with and vectorising is reasonably straightforward.

# cutOff should be lower than the lowest of Sepal.Length & Sepal.Width
  m <- pmin(iris$Sepal.Length, iris$Sepal.Width)
like image 518
Henry E Avatar asked May 05 '15 13:05

Henry E


2 Answers

I like to add another answer:

sg <- function() {
  # cutOff should be lower than the lowest of Sepal.Length & Sepal.Width
  m <- pmin(iris$Sepal.Length, iris$Sepal.Width)
  ms <- sort.int(m)
  # use `findInterval` to find all the indices 
  # (equal to "how many numbers below") lower than the threshold
  plotOutput[,"x"] <- length(ms)-findInterval(cutOffs, ms)
  plotOutput
}

This approach avoids a for or outer loop and is 4x times faster than @nicola's approach:

microbenchmark(sg(), nic(), dd())
#Unit: microseconds
#  expr     min       lq     mean   median       uq      max neval
#  sg()  88.726 104.5805 127.3172 123.2895 144.2690  232.441   100
# nic() 474.315 526.7780 625.0021 602.3685 706.7530  997.412   100
#  dd() 669.841 736.7800 887.4873 847.7730 976.6445 2800.930   100

identical(sg(), dd())
# [1] TRUE
like image 168
sgibb Avatar answered Oct 23 '22 10:10

sgibb


You can use outer:

plotOutput[,"x"]<-colSums(outer(1:nrow(iris),1:length(cutOffs),function(x,y) iris$Sepal.Length[x] > cutOffs[y] & iris$Sepal.Width[x] > cutOffs[y]))
like image 42
nicola Avatar answered Oct 23 '22 10:10

nicola