A reproducible example of the code I'm trying to vectorise.
cutOffs <- seq(1,10,0.2)
plotOutput <- matrix(nrow=length(cutOffs), ncol=2)
colnames(plotOutput) <- c("x","y")
plotOutput[,"y"] <- cutOffs
for (plotPoint in 1:length(cutOffs))
{
  plotOutput[plotPoint, "x"] <-
    nrow(iris[which(iris$Sepal.Length > cutOffs[plotPoint] &
                    iris$Sepal.Width > cutOffs[plotPoint]), ])
}
plotOutput
Specifically, what I'm looking to find out is whether there's a way to vectorise this part:
nrow(iris[ which(iris$Sepal.Length > cutOffs[plotPoint] &
iris$Sepal.Width > cutOffs[plotPoint]), ])
If I were to use the plyr library or some form of apply, there's probably not much of a speed-up to be had, and speed is really what I'm looking for. Fundamentally I'm trying to see if there's some vectorisation technique that I've overlooked or missed while searching.
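For concreteness, here is the sort of sapply version I have in mind (a sketch; I'd expect it to perform roughly the same as the loop, since each iteration still scans the whole data frame):

# apply over the cut-offs; same subset-and-count work per iteration as the for loop
plotOutput[,"x"] <- sapply(cutOffs, function(cutOff)
  nrow(iris[which(iris$Sepal.Length > cutOff &
                  iris$Sepal.Width > cutOff), ]))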
UPDATE:
Unit: milliseconds
expr min lq mean median uq max neval
op() 33663.39700 33663.39700 33663.39700 33663.39700 33663.39700 33663.39700 1
jr() 3976.53088 3976.53088 3976.53088 3976.53088 3976.53088 3976.53088 1
dd() 4253.21050 4253.21050 4253.21050 4253.21050 4253.21050 4253.21050 1
exp() 5085.45331 5085.45331 5085.45331 5085.45331 5085.45331 5085.45331 1
nic() 8719.82043 8719.82043 8719.82043 8719.82043 8719.82043 8719.82043 1
sg() 16.66177 16.66177 16.66177 16.66177 16.66177 16.66177 1
A more realistic approximation of what I'm actually doing is this:
# generate data
numObs <- 1e5
iris <- data.frame( Sepal.Length = sample(1:numObs), Sepal.Width = sample(1:numObs) )
cutOffs <- 1:(numObs*0.01)
plotOutput <- matrix(nrow=length(cutOffs), ncol=2)
colnames(plotOutput) <- c("x","y")
plotOutput[,"y"] <- cutOffs
followed by whichever particular method one prefers.
Generally speaking, it'll be used on data sets with 50,000–200,000 points.
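The timings in the table above came from microbenchmark; assuming each answer is wrapped in a function named as in the table (op(), jr(), dd(), exp(), nic(), sg()), the harness would look something like this:

library(microbenchmark)
# a single evaluation of each candidate, matching neval = 1 in the table above
microbenchmark(op(), jr(), dd(), exp(), nic(), sg(), times = 1)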
There was a big jump from using
sum(iris$Sepal.Length > cutOffs[plotPoint] & iris$Sepal.Width > cutOffs[plotPoint])
which is the more optimal approach I was missing in the first place.
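In context, that replaces the nrow(which(...)) body of the original loop:

for (plotPoint in 1:length(cutOffs))
{
  # summing a logical vector counts the TRUEs directly,
  # avoiding the subset and the call to nrow()
  plotOutput[plotPoint, "x"] <- sum(iris$Sepal.Length > cutOffs[plotPoint] &
                                    iris$Sepal.Width > cutOffs[plotPoint])
}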
By far the best answer, however, is sgibb's sg(). The key is realising that only the lower of the two values in each row matters. Once that mental leap has been made, there's only a single vector left to deal with and vectorising it is reasonably straightforward.
# cutOff should be lower than the lowest of Sepal.Length & Sepal.Width
m <- pmin(iris$Sepal.Length, iris$Sepal.Width)
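Even before the fully vectorised version, once there's a single vector the per-cut-off count collapses to a one-liner (a sketch):

# count, for each cut-off, how many row minima exceed it
plotOutput[,"x"] <- vapply(cutOffs, function(cutOff) sum(m > cutOff), integer(1))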
I'd like to add another answer:
sg <- function() {
  # cutOff should be lower than the lowest of Sepal.Length & Sepal.Width
  m <- pmin(iris$Sepal.Length, iris$Sepal.Width)
  ms <- sort.int(m)
  # use `findInterval` to find all the indices
  # (equal to "how many numbers below") lower than the threshold
  plotOutput[,"x"] <- length(ms) - findInterval(cutOffs, ms)
  plotOutput
}
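To see why the subtraction gives the count, here is a tiny worked example (values made up for illustration):

ms <- c(1, 3, 5, 7, 9)             # sorted row minima
findInterval(4, ms)                # 2: two values are <= 4
length(ms) - findInterval(4, ms)   # 3: the values 5, 7, 9 exceed the cut-off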
This approach avoids a for or outer loop and is about 4x faster than @nicola's approach:
microbenchmark(sg(), nic(), dd())
#Unit: microseconds
# expr min lq mean median uq max neval
# sg() 88.726 104.5805 127.3172 123.2895 144.2690 232.441 100
# nic() 474.315 526.7780 625.0021 602.3685 706.7530 997.412 100
# dd() 669.841 736.7800 887.4873 847.7730 976.6445 2800.930 100
identical(sg(), dd())
# [1] TRUE
You can use outer:
plotOutput[,"x"]<-colSums(outer(1:nrow(iris),1:length(cutOffs),function(x,y) iris$Sepal.Length[x] > cutOffs[y] & iris$Sepal.Width[x] > cutOffs[y]))