Suppose we have a vector (or a data.frame, for that matter) as follows:
set.seed(1)
x <- sample(10, 1e6, TRUE)
And one wants to get all values of x where x > 4, say:
a1 <- x[x > 4] # (or)
a2 <- x[which(x > 4)]
identical(a1, a2) # TRUE
I think most people would prefer x[x > 4]. But surprisingly (at least to me), subsetting using which is faster!
require(microbenchmark)
microbenchmark(x[x > 4], x[which(x > 4)], times = 100)
Unit: milliseconds
            expr      min       lq   median       uq       max neval
        x[x > 4] 56.59467 57.70877 58.54111 59.94623 104.51472   100
 x[which(x > 4)] 26.62217 27.64490 28.31413 29.97908  99.68973   100
It's about 2.1 times faster on my machine.
One possibility for the difference, I thought, could be the fact that which doesn't return NA positions whereas > keeps them. But then the logical operation itself should be the reason for this difference, which is obviously not the case. That is:
microbenchmark(x > 4, which(x > 4), times = 100)
Unit: milliseconds
         expr       min       lq   median       uq      max neval
        x > 4  8.182576 10.06163 12.68847 14.64203 60.83536   100
 which(x > 4) 18.579746 19.94923 21.43004 23.75860 64.20152   100
Using which is about 1.7 times slower before the subsetting even happens. But which seems to catch up drastically during the subsetting itself.
It doesn't seem possible to use my usual weapon of choice, debugonce (thanks to @GavinSimpson), as which calls .Internal(which(x)) whereas == calls .Primitive("==").
My question, therefore, is: why is [ on the numeric index vector resulting from which faster than on the logical vector resulting from >? Any ideas?
I think I should move out of the comments and add an answer. This is my hunch, building on what the others have answered and discussed. (I'm sure the real answer exists in the C source for subset_dflt.)
Once I have a vector x and a logical vector x > 0, I can subset x on x > 0 in two ways. I can use which, or I can use the vector x > 0 directly as the index. However, we must note that the two are not identical, since x[x > 0] will preserve NAs while x[which(x > 0)] will not.
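A small illustration of that difference (my own toy example, not part of the benchmarks below): a comparison involving NA yields NA, which logical indexing keeps as a missing element but which() drops entirely.
y <- c(-1, NA, 2)
y[y > 0]         # NA 2  -- the NA position survives as a missing value
y[which(y > 0)]  # 2     -- which() silently drops the NA position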
However, in either method I will need to examine every element of the vector x > 0. With an explicit which call I only have to examine the logical (TRUE/FALSE) state of each element, while in a direct subsetting operation I have to examine both the missingness and the logical state of each element.
@flodel brings up an interesting observation. Since [, is.na, which, and | are all primitives or internal routines, let's assume no extraordinary overhead and do this experiment:
microbenchmark(which(x > 0), x[which(x > 0)], x > 0 | is.na(x), x[x > 0],
               unit = "us", times = 1000)
Unit: microseconds
             expr      min       lq   median       uq      max neval
     which(x > 0) 1219.274 1238.693 1261.439 1900.871 23085.57  1000
  x[which(x > 0)] 1554.857 1592.543 1974.370 2339.238 23816.99  1000
 x > 0 | is.na(x) 3439.191 3459.296 3770.260 4194.474 25234.70  1000
         x[x > 0] 3838.455 3876.816 4267.261 4621.544 25734.53  1000
Considering the median values, and assuming that x > 0 | is.na(x) is a crude model of what I'm saying happens in logical subsetting, the actual time taken by the 'subset' step is ~500 us, while the time taken by the 'subset' step with which is ~700 us. The two numbers are comparable, and indicate that it is not the subsetting itself which is costly in one method or the other. Instead, it is what is being done to compute the wanted subset that is cheaper in the which method.
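For concreteness, the ~500 us and ~700 us figures come from differencing the medians in the table above (a back-of-the-envelope check, not an additional measurement):
4267.261 - 3770.260  # x[x > 0] minus (x > 0 | is.na(x))    ~ 497 us for the logical 'subset' step
1974.370 - 1261.439  # x[which(x > 0)] minus which(x > 0)   ~ 713 us for the which-based 'subset' step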
Here's my take on it. Subsetting on a numeric index allows pulling out exactly those elements that are required. Subsetting on a logical index requires examining each element of the index vector to see if it's TRUE, and then building an internal list of the required elements of the target vector. There are two steps involved, so it will take longer.
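A very rough R sketch of those two strategies (my own illustration of the logic only; the real work happens in much more efficient C code):
logical_subset <- function(x, lgl) {
  out <- x[0]
  for (i in seq_along(lgl)) {        # must scan every element of the logical index
    if (is.na(lgl[i])) {
      out <- c(out, x[NA_integer_])  # an NA index contributes an NA result
    } else if (lgl[i]) {
      out <- c(out, x[i])
    }
  }
  out
}
numeric_subset <- function(x, idx) {
  out <- x[0]
  for (j in idx) {                   # touch only the positions we actually want
    out <- c(out, x[j])
  }
  out
}
v <- c(-1, NA, 2, 5)
logical_subset(v, v > 0)             # NA 2 5, like v[v > 0]
numeric_subset(v, which(v > 0))      # 2 5,    like v[which(v > 0)]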
The difference is biggest when the number of elements extracted is small relative to the size of the original vector. For example:
> z <- rnorm(1e8)
> system.time(z[which(z < -5)])
user system elapsed
0.58 0.03 0.60
> system.time(z[z < -5])
user system elapsed
2.56 0.14 2.70
> system.time(z[which(z < 5)])
user system elapsed
1.39 0.30 1.68
> system.time(z[z < 5])
user system elapsed
2.82 0.44 3.26
Here, if you're pulling out only a small proportion of elements (there were 23 elements with z < -5 in my test), using which takes a very small fraction of the time needed by logical indexing. However, if you're extracting a large proportion of elements, the times are closer.
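To see that proportion effect more systematically, one could sweep the cutoff and compare both styles at each threshold (a sketch of my own, using a smaller vector so it runs quickly; exact numbers will vary by machine):
library(microbenchmark)
set.seed(1)
z <- rnorm(1e6)
for (cutoff in c(-3, -1, 0, 1, 3)) {
  cat("cutoff", cutoff, "selects", round(100 * mean(z < cutoff), 1), "% of elements\n")
  print(microbenchmark(logical = z[z < cutoff],
                       which   = z[which(z < cutoff)],
                       times   = 25))
}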