Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is subsetting on a "logical" type slower than subsetting on "numeric" type?

Tags:

r

subset

Suppose we've a vector (or a data.frame for that matter) as follows:

set.seed(1)
x <- sample(10, 1e6, TRUE)

And one wants to get all values of x where x > 4, say:

a1 <- x[x > 4] # (or) 
a2 <- x[which(x > 4)]

identical(a1, a2) # TRUE

I think most people would prefer x[x > 4]. But surprisingly (at least to me), subsetting using which is faster!

require(microbenchmark)
microbenchmark(x[x > 4], x[which(x > 4)], times = 100)

Unit: milliseconds
            expr      min       lq   median       uq       max neval
        x[x > 4] 56.59467 57.70877 58.54111 59.94623 104.51472   100
 x[which(x > 4)] 26.62217 27.64490 28.31413 29.97908  99.68973   100

It's about 2.1 times faster on mine.

One possibility for the difference, I thought, could be due to the fact that which doesn't consider NA but > returns them as well. But then logical operation itself should be the reason for this difference, which is not the case (obviously). That is:

microbenchmark(x > 4, which(x > 4), times = 100)

Unit: milliseconds
         expr       min       lq   median       uq      max neval
        x > 4  8.182576 10.06163 12.68847 14.64203 60.83536   100
 which(x > 4) 18.579746 19.94923 21.43004 23.75860 64.20152   100

Using which is about 1.7 times slower just before subsetting. But which seems to catch up drastically on/during subsetting.

It seems not possible to use my usual weapon of choice debugonce (thanks to @GavinSimpson) as which calls .Internal(which(x)) whereas == calls .Primitive("==").

My question therefore is why is [ on numeric type resulting from which faster than logical vector resulting from >? Any ideas?

like image 762
Arun Avatar asked Jul 07 '13 09:07

Arun


2 Answers

I think I should move out of the comments and add an answer. This is my hunch building up on what the others have answered and discussed. (I'm sure the real answer exists in the C source for subset_dflt.)

Once I have a vector x and a logical vector x > 0, I can subset x on x > 0 in two ways. I can use which or I can use the vector x > 0 directly as the indexing. However, we must note that the two are not identical since x[x > 0] will preserve NAs while x[which(x > 0)] will not.

However, in either method, I will need to examine each element of the vector x > 0. In an explicit which call I will have to examine only the boolean state of the element while in a direct sub-setting operation I will have to examine both missing-ness and the boolean state of each element.

@flodel brings an interesting observation. Since [, is.na, which, and | are all primitives or internal routines, let's assume no extraordinary overhead and do this experiment:

microbenchmark(which(x > 0), x[which(x > 0)], x > 0 | is.na(x), x[x > 0],
               unit="us", times=1000)

Unit: microseconds
             expr      min       lq   median       uq      max neval
     which(x > 0) 1219.274 1238.693 1261.439 1900.871 23085.57  1000
  x[which(x > 0)] 1554.857 1592.543 1974.370 2339.238 23816.99  1000
 x > 0 | is.na(x) 3439.191 3459.296 3770.260 4194.474 25234.70  1000
         x[x > 0] 3838.455 3876.816 4267.261 4621.544 25734.53  1000

Considering median values, we can see that, assuming x > 0 | is.na(x) is a crude model of what I am saying happens in logical sub-setting, then the actual time taken in 'subset' is ~ 500 us. And the time taken in 'subset' with which is ~ 700 us. Both the numbers are comparable and indicate that it is not the 'subset'ing itself which is costly in one method or another. In stead, it is what is being done to compute the subset wanted that is cheaper in the which method.

like image 107
asb Avatar answered Oct 07 '22 05:10

asb


Here's my take on it. Subsetting on a numeric allows pulling out exactly those elements that are required. Subsetting on a logical requires examining each element of the index vector to see if it's TRUE, and then building an internal list of the required elements of the target vector. There are two steps involved, so will take longer.

The difference is biggest is the number of elements extracted is small relative to the size of the original vector. For example:

> z <- rnorm(1e8)
> system.time(z[which(z < -5)])
   user  system elapsed 
   0.58    0.03    0.60 
> system.time(z[z < -5])
   user  system elapsed 
   2.56    0.14    2.70 
> system.time(z[which(z < 5)])
   user  system elapsed 
   1.39    0.30    1.68 
> system.time(z[z < 5])
   user  system elapsed 
   2.82    0.44    3.26 

Here, if you're pulling out only a small proportion of elements (there were 23 elements of z < -5 in my test), using which takes a very small proportion compared to logical indexing. However, if you're extracting a large proportion of elements, the times are closer.

like image 42
Hong Ooi Avatar answered Oct 07 '22 05:10

Hong Ooi