Time difference between different subsetting methods for data.frame and matrix objects

Consider the following benchmark (R 3.4.1 on a Windows machine):

library(rbenchmark)

mtx <- matrix(runif(1e8), ncol = 100)
df <- as.data.frame(mtx)

colnames(mtx) <- colnames(df) <- paste0("V", 1:100)

benchmark(
  mtx[5000:7000, 80],
  mtx[5000:7000, "V80"],
  mtx[, "V80"][5000:7000],
  mtx[, "V80", drop = FALSE][5000:7000, ],
  mtx[5000:7000, , drop = FALSE][, "V80"],
  #mtx$V80[5000:7000], # does not apply
  replications = 5000
)

##                                      test replications elapsed relative user.self sys.self user.child sys.child
## 4 mtx[, "V80", drop = FALSE][5000:7000, ]         5000   64.71  588.273     47.44    16.61         NA        NA
## 3                 mtx[, "V80"][5000:7000]         5000   72.15  655.909     52.90    18.18         NA        NA
## 2                   mtx[5000:7000, "V80"]         5000    0.11    1.000      0.11     0.00         NA        NA
## 5 mtx[5000:7000, , drop = FALSE][, "V80"]         5000    7.47   67.909      5.89     1.47         NA        NA
## 1                      mtx[5000:7000, 80]         5000    0.13    1.182      0.12     0.00         NA        NA

benchmark(
  df[5000:7000, 80],
  df[5000:7000, "V80"],
  df[, "V80"][5000:7000],
  df[, "V80", drop = FALSE][5000:7000, ],
  df[5000:7000, , drop = FALSE][, "V80"],
  df$V80[5000:7000],
  replications = 5000
)

##                                     test replications elapsed relative user.self sys.self user.child sys.child
## 6                      df$V80[5000:7000]         5000    0.13    1.000      0.12     0.00         NA        NA
## 4 df[, "V80", drop = FALSE][5000:7000, ]         5000    0.33    2.538      0.33     0.00         NA        NA
## 3                 df[, "V80"][5000:7000]         5000    0.17    1.308      0.17     0.00         NA        NA
## 2                   df[5000:7000, "V80"]         5000    0.15    1.154      0.16     0.00         NA        NA
## 5 df[5000:7000, , drop = FALSE][, "V80"]         5000   13.63  104.846     12.91     0.39         NA        NA
## 1                      df[5000:7000, 80]         5000    0.19    1.462      0.17     0.00         NA        NA

The time differences are pretty dramatic. Why is that? What is the recommended way of subsetting, and why? Judging by the benchmarks, mtx[i, colname] for a matrix and df$colname[i] for a data.frame seem to be the most time-efficient (although for the data.frame most variants make little difference), but are there any general reasons to prefer one approach over the others?

asked Nov 07 '22 at 16:11 by Tim


1 Answer

The main reason lies in the R data structures behind matrices and data.frames. A matrix is basically a single vector of nrow x ncol entries (usually numeric; a base R matrix is dense, not sparse) together with a dim attribute. For this reason, your first 2 commands

mtx[5000:7000, 80]
mtx[5000:7000, "V80"]

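only have to copy the 2001 requested values out of that one vector. With the default drop = TRUE the result is a simple vector, which is R's basic object type; R keeps the dim attribute and creates a new matrix object only if you ask for drop = FALSE. By contrast, mtx[, "V80"][5000:7000] first copies the whole column of 1e6 values into a new vector before picking 2001 of them, which would account for most of the time in your benchmark.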
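One way to check what these subsetting calls actually return is to inspect the results on a small stand-in matrix. This is only a sketch: the 20 x 4 size and the names m, a and b are made up for illustration. With R's default drop = TRUE a single-column selection comes back as a plain vector, and the dim attribute survives only with drop = FALSE:

```r
# Small stand-in for the large matrix from the question (hypothetical size)
m <- matrix(runif(20 * 4), ncol = 4)
colnames(m) <- paste0("V", 1:4)

a <- m[5:7, "V3"]                # default drop = TRUE
b <- m[5:7, "V3", drop = FALSE]  # explicitly keep the matrix structure

is.matrix(a)  # FALSE: a plain numeric vector
dim(a)        # NULL
is.matrix(b)  # TRUE: a 3 x 1 matrix
dim(b)        # 3 1
```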

On the other hand, a data.frame in R is by definition a special type of list object in which every column must have the same length, whereas the columns may contain different types of variables (numeric, string etc.). A matrix can only contain a single type of variable; mixed inputs are coerced to the most general one. Thus,
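The list-versus-single-type distinction can be seen directly in a toy example. The objects d and m below are invented for illustration, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version:

```r
# A data.frame is a list of equal-length columns that may differ in type.
d <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
is.list(d)         # TRUE
length(d)          # 2: one list element per column
sapply(d, typeof)  # x is "integer", y is "character"

# A matrix has a single type; mixed inputs are coerced to the most general one.
m <- cbind(1:3, c("a", "b", "c"))
typeof(m)          # "character"
```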

df[5000:7000, 80]

extracts the 80th column as a vector and then takes the values at positions 5000-7000 from it. Pulling a column out of a list does not require copying it, and a simple vector is cheap for R to subset, which is why the data.frame variants that select the column first are all quick.
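As a rough check that these column-first routes are interchangeable, the following sketch compares them on a small made-up data.frame. The size and the names d, rows, via_bracket, via_list and via_dollar are hypothetical:

```r
# Small stand-in data.frame (hypothetical size)
d <- as.data.frame(matrix(runif(20 * 4), ncol = 4))
names(d) <- paste0("V", 1:4)
rows <- 5:7

via_bracket <- d[rows, 3]    # two-index form, default drop = TRUE
via_list    <- d[[3]][rows]  # take the column from the list, then the rows
via_dollar  <- d$V3[rows]    # same idea, selecting the column by name

identical(via_bracket, via_list)         # TRUE
identical(via_bracket, via_dollar)       # TRUE
is.data.frame(d[rows, 3, drop = FALSE])  # TRUE: drop = FALSE keeps a data.frame
```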

If you choose drop = FALSE, however, you force R not to work with a simple vector object when selecting the 80th column, but to treat it as a data.frame/list object instead. Lists are the most general and flexible type of R object, as there are no restrictions on their size and entries, but this comes at a price: they are the most difficult and time-consuming to handle, as you can observe when comparing

mtx[5000:7000, , drop = FALSE][, "V80"]
df[5000:7000, , drop = FALSE][, "V80"]

From the data frame you obtain another data.frame/list as the intermediate result (here 2001 rows of all 100 columns, with each column subset separately), whereas the matrix still returns a matrix, which is faster to handle than the list.
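The effect of the order of operations can be reproduced on a smaller scale with base R only. This is a sketch under assumptions: the sizes, the column "V8", the repetition count and the helper time_it are all made up for illustration, and the timings will vary by machine, so no expected numbers are shown:

```r
# Scaled-down version of the question's objects (hypothetical sizes)
n <- 1e5
mtx <- matrix(runif(n * 20), ncol = 20)
colnames(mtx) <- paste0("V", 1:20)
df <- as.data.frame(mtx)
rows <- 5000:7000

# Both orders give the same values ...
rows_first <- df[rows, , drop = FALSE][, "V8"]  # builds a 2001 x 20 data.frame first
col_first  <- df[["V8"]][rows]                  # takes one column, then 2001 values
stopifnot(identical(rows_first, col_first))

# ... but do very different amounts of work.
# time_it is a hypothetical helper: elapsed seconds for `reps` calls of f.
time_it <- function(f, reps = 200) {
  system.time(for (k in seq_len(reps)) f())[["elapsed"]]
}

t_rows_first <- time_it(function() df[rows, , drop = FALSE][, "V8"])
t_col_first  <- time_it(function() df[["V8"]][rows])
t_mtx_rows   <- time_it(function() mtx[rows, , drop = FALSE][, "V8"])
t_mtx_direct <- time_it(function() mtx[rows, "V8"])

print(c(df_rows_first = t_rows_first, df_col_first = t_col_first,
        mtx_rows_first = t_mtx_rows, mtx_direct = t_mtx_direct))
```

As a general habit, selecting the column first and the rows second avoids building a large intermediate object in either structure.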

answered Nov 14 '22 at 22:11 by Alex2006