In base R, why is selecting column, then filtering rows faster than vice versa: filter rows, then select column?

The code below changes values in column $type, based on values in column $weight.

n <- 1e3; m <- n*10
Treshold <- 50
wts      <- runif(m)
df  <- data.frame(id=seq_len(m), weight=wts * 100, type='L')

library(microbenchmark)
microbenchmark(
  "df-col-row" = (df$type[df$weight < Treshold]   <- "M"),
  "df-row-col" = (df[df$weight < Treshold, ]$type <- "M")
)
#
#Unit: microseconds
#       expr   min     lq    mean median     uq    max neval
# df-col-row  80.6  87.65 145.429  89.55 104.55 5109.1   100
# df-row-col 564.9 586.10 618.496 592.40 618.90 1601.0   100

Why is the first alternative faster than the second?

Update 1
As expected, the difference increases when more columns are added.

d9  <- data.frame(type='L', weight=wts * 100, c3=3, c4=4, c5=5, c6=6, c7=7, c8=8, c9=9)
microbenchmark(
  "df-row-9col" = (d9[d9$weight < Treshold, ]$type <- "M")
)
# Unit: microseconds
#         expr   min      lq     mean median      uq  max neval
# df-row-9col 950.1 1091.55 1267.982 1111.1 1172.45 5806   100

Update 2
In the first alternative, df is copied once; in the second alternative, it is copied twice.

tracemem(df)
df$type[df$weight < Treshold]   <- "M"    # Alt 1.
#tracemem[0x000002c92d2b87c8 -> 0x000002c92d2b9498]: $<-.data.frame $<- 

df[df$weight < Treshold, ]$type <- "M"    # Alt 2.
#tracemem[0x000002c92d2b9498 -> 0x000002c92d2b9ad8]: 
#tracemem[0x000002c92d2b9ad8 -> 0x000002c92d2c47d8]: [<-.data.frame [<-
untracemem(df)
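
Presumably this happens because R expands the second form into nested replacement calls (see "Subset assignment" in the R Language Definition). The sketch below is only illustrative, with rows as a helper variable I introduce for readability, but it shows where the extra copy would come from:

rows <- df$weight < Treshold
tmp  <- df[rows, ]       # `[.data.frame`: extract the row subset
tmp$type <- "M"          # `$<-`: modify the column in the extracted copy
df[rows, ] <- tmp        # `[<-.data.frame`: write the subset back, copying df again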
1 Answer

Keep in mind that data.frames are built on lists. For instance, if you unclass(mtcars), you see it’s just a list of vectors:

$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

$cyl
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

$disp
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

# ... 
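
A quick way to convince yourself that a data frame really is a plain list underneath, with the class attribute layered on top, is the following check in base R:

typeof(mtcars)   # "list": the underlying storage is a plain list
class(mtcars)    # "data.frame": the class attribute adds data.frame behaviour
length(mtcars)   # 11: one element per column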

This makes it a bit more intuitive why it’s faster to select the column before filtering: it’s equivalent to selecting a subset of vectors from a list of vectors and then subsetting just those vectors, as opposed to subsetting every vector in the list and then keeping only some of them.
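
For example, here is a toy version of that idea (the column values and threshold are made up for illustration); both orders return the same values, but the second touches every column:

toy  <- list(type = rep("L", 5), weight = c(10, 60, 30, 80, 20))
keep <- toy$weight < 50

identical(
  toy$type[keep],                          # subset only the one vector we need
  lapply(toy, function(x) x[keep])$type    # subset every vector, then keep one
)
# [1] TRUE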

If we unclass() your example data and operate on the underlying list, we get similarly disparate performance, and it’s clearer why the second operation is more expensive:

set.seed(13)

library(microbenchmark)

undf <- unclass(df)

microbenchmark(
  "undf-col-row" = undf$type[undf$weight < Treshold],
  "undf-row-col" = lapply(undf, \(x) x[undf$weight < Treshold])$type
)
Unit: microseconds
         expr    min       lq     mean   median       uq       max neval cld
 undf-col-row  72.18  89.1200 103.0429 101.5650 112.1550   254.341   100  a 
 undf-row-col 201.87 269.9105 433.0011 282.4355 310.7255 14712.019   100   b