The code below changes values in column $type, based on values in column $weight.
n <- 1e3; m <- n*10
Treshold <- 50
wts <-runif(m)
df <- data.frame(id=seq_len(m), weight=wts * 100, type='L')
library(microbenchmark)
microbenchmark(
"df-col-row" = (df$type[df$weight < Treshold] <- "M"),
"df-row-col" = (df[df$weight < Treshold, ]$type <- "M")
)
#
#Unit: microseconds
# expr min lq mean median uq max neval
# df-col-row 80.6 87.65 145.429 89.55 104.55 5109.1 100
# df-row-col 564.9 586.10 618.496 592.40 618.90 1601.0 100
Why is the first alternative faster than the second?
Update 1
As expected, the difference increases when more columns are added.
d9 <- data.frame(type='L', weight=wts * 100, c3=3, c4=4, c5=5, c6=6, c7=7, c8=8, c9=9)
microbenchmark(
"df-row-9col" = (d9[d9$weight < Treshold, ]$type <- "M")
)
# nit: microseconds
# expr min lq mean median uq max neval
# df-row-9col 950.1 1091.55 1267.982 1111.1 1172.45 5806 100
Update 2
In the first alternative, df
is copied once, in the second alternative twice.
tracemem(df)
df$type[df$weight < Treshold] <- "M" # Alt 1.
#tracemem[0x000002c92d2b87c8 -> 0x000002c92d2b9498]: $<-.data.frame $<-
df[df$weight < Treshold, ]$type <- "M" # Alt 2.
#tracemem[0x000002c92d2b9498 -> 0x000002c92d2b9ad8]:
#tracemem[0x000002c92d2b9ad8 -> 0x000002c92d2c47d8]: [<-.data.frame [<-
untracemem(df)
Keep in mind that data.frame
s are built on list
s. For instance, if you unclass(mtcars)
, you see it’s just a list of vectors:
$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
$disp
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
# ...
This makes it a bit more intuitive why it’s faster to select columns before filtering. It’s equivalent to selecting a subset of vectors from a list of vectors, then subsetting just those vectors. As opposed to subsetting every vector in the list, then keeping only some of them.
If you unclass()
your example data and operate on the underlying list, we get similarly disparate performance, but it’s clearer why the second operation is more expensive:
set.seed(13)
library(microbenchmark)
undf <- unclass(df)
microbenchmark(
"undf-col-row" = undf$type[undf$weight < Treshold],
"undf-row-col" = lapply(undf, \(x) x[undf$weight < Treshold])$type
)
Unit: microseconds
expr min lq mean median uq max neval cld
undf-col-row 72.18 89.1200 103.0429 101.5650 112.1550 254.341 100 a
undf-row-col 201.87 269.9105 433.0011 282.4355 310.7255 14712.019 100 b
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With