It appears that selecting column(s) from a data.table with [.data.table results in a copy of the underlying vector(s). I am talking about very simple column selection by name: there are no expressions to compute in j and no rows to subset in i. Even more strangely, column subsetting in a data.frame appears not to make any copies. I am using data.table version 1.10.4. A simple example with details and benchmarks is provided below. My question is: why does [.data.table copy the column vectors, and is there a way to avoid the copy? The intended use case involves large datasets, so avoiding extra copies is a must (especially since base R seems to already support this).
library(data.table)
set.seed(12345)
cpp_dt <- data.table(a = runif(1e6), b = rnorm(1e6), c = runif(1e6))
cols=c("a","c")
## naive / data.frame style of column selection
## leads to a copy of the column vectors in cols
subset_cols_1=function(dt,cols){
return(dt[,cols,with=FALSE])
}
## alternative syntax, still results in a copy
subset_cols_2=function(dt,cols){
return(dt[,..cols])
}
## work-around that uses data.frame column selection,
## appears to avoid the copy
subset_cols_3=function(dt,cols){
setDF(dt)
subset=dt[,cols]
setDT(subset)
setDT(dt)
return(subset)
}
## another approach that makes a "shallow" copy of the data.table
## then NULLs the not needed columns by reference
## appears to also avoid the copy
subset_cols_4=function(dt,cols){
subset=dt[TRUE]
other_cols=setdiff(names(subset),cols)
set(subset,j=other_cols,value=NULL)
return(subset)
}
subset_1=subset_cols_1(cpp_dt,cols)
subset_2=subset_cols_2(cpp_dt,cols)
subset_3=subset_cols_3(cpp_dt,cols)
subset_4=subset_cols_4(cpp_dt,cols)
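As a quick sanity check that only the memory behaviour differs, all four functions return the same data; a minimal check using base identical on the column lists:
## all four subsets contain the same values; only the memory
## behaviour (copied vs shared vectors) differs
stopifnot(
  identical(as.list(subset_1),as.list(subset_2)),
  identical(as.list(subset_1),as.list(subset_3)),
  identical(as.list(subset_1),as.list(subset_4))
)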
Now let's look at the memory allocation and compare to the original data.
.Internal(inspect(cpp_dt)) # original data, keep an eye on the 1st and 3rd vectors
# @7fe8ba278800 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=3, tl=1027)
# @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @10f1a3000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) -0.947317,-0.636669,0.167872,-0.206986,0.411445,...
# @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
Using the [.data.table method to subset the columns:
.Internal(inspect(subset_1)) # looks like data.table is making a copy
# @7fe8b9f3b800 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
# @114cb0000 14 REALSXP g0c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @1121ca000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
Another syntax version that still uses [.data.table and still makes a copy:
.Internal(inspect(subset_2)) # same, still copy
# @7fe8b6402600 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
# @115452000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @1100e7000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
Using a sequence of setDF, followed by [.data.frame, followed by setDT. Look, the vectors a and c are no longer copied! Does that mean the base R method is more efficient / has a smaller memory footprint?
.Internal(inspect(subset_3)) # "[.data.frame" is not making a copy!!
# @7fe8b633f400 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1026)
# @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
Another approach is to make a shallow copy of the data.table, then NULL all the extra columns by reference in the new data.table. Again no copies are made.
.Internal(inspect(subset_4)) # 4th approach seems to also avoid the copy
# @7fe8b924d800 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1027)
# @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
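A more compact way to compare the columns is data.table's exported address() helper, which returns the memory address of an object; matching addresses mean the vector is shared rather than copied. A minimal sketch over the objects created above:
## same logic as the inspect() calls: compare where each column lives
sapply(cols,function(col) c(
  original=address(cpp_dt[[col]]),
  subset_1=address(subset_1[[col]]),
  subset_3=address(subset_3[[col]]),
  subset_4=address(subset_4[[col]])))
## per the inspect() output: subset_1 differs from the original (copy),
## while subset_3 and subset_4 match it (shared vectors)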
Now let's look at the benchmarks of these four approaches. It looks like "[.data.frame" (subset_cols_3) is a clear winner.
library(microbenchmark)
microbenchmark({subset_cols_1(cpp_dt,cols)},
{subset_cols_2(cpp_dt,cols)},
{subset_cols_3(cpp_dt,cols)},
{subset_cols_4(cpp_dt,cols)},
times=100)
# Unit: microseconds
# expr min lq mean median uq max neval
# { subset_cols_1(cpp_dt, cols) } 4772.092 5128.7395 8956.7398 7149.447 10189.397 53117.358 100
# { subset_cols_2(cpp_dt, cols) } 4705.383 5107.1690 8977.1816 6680.666 9206.164 53523.191 100
# { subset_cols_3(cpp_dt, cols) } 148.659 177.9595 285.4926 250.620 283.414 4422.968 100
# { subset_cols_4(cpp_dt, cols) } 193.912 241.9010 531.8308 336.467 384.844 20061.864 100
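For completeness, if the bench package is available, bench::mark() also reports the memory allocated by each expression, which makes the copy visible directly. A minimal sketch (check=FALSE because the results are equivalent in content but may differ in attributes):
library(bench)
## mem_alloc should be about 16 MB for the copying versions
## (two double vectors of length 1e6 at 8 bytes each) and near zero otherwise
bench::mark(
  subset_cols_1(cpp_dt,cols),
  subset_cols_2(cpp_dt,cols),
  subset_cols_3(cpp_dt,cols),
  subset_cols_4(cpp_dt,cols),
  check=FALSE
)[,c("expression","median","mem_alloc")]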
Good question. But why do you need to subset a data.table like that? We really need to see what you are doing next: the bigger picture. It's that bigger picture that we probably have a different way for in data.table than the base R idiom. Roughly illustrating, with probably a bad example:
DT[region=="EU", lapply(.SD, sum), .SDcols=10:20]
rather than the base R idiom of taking a subset and then doing something next (here, apply) on the result outside:
apply(DT[DT$region=="EU", 10:20], 2, sum)
In general, we want to encourage doing as much as possible inside one [...] so that data.table sees the i, j and by together in one [...] operation and can optimize the combination. When you subset columns and then do the next thing outside afterwards, it requires more software complexity to optimize. In most cases, most of the computational cost is inside the first [...], which reduces the data to a relatively insignificant size.
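For example, with the datatable.verbose option turned on you can watch how a combined call is handled internally (sketched with the same hypothetical DT as above):
## verbose mode prints what the query optimizer does with i, j and .SDcols
options(datatable.verbose=TRUE)
DT[region=="EU", lapply(.SD, sum), .SDcols=10:20]
options(datatable.verbose=FALSE)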
With that said, in addition to Frank's comment about shallow, we're also waiting to see how the ALTREP project pans out. That improves reference counting in base R and may enable := to know reliably whether a column it is operating on needs to be copied-on-write first or not. Currently, := always updates by reference, so it would update both data.tables if selecting some whole columns did not take a deep copy (it is deliberate that it does copy, for that reason). If := is not used inside [...], then [...] always returns a new result which is safe to use := on, which is quite a straightforward rule currently, even if all you're doing is selecting a few whole columns for some reason.
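To see why that deep copy is deliberate, consider the setDF/setDT workaround from the question: subset_3 shares its vectors with cpp_dt, so an update by reference writes through to the original. A minimal sketch reusing those objects:
## subset_3's column "a" is the very same vector as cpp_dt's column "a"
before=head(cpp_dt$a)
subset_3[,a:=a*2] # := always updates in place
identical(before,head(cpp_dt$a)) # FALSE: the original changed too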
We really need to see the bigger picture, please: what you're doing afterwards on the subset of columns. Having that clear would help to raise the priority of either investigating ALTREP or perhaps doing our own reference counting for this case.