Why does selecting column(s) from a data.table result in a copy?

Tags: r, data.table

It appears that selecting column(s) from a data.table with [.data.table results in a copy of the underlying vector(s). I am talking about very simple column selection by name: there are no expressions to compute in j and no rows to subset in i. Even more strangely, column subsetting in a data.frame appears not to make any copies. I am using data.table version 1.10.4. A simple example with details and benchmarks is provided below. My questions are:

Am I doing something wrong?
Is this a bug or is this the intended behavior?
If this is intended, what is the best approach to subset a data.table by columns and avoid the extra copy?

The intended use case involves a large dataset, so avoiding extra copies is a must (especially since base R seems to already support this).

library(data.table)
set.seed(12345)
cpp_dt <- data.table(a = runif(1e6), b = rnorm(1e6), c = runif(1e6))
cols=c("a","c")

## naive / data.frame style of column selection
## leads to a copy of the column vectors in cols
subset_cols_1=function(dt,cols){
  return(dt[,cols,with=FALSE])
}

## alternative syntax, still results in a copy
subset_cols_2=function(dt,cols){
  return(dt[,..cols])
}

## work-around that uses data.frame column selection,
## appears to avoid the copy
subset_cols_3=function(dt,cols){
  setDF(dt)
  subset=dt[,cols]
  setDT(subset)
  setDT(dt)
  return(subset)
}

## another approach that makes a "shallow" copy of the data.table
## then NULLs the not needed columns by reference
## appears to also avoid the copy
subset_cols_4=function(dt,cols){
  subset=dt[TRUE]
  other_cols=setdiff(names(subset),cols)
  set(subset,j=other_cols,value=NULL)
  return(subset)
}

subset_1=subset_cols_1(cpp_dt,cols)
subset_2=subset_cols_2(cpp_dt,cols)
subset_3=subset_cols_3(cpp_dt,cols)
subset_4=subset_cols_4(cpp_dt,cols)

Now let's look at the memory allocation and compare it to the original data.

.Internal(inspect(cpp_dt)) # original data, keep an eye on 1st and 3rd vector
# @7fe8ba278800 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=3, tl=1027)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f1a3000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) -0.947317,-0.636669,0.167872,-0.206986,0.411445,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Using the [.data.table method to subset the columns:

.Internal(inspect(subset_1)) # looks like data.table is making a copy
# @7fe8b9f3b800 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
#   @114cb0000 14 REALSXP g0c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @1121ca000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Another syntax that still uses [.data.table and still makes a copy:

.Internal(inspect(subset_2)) # same, still copy
# @7fe8b6402600 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
#   @115452000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @1100e7000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Using a sequence of setDF, followed by [.data.frame and setDT: the vectors a and c are no longer copied! Does this mean the base R method is more efficient and has a smaller memory footprint?

.Internal(inspect(subset_3)) # "[.data.frame" is not making a copy!!
# @7fe8b633f400 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1026)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Another approach is to make a shallow copy of the data.table, then NULL all the extra columns by reference in the new data.table. Again no copies are made.

.Internal(inspect(subset_4)) # 4th approach seems to also avoid the copy
# @7fe8b924d800 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1027)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
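
As a side note, the same address comparison can be sketched without reading .Internal(inspect()) output, using data.table's address() function. This is only a rough check on a small made-up table; shares_columns is a hypothetical helper written for this illustration, not part of data.table, and what it reports for the column selection may depend on the data.table version.

```r
library(data.table)

dt <- data.table(a = runif(10), b = rnorm(10), c = runif(10))
cols <- c("a", "c")

# Hypothetical helper (not part of data.table): for each column of `sub`,
# report whether its vector sits at the same memory address as the
# same-named column of `full`, i.e. whether the vector is shared.
shares_columns <- function(full, sub) {
  vapply(names(sub),
         function(nm) identical(address(full[[nm]]), address(sub[[nm]])),
         logical(1))
}

same_object <- dt                          # plain assignment, no copy at all
deep_copy   <- copy(dt)                    # explicit deep copy
selected    <- dt[, cols, with = FALSE]    # the selection under discussion

print(shares_columns(dt, same_object))
print(shares_columns(dt, deep_copy))
print(shares_columns(dt, selected))
```

A FALSE for a column means the subset holds its own copy of that vector, which is what the inspect() output above shows for subset_1 and subset_2.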

Now let's look at the benchmarks of these four approaches. It looks like "[.data.frame" (subset_cols_3) is the clear winner.

library(microbenchmark)
microbenchmark({subset_cols_1(cpp_dt,cols)},
               {subset_cols_2(cpp_dt,cols)},
               {subset_cols_3(cpp_dt,cols)},
               {subset_cols_4(cpp_dt,cols)},
               times=100)

# Unit: microseconds
#                                 expr      min        lq      mean   median        uq       max neval
#  {     subset_cols_1(cpp_dt, cols) } 4772.092 5128.7395 8956.7398 7149.447 10189.397 53117.358   100
#  {     subset_cols_2(cpp_dt, cols) } 4705.383 5107.1690 8977.1816 6680.666  9206.164 53523.191   100
#  {     subset_cols_3(cpp_dt, cols) }  148.659  177.9595  285.4926  250.620   283.414  4422.968   100
#  {     subset_cols_4(cpp_dt, cols) }  193.912  241.9010  531.8308  336.467   384.844 20061.864   100


asked Aug 25 '17 01:08

Oleg Sofrygin

1 Answer

It's been a while since I thought about this, but here goes.

Good question. But why do you need to subset a data.table like that? We really need to see what you are doing next: the bigger picture. For that bigger picture, data.table probably has a different idiom than base R.

To illustrate roughly, with what is probably a bad example:

DT[region=="EU", lapply(.SD, sum), .SDcols=10:20]

rather than the base R idiom of taking a subset and then doing the next step (here, apply) on the result outside:

apply(DT[DT$region=="EU", 10:20], 2, sum)

In general, we want to encourage doing as much as possible inside one [...] so that data.table sees the i, j and by together in one [...] operation and can optimize the combination. When you subset columns and then do the next thing outside afterwards, it takes more software complexity to optimize. In most cases, most of the computational cost is inside the first [...], which reduces the data to a relatively insignificant size.
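
A minimal runnable sketch of the two idioms, on a small made-up table (the data and the column names x and y are assumptions for illustration only):

```r
library(data.table)

# Made-up example data: a grouping column and two numeric columns.
DT <- data.table(region = c("EU", "EU", "US", "EU"),
                 x = c(1, 2, 3, 4),
                 y = c(10, 20, 30, 40))

# data.table idiom: filter rows, pick columns and aggregate in one [...] call.
inside <- DT[region == "EU", lapply(.SD, sum), .SDcols = c("x", "y")]

# Base R style: materialise the subset first, then work on it outside.
sub <- DT[region == "EU", c("x", "y"), with = FALSE]
outside <- sapply(sub, sum)

print(inside)
print(outside)
```

Both give the same sums; the difference is that the first form never builds the intermediate subset as a separate object.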

With that said, in addition to Frank's comment about shallow, we're also waiting to see how the ALTREP project pans out. That improves reference counting in base R and may enable := to know reliably whether a column it is operating on needs to be copied first or not. Currently, := always updates by reference, so it would update both data.tables if selecting some whole columns did not take a deep copy (it is deliberate that it does copy, for that reason). If := is not used inside [...] then [...] always returns a new result which is safe to use := on, which is quite a straightforward rule currently. That holds even if all you're doing is selecting a few whole columns for some reason.
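
A rough sketch of that hazard, on a small made-up table. It builds the shared-column subset with the setDF / [.data.frame / setDT work-around from the question, which is an assumption about how such a subset would arise in practice:

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# Build a subset whose column vectors are shared with dt, using the
# work-around from the question (base R column selection on a data.frame).
setDF(dt)
shared <- dt[, c("a", "c")]
setDT(shared)
setDT(dt)

# A deep copy owns its vectors, so := on it cannot reach dt.
safe <- copy(shared)
safe[1L, a := -1]
print(dt$a[1L])

# := assigns in place into the existing vector, which dt also points to,
# so this update is expected to show up in dt as well.
shared[1L, a := 999]
print(dt$a[1L])
```

This is why a subset that avoids the copy is only safe as long as neither table is later modified by reference.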

We really need to see the bigger picture please: what you're doing afterwards on the subset of columns. Having that clear would help to raise the priority in either investigating ALTREP or perhaps doing our own reference count for this case.


answered Sep 30 '22 16:09

Matt Dowle
