I have a large data frame (on the order of several GB) that I'd like to convert to a data.table. Using as.data.table creates a copy of the data frame, which means I need available memory to be at least twice the size of the data. Is there a way to do the conversion without a copy?
Here's a simple example to demonstrate:
    library(data.table)
    N <- 1e6
    K <- 1e2
    data <- as.data.frame(rep(data.frame(rnorm(N)), K))
    gc(reset=TRUE)
    tracemem(data)
    data <- as.data.table(data)
    gc()
With output:
    library(data.table)
    # data.table 1.8.10  For help type: help("data.table")
    N <- 1e6
    K <- 1e2
    data <- as.data.frame(rep(data.frame(rnorm(N)), K))
    gc(reset=TRUE)
    #             used  (Mb) gc trigger   (Mb)  max used  (Mb)
    # Ncells    303759  16.3     597831   32.0    303759  16.3
    # Vcells 100442572 766.4  402928632 3074.2 100442572 766.4
    tracemem(data)
    # [1] "<0x363fda0>"
    data <- as.data.table(data)
    # tracemem[0x363fda0 -> 0x31e4260]: copy as.data.table.data.frame as.data.table
    gc()
    #             used  (Mb) gc trigger   (Mb)  max used   (Mb)
    # Ncells    304519  16.3     597831   32.0    306162   16.4
    # Vcells 100444242 766.4  322342905 2459.3 200933219 1533.0
Method 1: Using setDT(). The setDT() function coerces a data.frame or a list to a data.table by modifying the original object by reference.
The setDT function takes care of this issue by converting lists (both named and unnamed) and data.frames by reference instead. That is, the input object is modified in place and no copy is made.
This is available from v1.9.0+. From NEWS:
o Following this S.O. post, a function setDT is now implemented that takes a list (named and/or unnamed), data.frame (or data.table) as input and returns the same object as a data.table by reference (without any copy). See ?setDT examples for more.
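As a small illustration of the list inputs mentioned in the NEWS item, here is a minimal sketch. The objects `lst` and `ul` and their contents are made up for this example and are not from the original post.

```r
library(data.table)  # assumes v1.9.0 or later

# A named list of equal-length vectors (hypothetical example data)
lst <- list(id = 1:3, value = c(0.5, 1.5, 2.5))
setDT(lst)             # lst itself becomes a data.table
is.data.table(lst)

# An unnamed list: setDT fills in default column names
ul <- list(1:3, letters[1:3])
setDT(ul)
names(ul)
```

Note that setDT is called for its side effect here; there is no need to assign its result back to the variable.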
This is in accordance with the data.table naming convention: all set* functions modify their input by reference. := is the only other operation that also modifies by reference.
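To make the by-reference convention concrete, here is a short sketch using two other set* functions (setnames and setcolorder) together with :=. The table `dt` and its column names are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical example table
dt <- data.table(a = 1:3, b = c(2.5, 3.5, 4.5))

setnames(dt, "b", "score")                     # rename a column in place
dt[, double_a := a * 2L]                       # add a column in place with :=
setcolorder(dt, c("score", "a", "double_a"))   # reorder columns in place

names(dt)
```

None of these calls assigns a result back to `dt`; the existing object is updated each time.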
    require(data.table) # v1.9.0+
    setDT(data) # converts data which is a data.frame to data.table *by reference*
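One way to check that the column data is not copied is to compare the memory address of each column before and after the conversion, using data.table's address() function. This is only a sketch: the data frame `df` and its size are made up, and it checks the column vectors rather than the outer list, since setDT may reallocate the small vector of column pointers.

```r
library(data.table)

N <- 1e5
df <- data.frame(x = rnorm(N), y = runif(N))  # hypothetical example data

# Addresses of the column vectors before conversion
x_before <- address(df$x)
y_before <- address(df$y)

setDT(df)  # convert in place

# If no column was copied, the addresses should be unchanged
unchanged <- identical(c(x_before, y_before),
                       c(address(df$x), address(df$y)))
unchanged
```

Running the same comparison with df <- as.data.table(df) in place of setDT(df) is a quick way to see the difference on the data.table version in use.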
See history for older (now outdated) answers.