I have created a function that is called to read in and then return a data.table:
read.in.data <- function(filename)
{
  library(data.table)
  data.holder <- read.table(filename, skip = 1)
  return(data.table(data.holder))
}
Watching my RAM usage while the function runs, R appears to process this in two steps (or at least that is my best guess at what's going on). For example, when I load a 1.5 GB file (15 columns, 136 characters per row), R seems to 1) read the data in, using 1.5 GB of RAM, and then 2) use another 1.5 GB of RAM for the return.
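For what it's worth, the two allocations can also be seen from inside R with gc(). This is only a minimal sketch; "bigfile.txt" is a placeholder for the real file:
library(data.table)
gc(reset = TRUE)                                      # reset the "max used" counters
data.holder <- read.table("bigfile.txt", skip = 1)    # first copy: the data.frame
print(gc())                                           # "max used" now reflects one copy of the data
dt <- data.table(data.holder)                         # second copy made by data.table()
print(gc())                                           # "max used" grows by roughly another full copy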
Are there any tricks for writing a function that creates a data.table (or data.frame, for that matter) and returns it without duplicating it in memory? Or must I do all of the processing for the data.table within the function where it is created?
Observations: If I run this code twice in a row, the memory is not cleared, and since I only have 8 GB of RAM, the function fails. Skipping the step of storing the result of read.table in a variable (as shown below) gives no benefit, and I wouldn't want to do that anyway, since I'd like to be able to clean up the data.table before returning it. A fix to this problem would also let me process larger files without running out of memory.
short.read.trk <- function(fntrk)
{
  library(data.table)
  return(data.table(read.table(fntrk, skip = 1)))
}
If saving memory is mostly what you're after, you could convert it one column at a time:
library(data.table)

read.in.data <- function(filename)
{
  data.holder <- read.table(filename, skip = 1)

  # Seed the data.table with the first column, then release it from the data.frame
  dt <- data.table(data.holder[[1]])
  names(dt) <- names(data.holder)[1]
  data.holder[[1]] <- NULL

  # Move the remaining columns across one at a time, freeing each as soon as it is copied
  for (n in names(data.holder)) {
    set(dt, j = n, value = data.holder[[n]])   # assign the column by reference
    data.holder[[n]] <- NULL
  }
  return(dt)
}
(untested)
It won't be any faster; in fact, it's probably slower. But it should be less wasteful of memory, because only one column is duplicated at any moment: each original column is freed as soon as it has been copied over, so peak usage is roughly one full copy of the data plus one extra column, rather than two full copies.
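Alternatively, if your version of data.table provides setDT() and fread() (both are part of the package, but check your version), you can avoid the second copy altogether: setDT() converts a data.frame to a data.table by reference, and fread() reads the file straight into a data.table. A rough, untested sketch; the function names here are just for illustration:
library(data.table)

read.in.data.setdt <- function(filename)
{
  data.holder <- read.table(filename, skip = 1)
  setDT(data.holder)      # converts in place, by reference, so no second copy
  return(data.holder)     # now a data.table
}

read.in.data.fread <- function(filename)
{
  return(fread(filename, skip = 1))   # reads directly into a data.table
}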