Tricks to avoid duplication of memory allocation when returning data.table or data.frame in R?

I have created a function that is called to read in and then return a data.table:

read.in.data <- function(filename)
{
    library(data.table)
    data.holder<-read.table(filename, skip=1)
    return(data.table(data.holder))
}

I have noticed from observing my RAM as the function processes that R seems to process this in 2 steps (or at least this is my best guess for what's going on). For example, when I load a 1.5 GB file (15 columns with a total of 136 characters per row), R seems to 1) read in the data and use 1.5 GB of RAM, and then 2) use another 1.5 GB of RAM for the return.
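One way to check this from inside R, rather than by watching the system monitor, is to compare memory in use before a call with the peak reported by gc() afterwards. The following is only a rough sketch: the helper name peak.extra.mb and the stand-in workload are made up for illustration, and the figure it reports is whatever R's own bookkeeping records, not total process RAM.

```r
# Hypothetical helper: extra vector-heap memory (in Mb) used at peak while
# evaluating `expr`, relative to what was in use just before the call.
peak.extra.mb <- function(expr) {
  before <- gc(reset = TRUE)   # collect garbage and reset the "max used" counters
  force(expr)                  # evaluate the expression now
  after <- gc()                # collect again; last column holds "max used" in Mb
  unname(after["Vcells", ncol(after)] - before["Vcells", 2])
}

# Stand-in workload (an assumption for illustration): allocate 1e7 doubles.
extra_mb <- peak.extra.mb(x <- numeric(1e7))
print(extra_mb)

# For the function above, the call would look something like this
# ("myfile.txt" is a placeholder file name):
# extra_mb <- peak.extra.mb(dt <- read.in.data("myfile.txt"))
```

If the reported peak is roughly twice the size of the final object, that would be consistent with a full copy being made somewhere along the way.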

Are there some tricks to creating a function to create a data.table (or data.frame for that matter) and return the data.table without requiring duplication in memory? Or must I do all processing for the data.table within the function where the table is created?

Observations: If I run this code twice in a row, the memory is not cleared; since I only have 8 GB of RAM, the function fails. If I skip the step of storing the result of read.table in a variable (as shown below), I don't get any benefit. I wouldn't want to do this anyway, since I'd like to be able to clean up the data.table before returning it. A fix to my problem would also enable me to process larger files without running out of memory.

short.read.trk <- function(fntrk)
{
    library(data.table)
    return(data.table(read.table(fntrk, skip=1)))
}

Docuemada asked Nov 03 '22 05:11


1 Answer

If saving memory is mostly what you're after, you could convert the data one column at a time:

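library(data.table)
read.in.data <- function(filename)
{
  data.holder <- read.table(filename, skip=1)
  # Start the data.table from the first column, then drop it from the data.frame
  dt <- data.table(data.holder[[1]])
  setnames(dt, names(data.holder)[1])
  data.holder[[1]] <- NULL

  # Move the remaining columns over one at a time, freeing each as we go
  for(n in names(data.holder)) {
    set(dt, j = n, value = data.holder[[n]])   # adds a column named by the value of n
    data.holder[[n]] <- NULL
  }
  return(dt)
}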

(untested)

It won't be any faster; in fact, it's probably slower. But it should be less wasteful of memory.
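If the goal is only to avoid the second copy, a possibly simpler route is to convert the data.frame in place with data.table's setDT(), or to skip read.table altogether and read with fread(), which returns a data.table directly. A minimal sketch, assuming a space-delimited file with one line to skip and no header row; the function names read.in.data.setDT and read.in.data.fread and the small demo file are made up for illustration:

```r
library(data.table)

# Hypothetical variant 1: read with read.table, then convert by reference.
read.in.data.setDT <- function(filename) {
  data.holder <- read.table(filename, skip = 1)
  setDT(data.holder)   # turns the data.frame into a data.table without copying its columns
  # ... any cleanup with := or set() could go here ...
  data.holder
}

# Hypothetical variant 2: read straight into a data.table.
# sep = " " and header = FALSE are assumptions about the file layout.
read.in.data.fread <- function(filename) {
  fread(filename, skip = 1, header = FALSE, sep = " ")
}

# Small demo file standing in for the real data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("header line to skip", "1 a 2.5", "2 b 3.5", "3 c 4.5"), tmp)

dt1 <- read.in.data.setDT(tmp)
dt2 <- read.in.data.fread(tmp)
print(dt1)
print(dt2)
```

Because setDT() changes the existing object rather than building a new one, the cleanup the question mentions can still happen inside the function before the table is returned.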

Ken Williams answered Nov 10 '22 04:11