
Is a copy made when function returns a data.table?

Tags: r, data.table

I am updating a set of functions that previously only accepted data.frame objects to work with data.table arguments.

I decided to implement the function using R's method dispatch so that the old code using data.frames will still work with the updated functions. In one of my functions, I take in a data.frame as input, modify it, and then return the modified data.frame. I created a data.table implementation as well. For example:

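# The functions
# Generic: dispatches on the class of d
foo <- function(d) {
  UseMethod("foo")
}

# Method for plain data.frames
foo.data.frame <- function(d) {
  # <Do something>
  return(d)
}

# Method for data.tables (class c("data.table", "data.frame"), so this one is picked first)
foo.data.table <- function(d) {
  # <Do something>
  return(d)
}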

I know that data.table works by making changes without copying, and I implemented foo.data.table while keeping that in mind. However, I return the data.table object at the end of the function because I want my old scripts to work with the new data.table objects. Will this make a copy of the data.table? How can I check? According to the documentation, one has to be very explicit to create a copy of a data.table, but I am not sure in this case.

The reason I want to return something when I do not have to with data.tables:

My old scripts look like this

someData <- read.table(...)
...
someData <- foo(someData)

I want the scripts to be able to run with data.tables by just changing the data ingest lines. In other words, I want the script to work by just changing someData <- read.table(...) to someData <- fread(...).

ialm asked Apr 01 '14 18:04



1 Answer

Thanks to Arun for his answer in the comments. I will use the example from his comment to answer the question.

One can check if copies are being made by using the tracemem function to track an object in R. From the help file of the function, ?tracemem, the description says:

This function marks an object so that a message is printed whenever the internal code copies the object. It is a major cause of hard-to-predict memory use in R.

For example:

# Using a data.frame
df <- data.frame(x=1:5, y=6:10)
tracemem(df)
## [1] "<0x32618220>"
df$y[2L] <- 11L
## tracemem[0x32618220 -> 0x32661a98]: 
## tracemem[0x32661a98 -> 0x32661b08]: $<-.data.frame $<- 
## tracemem[0x32661b08 -> 0x32661268]: $<-.data.frame $<- 
df
##   x  y
## 1 1  6
## 2 2 11
## 3 3  8
## 4 4  9
## 5 5 10

# Using a data.table
library(data.table)
dt <- data.table(x=1:5, y=6:10)
tracemem(dt)
## [1] "<0x5fdab40>"
set(dt, i=2L, j=2L, value=11L) # No memory output!
address(dt) # Verify the address in memory is the same
## [1] "0x5fdab40"
dt
##    x  y
## 1: 1  6
## 2: 2 11
## 3: 3  8
## 4: 4  9
## 5: 5 10

The tracemem output shows that the data.frame is copied when a single element is changed (three tracemem lines in this session; the exact number depends on the R version), while set() modifies the data.table in place without making any copy!

For the functions in my question, I can call tracemem on the data.table or data.frame object, d, before passing it to foo to check whether any copies are made.
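As a rough sketch of that check, the snippet below runs a data.table through a foo method and compares its memory address before and after the call. The body of foo.data.table here (adding a column z with :=) is only a hypothetical stand-in for the real work, and the names before and after are made up for illustration.

```r
library(data.table)

foo <- function(d) UseMethod("foo")

# Hypothetical stand-in for the real method: adds a column by reference
foo.data.table <- function(d) {
  d[, z := x + y]   # := modifies d in place
  return(d)         # returns the same object, as in the question
}

someData <- data.table(x = 1:5, y = 6:10)
before <- address(someData)   # address before the call
tracemem(someData)            # a tracemem[...] line would signal a copy

someData <- foo(someData)     # same calling pattern as the old scripts

untracemem(someData)
after <- address(someData)    # address after the call
identical(before, after)      # expected to be TRUE when no copy was made
```

If no tracemem line appears and the two addresses match, returning the data.table and reassigning it did not copy it, so the old someData <- foo(someData) pattern should be safe to keep.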

ialm answered Oct 01 '22 20:10
