Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Passing a data frame by reference

Tags:

R has pass-by-value semantics, which minimizes accidental side effects (a good thing). However, when code is organized into many functions/methods for reusability/readability/maintainability and when that code needs to manipulate large data structures through, e.g., big data frames, through a series of transformations/operations the pass-by-value semantics leads to a lot of copying of data around and much heap thrashing (a bad thing). For example, a data frame that takes 50Mb on the heap that is passed as a function parameter will be copied at a minimum the same number of times as the function call depth and the heap size at the bottom of the call stack will be N*50Mb. If the functions return a transformed/modified data frame from deep in the call chain then the copying goes up by another N.

The SO question What is the best way to avoid passing a data frame around? touches this topic but is phrased in a way that avoids directly asking the pass-by-reference question and the winning answer basically says, "yes, pass-by-value is how R works". That's not actually 100% accurate. R environments enable pass-by-reference semantics and OO frameworks such as proto use this capability extensively. For example, when a proto object is passed as a function argument, while its "magic wrapper" is passed by value, to the R developer the semantics are pass-by-reference.

It seems that passing a big data frame by reference would be a common problem and I'm wondering how others have approached it and whether there are any libraries that enable this. In my searching I have not discovered one.

If nothing is available, my approach would be to create a proto object that wraps a data frame. I would appreciate pointers about the syntactic sugar that should be added to this object to make it useful, e.g., overloading the $ and [[ operators, as well as any gotchas I should look out for. I'm not an R expert.

Bonus points for a type-agnostic pass-by-reference solution that integrates nicely with R, though my needs are exclusively with data frames.

like image 540
Sim Avatar asked Jun 26 '12 12:06

Sim


People also ask

Can you pass by reference in R?

No. Objects in assignment statements are immutable. R will copy the object not just the reference.

Does R pass by value?

Introduction. In principle R uses pass-by-value semantics in its function calls. If two different variable names x and y start out with equal values, there is (almost) no way for R-level operations on the value of x to change the value of y . Pass-by-value is usually used by functional programming languages.

What is R code for handling a data frame?

To create a data frame in R use data. frame() command and then pass each of the vectors you have created as arguments to the function. Example: R.

Can you include an R list as a column of a data frame?

Data frame columns can contain lists You can also create a data frame having a list as a column using the data. frame function, but with a little tweak. The list column has to be wrapped inside the function I.


1 Answers

The premise of the question is (partly) incorrect. R works as pass-by-promise and there is repeated copying in the manner you outline only when further assignments and alterations to the dataframe are made as the promise is passed on. So the number of copies will not be N*size where N is the stack depth, but rather where N is the number of levels where assignments are made. You are correct, however, that environments can be useful. I see on following the link that you have already found the 'proto' package. There is also a relatively recent introduction of a "reference class" sometimes referred to as "R5" where R/S3 was the original class system of S3 that is copied in R and R4 would be the more recent class system that seems to mostly support the BioConductor package development.

Here is a link to an example by Steve Lianoglou (in a thread discussing the merits of reference classes) of embedding an environment inside an S4 object to avoid the copying costs:

https://stat.ethz.ch/pipermail/r-help/2011-September/289987.html

Matthew Dowle's 'data.table' package creates a new class of data object whose access semantics using the "[" are different than those of regular R data.frames, and which is really working as pass-by-reference. It has superior speed of access and processing. It also can fall back on dataframe semantics since in later years such objects now inherit the 'data.frame' class.

You may also want to investigate Hesterberg's dataframe package.

like image 137
IRTFM Avatar answered Oct 12 '22 21:10

IRTFM