In the book Software for Data Analysis: Programming with R, John Chambers emphasizes that functions should generally not be written for their side effects; rather, a function should return a value without modifying any variables in its calling environment. Conversely, writing good scripts using data.table objects should specifically avoid object assignment with <-, typically used to store the result of a function.
First, a technical question. Imagine an R function called proc1 that accepts a data.table object x as its argument (in addition to, maybe, other parameters). proc1 returns NULL but modifies x using :=. From what I understand, calling proc1(x=x1) makes a copy of x1 just because of the way that promises work. However, as demonstrated below, the original object x1 is still modified by proc1. Why/how is this?
> require(data.table)
> x1 <- CJ(1:2, 2:3)
> x1
   V1 V2
1:  1  2
2:  1  3
3:  2  2
4:  2  3
> proc1 <- function(x){
+   x[,y:= V1*V2]
+   NULL
+ }
> proc1(x1)
NULL
> x1
   V1 V2 y
1:  1  2 2
2:  1  3 3
3:  2  2 4
4:  2  3 6
Furthermore, it seems that using proc1(x=x1) isn't any slower than doing the procedure directly on x1, indicating that my vague understanding of promises is wrong and that they work in a pass-by-reference sort of way:
> x1 <- CJ(1:2000, 1:500)
> x1[, paste0("V",3:300) := rnorm(1:nrow(x1))]
> proc1 <- function(x){
+   x[,y:= V1*V2]
+   NULL
+ }
> system.time(proc1(x1))
   user  system elapsed
   0.00    0.02    0.02
> x1 <- CJ(1:2000, 1:500)
> system.time(x1[,y:= V1*V2])
   user  system elapsed
   0.03    0.00    0.03
So, given that passing a data.table argument to a function doesn't add time, that makes it possible to write procedures for data.table objects, incorporating both the speed of data.table and the generalizability of a function. However, given what John Chambers said, that functions should not have side-effects, is it really "ok" to write this type of procedural programming in R? Why was he arguing that side effects are "bad"? If I'm going to ignore his advice, what sort of pitfalls should I be aware of? What can I do to write "good" data.table procedures?
Yes, the addition, modification, and deletion of columns in data.tables is done by reference. In a sense, it is a good thing because a data.table usually holds a lot of data, and it would be very memory and time consuming to reassign it all every time a change to it is made. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs or your environment won't be affected, and you can just focus on the function's output. It's simple, hence comfortable.
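To make the by-reference behaviour concrete, here is a minimal sketch. The helper names is_same_object and proc1_copy are made up for illustration, and the column names are set explicitly rather than relying on CJ's defaults. It assumes the documented data.table functions address() and copy(): passing the table to a function does not duplicate it, := then changes the caller's table, and an explicit copy() is one way to opt out of the side effect.

```r
library(data.table)

# Explicit column names so the example does not depend on CJ's auto-naming.
x1 <- CJ(V1 = 1:2, V2 = 2:3)

# Hypothetical helper: is the argument the same object as the caller's table?
is_same_object <- function(x, caller_address) {
  address(x) == caller_address
}
same <- is_same_object(x1, address(x1))  # no copy is made on the way in

# Same shape as proc1 in the question: modifies x by reference.
proc1 <- function(x) {
  x[, y := V1 * V2]   # := adds the column in place
  invisible(NULL)
}
proc1(x1)
modified <- "y" %in% names(x1)  # the caller's table now has column y

# Hypothetical side-effect-free variant: work on an explicit copy.
proc1_copy <- function(x) {
  x <- copy(x)        # deep copy, so the caller's table is left alone
  x[, y := V1 * V2]
  x[]
}
x2 <- CJ(V1 = 1:2, V2 = 2:3)
res <- proc1_copy(x2)
untouched <- !("y" %in% names(x2))
```

The copy() variant gives back the usual "inputs are never changed" guarantee, at the cost of duplicating the whole table, which is exactly the memory and time overhead that by-reference updates are meant to avoid.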
Of course it is ok to disregard John Chambers's advice if you know what you are doing. About writing "good" data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side-effects:
1. A function should have at most one side-effect, i.e., it should modify no more than one table.
2. A function that modifies a table should have no other ("real") output. Just run do.something.to(table) and not table <- do.something.to(table). If instead the function had another ("real") output, then when calling result <- do.something.to(table), it is easy to imagine how you may focus your attention on the output and forget that calling the function had a side effect on your table.
While "one output / no-side-effect" functions are the norm in R, the above rules allow for "one output or side-effect". If you agree that a side-effect is somehow a form of output, then you'll agree I am not bending the rules too much by loosely sticking to R's one-output functional programming style. Allowing functions to have multiple side-effects would be a little more of a stretch; not that you can't do it, but I would try to avoid it if possible.
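As a rough illustration of the "one output or side-effect" idea, here is a sketch of a procedure whose only effect is to add one column to one table. The function name add_product, the column y, and returning the table invisibly are my own choices for the example, not something prescribed by data.table.

```r
library(data.table)

# Hypothetical procedure: its single side effect is adding column y to dt.
# It has no other "real" output; the table is returned invisibly only so
# that calls can be chained.
add_product <- function(dt) {
  stopifnot(is.data.table(dt))
  dt[, y := V1 * V2]  # modifies the caller's table by reference
  invisible(dt)
}

tbl <- CJ(V1 = 1:2, V2 = 2:3)

# Call it for its side effect, without reassigning the result.
add_product(tbl)
cols <- names(tbl)
```

Calling add_product(tbl) on its own line, with no assignment, keeps it visible at the call site that the function is being used for its effect on tbl rather than for a return value.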