
Writing functions (procedures) for data.table objects

Tags:

r

data.table

In the book Software for Data Analysis: Programming with R, John Chambers emphasizes that functions should generally not be written for their side effects; rather, a function should return a value without modifying any variables in its calling environment. Conversely, good scripts using data.table objects should specifically avoid object assignment with <-, which is typically used to store the result of a function.

First, a technical question. Imagine an R function called proc1 that accepts a data.table object x as its argument (in addition to, maybe, other parameters). proc1 returns NULL but modifies x using :=. From what I understand, calling proc1(x=x1) makes a copy of x1 because of the way promises work. However, as demonstrated below, the original object x1 is still modified by proc1. Why/how is this?

    > require(data.table)
    > x1 <- CJ(1:2, 2:3)
    > x1
       V1 V2
    1:  1  2
    2:  1  3
    3:  2  2
    4:  2  3
    > proc1 <- function(x){
    + x[,y:= V1*V2]
    + NULL
    + }
    > proc1(x1)
    NULL
    > x1
       V1 V2 y
    1:  1  2 2
    2:  1  3 3
    3:  2  2 4
    4:  2  3 6
    >

Furthermore, it seems that using proc1(x=x1) isn't any slower than doing the procedure directly on x1, indicating that my vague understanding of promises is wrong and that they work in a pass-by-reference sort of way:

    > x1 <- CJ(1:2000, 1:500)
    > x1[, paste0("V",3:300) := rnorm(1:nrow(x1))]
    > proc1 <- function(x){
    + x[,y:= V1*V2]
    + NULL
    + }
    > system.time(proc1(x1))
       user  system elapsed
       0.00    0.02    0.02
    > x1 <- CJ(1:2000, 1:500)
    > system.time(x1[,y:= V1*V2])
       user  system elapsed
       0.03    0.00    0.03

So, since passing a data.table argument to a function doesn't add time, it is possible to write procedures for data.table objects that combine the speed of data.table with the generalizability of a function. However, given what John Chambers said, that functions should not have side effects, is it really "ok" to write this type of procedural code in R? Why was he arguing that side effects are "bad"? If I'm going to ignore his advice, what sort of pitfalls should I be aware of? What can I do to write "good" data.table procedures?

Michael asked Dec 07 '12 02:12




1 Answer

Yes, the addition, modification, and deletion of columns in data.tables is done by reference. In a sense, this is a good thing, because a data.table usually holds a lot of data, and copying it all every time a change is made would cost a lot of memory and time. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by behaving as pass-by-value by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs and your environment won't be affected, and you can just focus on the function's output. It's simple, hence comfortable.
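To make the by-reference behaviour concrete, here is a minimal sketch. The helper names where_is and add_y are made up for illustration; address() and copy() are data.table functions. It suggests that passing a table to a function does not copy it, and that an explicit copy() is one way to shield the caller's table from :=.

```r
library(data.table)

x1 <- CJ(V1 = 1:2, V2 = 2:3)

# Hypothetical helper: reports the memory address of its argument.
where_is <- function(x) address(x)

# Passing x1 into a function does not copy it:
# the address seen inside the function matches the one outside.
same_object <- identical(where_is(x1), address(x1))

# Hypothetical procedure in the style of proc1: adds column y by reference.
add_y <- function(x) {
  x[, y := V1 * V2]
  invisible(NULL)
}

add_y(x1)
"y" %in% names(x1)   # TRUE: the caller's table was modified

# To protect the caller's table, hand the procedure an explicit copy.
x2 <- CJ(V1 = 1:2, V2 = 2:3)
add_y(copy(x2))
"y" %in% names(x2)   # FALSE: only the copy was modified
```

The copy() call costs the memory and time that by-reference updates are meant to save, so it is probably best kept for cases where the original really must stay untouched.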

Of course it is ok to disregard John Chambers's advice if you know what you are doing. About writing "good" data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side-effects:

  • a function should not modify more than one table, i.e., modifying that table should be the only side-effect,
  • if a function modifies a table, then make that table the output of the function. Of course, you won't want to re-assign it: just run do.something.to(table) and not table <- do.something.to(table). If instead the function had another ("real") output, then when calling result <- do.something.to(table), it is easy to imagine how you may focus your attention on the output and forget that calling the function had a side effect on your table.
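As a rough sketch of the two rules above (add_product and tab are hypothetical names, not part of data.table), a procedure can modify exactly one table and hand that same table back invisibly:

```r
library(data.table)

# Hypothetical procedure following both rules: it touches exactly one table,
# and that table is also its return value.
add_product <- function(dt) {
  dt[, y := V1 * V2]   # the only side effect: a column added by reference
  invisible(dt)        # return the same table without auto-printing it
}

tab <- CJ(V1 = 1:2, V2 = 2:3)

add_product(tab)       # no re-assignment needed
tab$y                  # 2 3 4 6

# Because the table itself is the output, further steps can be chained.
add_product(tab)[, z := y + 1L]
names(tab)             # "V1" "V2" "y" "z"
```

Returning the table invisibly keeps the call quiet at the console while still allowing chaining; whether to return it at all is a design choice, and the point is only that the function's output and its side effect refer to the same object.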

While "one output / no-side-effect" functions are the norm in R, the above rules allow for "one output or side-effect". If you agree that a side-effect is somehow a form of output, then you'll agree I am not bending the rules too much by loosely sticking to R's one-output functional programming style. Allowing functions to have multiple side-effects would be a little more of a stretch; not that you can't do it, but I would try to avoid it if possible.

flodel answered Oct 19 '22 20:10
