writing functions vs. line-by-line interpretation in an R workflow

Much has been written here about developing a workflow in R for statistical projects. The most popular workflow seems to be Josh Reich's LCFD model, with a main.R containing code:

source('load.R')
source('clean.R')
source('func.R')
source('do.R')

so that a single source('main.R') runs the entire project.

Q: Is there a reason to prefer this workflow to one in which the line-by-line interpretive work done in load.R, clean.R, and do.R is replaced by functions which are called by main.R?

I can't find the link now, but I had read somewhere on SO that when programming in R one must get over the desire to write everything in terms of function calls: that R was MEANT to be written in this line-by-line interpretive form.

Q: Really? Why?

I've been frustrated with the LCFD approach and am going to probably write everything in terms of function calls. But before doing this, I'd like to hear from the good folks of SO as to whether this is a good idea or not.

EDIT: The project I'm working on right now is to (1) read in a set of financial data, (2) clean it (quite involved), (3) estimate some quantity associated with the data using my estimator, (4) estimate that same quantity using traditional estimators, and (5) report results. My programs should be written in such a way that it's a cinch to do the work (1) for different empirical data sets, (2) for simulation data, or (3) using different estimators. ALSO, it should follow literate programming and reproducible research guidelines so that it's simple for a newcomer to the code to run the program, understand what's going on, and work out how to tweak it.
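As a rough sketch of the function-based alternative (every function name below is invented for illustration, not taken from the question), main.R might reduce to:

# main.R -- sketch of a function-based workflow; all names are illustrative.
source('func.R')  # assumed to define load_data(), clean_data(),
                  # estimate(), and report()

run_analysis <- function(input, estimator) {
  raw     <- load_data(input)              # (1) read in a data set
  cleaned <- clean_data(raw)               # (2) the involved cleaning step
  results <- estimate(cleaned, estimator)  # (3)/(4) plug in any estimator
  report(results)                          # (5) report results
}

# Swapping data sets or estimators is then a one-line change:
run_analysis('empirical.csv', estimator = my_estimator)
run_analysis('simulated.csv', estimator = classical_estimator)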

asked Mar 20 '11 by lowndrul


2 Answers

I think that any temporary stuff created in source'd files won't get cleaned up. If I do:

big <- 1000  # `big` must be defined somewhere; 1000 is just an example size
x <- matrix(runif(big^2), big, big)
z <- sum(x)

and source that as a file, x hangs around although I don't need it. But if I do:

ff <- function(big) {
  x <- matrix(runif(big^2), big, big)  # temporary; local to ff
  z <- sum(x)
  return(z)
}

and, instead of sourcing, do z <- ff(big) in my script, then the x matrix goes out of scope and so gets cleaned up.
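
A quick check (in a fresh session, where x was never created at the top level) illustrates this:

big <- 10     # small size, just for illustration
z <- ff(big)  # run the computation through the function
exists("x")   # FALSE: the temporary matrix lived only inside ff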

Functions enable neat little re-usable encapsulations and don't pollute anything outside themselves. In general, they don't have side effects. Your line-by-line scripts could be using global variables and names tied to the data set in current use, which makes them hard to reuse.

I sometimes work line-by-line, but as soon as I get more than about five lines I see that what I have really needs making into a proper reusable function, and more often than not I do end up re-using it.

answered Sep 23 '22 by Spacedman


I don't think there is a single answer. The best thing to do is keep the relative merits in mind and then pick an approach for that situation.

1) Functions. The advantage of not using functions is that all your variables are left in the workspace and you can examine them at the end. That may help you figure out what is going on if you have problems.

On the other hand, the advantage of well designed functions is that you can unit test them. That is, you can exercise them apart from the rest of the code, which makes problems easier to isolate. Also, when you use a function, modulo certain lower level constructs, you know that the results of one function won't affect the others unless they are passed out, and this may limit the damage that one function's erroneous processing can do to another's. You can use R's debug facility on your functions, and being able to single-step through them is an advantage.
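
For instance, a cleaning function can be exercised in isolation. Here is a minimal sketch using the testthat package; clean_data and its expected behaviour are invented for illustration:

library(testthat)

# Hypothetical function under test: drop rows containing missing values.
clean_data <- function(raw) raw[complete.cases(raw), ]

test_that("clean_data removes incomplete rows", {
  raw <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6))
  cleaned <- clean_data(raw)
  expect_equal(nrow(cleaned), 2)
  expect_false(anyNA(cleaned))
})

After that, debug(clean_data) lets you single-step through the function interactively.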

2) LCFD. Whether you should use the load/clean/func/do decomposition at all, regardless of whether it's done via source or via functions, is a second question. The problem with this decomposition is that you need to run one step just to be able to test out the next, so you can't really test them independently. From that viewpoint it's not the ideal structure.

On the other hand, it does have the advantage that you may be able to replace the load step independently of the other steps if you want to try the code on different data, and replace the other steps independently of the load and clean steps if you want to try different processing.
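
One way to soften that dependency (a sketch; the stage functions and file names are invented) is to have each stage persist its result, so that later stages can be re-run or tested without re-running the earlier ones:

# Hypothetical stage functions; replace the bodies with the real logic.
load_data  <- function(path) read.csv(path)
clean_data <- function(raw)  raw[complete.cases(raw), ]

dir.create("cache", showWarnings = FALSE)

raw <- load_data("data.csv")  # assumes a data.csv input file
saveRDS(raw, "cache/raw.rds")

# Later, or in a separate session, the clean step starts from the cache:
cleaned <- clean_data(readRDS("cache/raw.rds"))
saveRDS(cleaned, "cache/clean.rds")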

3) No. of Files. There may be a third question implicit in what you are asking: whether everything should be in one source file or several. The advantage of putting things in different source files is that you don't have to look at irrelevant items. In particular, if you have routines that are not being used or are not relevant to the function you are currently looking at, they won't interrupt the flow, since you can arrange for them to live in other files.

On the other hand, there may be advantages to putting everything in one file:

(a) deployment: you can just send someone that single file;

(b) editing convenience: you can put the entire program in a single editor session, which facilitates searching (you can search the whole program with the editor's functions without first working out which file a routine is in); also, successive undo commands move backward across all units of your program, and a single save saves the current state of all modules, since there is only one;

(c) speed: if you are working over a slow network, it may be faster to keep a single file on your local machine and just write it out occasionally, rather than going back and forth to the slow remote.

Note: one other thing to think about is that using packages may serve your needs better than sourcing files in the first place.
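
If you go the package route, base R can generate a starting skeleton from functions already in your workspace; a sketch, where myproject and f are placeholders:

f <- function(x) x + 1  # placeholder for one of your real functions
package.skeleton(name = "myproject", list = c("f"))
# Edit the generated R/ and man/ files, then build and install with:
#   R CMD build myproject && R CMD INSTALL myproject_1.0.tar.gz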

answered Sep 24 '22 by G. Grothendieck