writing functions vs. line-by-line interpretation in an R workflow

Much has been written here about developing a workflow in R for statistical projects. The most popular workflow seems to be Josh Reich's LCFD model, with a main.R containing code:

source('load.R')
source('clean.R')
source('func.R')
source('do.R')

so that a single source('main.R') runs the entire project.

Q: Is there a reason to prefer this workflow to one in which the line-by-line interpretive work done in load.R, clean.R, and do.R is replaced by functions which are called by main.R?

I can't find the link now, but I had read somewhere on SO that when programming in R one must get over the desire to write everything in terms of function calls: that R was MEANT to be written in this line-by-line interpretive form.

Q: Really? Why?

I've been frustrated with the LCFD approach and am going to probably write everything in terms of function calls. But before doing this, I'd like to hear from the good folks of SO as to whether this is a good idea or not.

EDIT: The project I'm working on right now is to (1) read in a set of financial data, (2) clean it (quite involved), (3) estimate some quantity associated with the data using my estimator, (4) estimate that same quantity using traditional estimators, and (5) report results. My programs should be written in such a way that it's a cinch to do the work (1) for different empirical data sets, (2) for simulation data, or (3) using different estimators. ALSO, it should follow literate programming and reproducible research guidelines so that it's simple for a newcomer to the code to run the program, understand what's going on, and work out how to tweak it.
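As a rough sketch of the function-based alternative (every function name below is invented for illustration, not taken from the question), main.R might reduce to:

# main.R -- sketch of a function-based workflow; all names are illustrative.
source('func.R')  # assumed to define load_data(), clean_data(),
                  # estimate(), and report()

run_analysis <- function(input, estimator) {
  raw     <- load_data(input)              # (1) read in a data set
  cleaned <- clean_data(raw)               # (2) the involved cleaning step
  results <- estimate(cleaned, estimator)  # (3)/(4) plug in any estimator
  report(results)                          # (5) report results
}

# Swapping data sets or estimators is then a one-line change:
run_analysis('empirical.csv', estimator = my_estimator)
run_analysis('simulated.csv', estimator = classical_estimator)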

asked Mar 20 '11 by lowndrul


2 Answers

I think that any temporary stuff created in source'd files won't get cleaned up. If I do:

big <- 1000  # `big` must be defined somewhere; 1000 is just an example size
x <- matrix(runif(big^2), big, big)
z <- sum(x)

and source that as a file, x hangs around although I don't need it. But if I do:

ff <- function(big) {
  x <- matrix(runif(big^2), big, big)  # temporary; local to ff
  z <- sum(x)
  return(z)
}

and, instead of sourcing, do z <- ff(big) in my script, then the x matrix goes out of scope and so gets cleaned up.
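
A quick check (in a fresh session, where x was never created at the top level) illustrates this:

big <- 10     # small size, just for illustration
z <- ff(big)  # run the computation through the function
exists("x")   # FALSE: the temporary matrix lived only inside ff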

Functions enable neat little re-usable encapsulations and don't pollute anything outside themselves. In general, they don't have side effects. Your line-by-line scripts could be using global variables and names tied to the data set in current use, which makes them hard to reuse.

I sometimes work line-by-line, but as soon as I get more than about five lines I see that what I have really needs making into a proper reusable function, and more often than not I do end up re-using it.

answered Sep 23 '22 by Spacedman


I don't think there is a single answer. The best thing to do is keep the relative merits in mind and then pick an approach for that situation.

1) Functions. The advantage of not using functions is that all your variables are left in the workspace and you can examine them at the end. That may help you figure out what is going on if you have problems.

On the other hand, the advantage of well designed functions is that you can unit test them. That is, you can exercise them apart from the rest of the code, which makes problems easier to isolate. Also, when you use a function, modulo certain lower level constructs, you know that the results of one function won't affect the others unless they are passed out, and this may limit the damage that one function's erroneous processing can do to another's. You can use R's debug facility on your functions, and being able to single-step through them is an advantage.
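
For instance, a cleaning function can be exercised in isolation. Here is a minimal sketch using the testthat package; clean_data and its expected behaviour are invented for illustration:

library(testthat)

# Hypothetical function under test: drop rows containing missing values.
clean_data <- function(raw) raw[complete.cases(raw), ]

test_that("clean_data removes incomplete rows", {
  raw <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6))
  cleaned <- clean_data(raw)
  expect_equal(nrow(cleaned), 2)
  expect_false(anyNA(cleaned))
})

After that, debug(clean_data) lets you single-step through the function interactively.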

2) LCFD. Whether you should use the load/clean/func/do decomposition at all, regardless of whether it's done via source or via functions, is a second question. The problem with this decomposition is that you need to run one step just to be able to test out the next, so you can't really test them independently. From that viewpoint it's not the ideal structure.

On the other hand, it does have the advantage that you may be able to replace the load step independently of the other steps if you want to try the code on different data, and replace the other steps independently of the load and clean steps if you want to try different processing.
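
One way to soften that dependency (a sketch; the stage functions and file names are invented) is to have each stage persist its result, so that later stages can be re-run or tested without re-running the earlier ones:

# Hypothetical stage functions; replace the bodies with the real logic.
load_data  <- function(path) read.csv(path)
clean_data <- function(raw)  raw[complete.cases(raw), ]

dir.create("cache", showWarnings = FALSE)

raw <- load_data("data.csv")  # assumes a data.csv input file
saveRDS(raw, "cache/raw.rds")

# Later, or in a separate session, the clean step starts from the cache:
cleaned <- clean_data(readRDS("cache/raw.rds"))
saveRDS(cleaned, "cache/clean.rds")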

3) No. of Files. There may be a third question implicit in what you are asking: whether everything should be in one source file or several. The advantage of putting things in different source files is that you don't have to look at irrelevant items. In particular, if you have routines that are not being used or are not relevant to the function you are currently looking at, they won't interrupt the flow, since you can arrange for them to live in other files.

On the other hand, there may be advantages to putting everything in one file:

(a) deployment: you can just send someone that single file;

(b) editing convenience: you can put the entire program in a single editor session, which facilitates searching (you can search the whole program with the editor's functions without first working out which file a routine is in); also, successive undo commands move backward across all units of your program, and a single save saves the current state of all modules, since there is only one;

(c) speed: if you are working over a slow network, it may be faster to keep a single file on your local machine and just write it out occasionally, rather than going back and forth to the slow remote.

Note: one other thing to think about is that using packages may serve your needs better than sourcing files in the first place.
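
If you go the package route, base R can generate a starting skeleton from functions already in your workspace; a sketch, where myproject and f are placeholders:

f <- function(x) x + 1  # placeholder for one of your real functions
package.skeleton(name = "myproject", list = c("f"))
# Edit the generated R/ and man/ files, then build and install with:
#   R CMD build myproject && R CMD INSTALL myproject_1.0.tar.gz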

answered Sep 24 '22 by G. Grothendieck