
How to organize large R programs?


The standard answer is to use packages -- see the Writing R Extensions manual as well as different tutorials on the web.

It gives you

  • a quasi-automatic way to organize your code by topic
  • strong encouragement to write help files, making you think about the interface
  • a lot of sanity checks via R CMD check
  • a chance to add regression tests
  • a means for namespaces.

Just running source() over code works for really short snippets. Everything else should be in a package -- even if you do not plan to publish it, since you can keep internal packages in internal repositories.
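
As a rough sketch of that workflow (the package name "myutils" is hypothetical): once you have a package directory, you can build and check it from the shell, then install it locally without ever publishing it.

# from the shell:
#   R CMD build myutils
#   R CMD check myutils_1.0.tar.gz
# then, back in R, install the built package from the local tarball:
install.packages("myutils_1.0.tar.gz", repos = NULL, type = "source")
library(myutils)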

As for the 'how to edit' part, the R Internals manual has excellent R coding standards in Section 6. Otherwise, I tend to use the defaults in Emacs' ESS mode.

Update 2008-Aug-13: David Smith just blogged about the Google R Style Guide.


I like putting different pieces of functionality in their own files.

But I don't like R's package system. It's rather hard to use.

I prefer a lightweight alternative: place a file's functions inside an environment (what every other language calls a "namespace") and attach it. For example, I made a 'util' group of functions like so:

util = new.env()

util$bgrep = function [...]

util$timeit = function [...]

while("util" %in% search())
  detach("util")
attach(util)

This is all in a file util.R. When you source it, you get the environment 'util' so you can call util$bgrep() and such; but furthermore, the attach() call makes it so just bgrep() and such work directly. If you didn't put all those functions in their own environment, they'd pollute the interpreter's top-level namespace (the one that ls() shows).
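
To make that concrete, a hypothetical session (assuming util.R defines timeit as an expression timer, which the snippet above elides) might look like:

source("util.R")
util$timeit(Sys.sleep(0.1))   # qualified call through the environment
timeit(Sys.sleep(0.1))        # bare call works because of attach()
ls("util")                    # lists bgrep, timeit, ... without cluttering ls()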

I was trying to simulate Python's system, where every file is a module. That would be better to have, but this seems OK.


This might sound a little obvious, especially if you're a programmer, but here's how I think about logical and physical units of code.

I don't know if this is your case, but when I'm working in R, I rarely start out with a large complex program in mind. I usually start in one script and separate code into logically separable units, often using functions. Data manipulation and visualization code get placed in their own functions, etc. And such functions are grouped together in one section of the file (data manipulation at the top, then visualization, etc). Ultimately you want to think about how to make it easier for you to maintain your script and lower the defect rate.

How fine- or coarse-grained you make your functions will vary, and there are various rules of thumb: e.g., 15 lines of code, or "a function should be responsible for doing one task which is identified by its name", etc. Your mileage will vary. Since R doesn't support call-by-reference, I'm usually wary of making my functions too fine-grained when it involves passing data frames or similar structures around. But this may be overcompensation for some silly performance mistakes I made when I first started out with R.
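
For anyone unfamiliar with R's copy-on-modify semantics, a minimal illustration of why passing data frames around feels costly (the function name is made up):

double_x <- function(d) { d$x <- d$x * 2; d }  # modifies a copy, not the original
df <- data.frame(x = 1:3)
double_x(df)   # returns the modified copy
df$x           # still 1 2 3; you must reassign: df <- double_x(df)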

When to extract logical units into their own physical units (like source files and bigger groupings like packages)? I have two cases. First, if the file gets too large and scrolling around among logically unrelated units is an annoyance. Second, if I have functions that can be reused by other programs. I usually start out by placing some grouped unit, say data manipulation functions, into a separate file. I can then source this file from any other script.
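
For instance (file and function names hypothetical), the extracted file is then pulled in with a single line from any script that needs it:

source("data_manipulation.R")   # defines, say, clean_dates() and merge_panels()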

If you're going to deploy your functions, then you need to start thinking about packages. I don't deploy R code in production or for re-use by others for various reasons (briefly: org culture prefers other languages, concerns about performance, GPL, etc.). Also, I tend to constantly refine and add to my collections of sourced files, and I'd rather not deal with packages when I make a change. So you should check out the other package-related answers, like Dirk's, for more details on this front.

Finally, I think your question isn't necessarily particular to R. I would really recommend reading Code Complete by Steve McConnell which contains a lot of wisdom about such issues and coding practices at large.


My concise answer:

  1. Write your functions carefully, identifying sufficiently general inputs and outputs;
  2. Limit the use of global variables;
  3. Use S3 objects and, where appropriate, S4 objects (see the sketch after this list);
  4. Put the functions in packages, especially when your functions are calling C/Fortran.
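
On point 3, a minimal S3 sketch (the names are mine, not the answer's):

new_result <- function(est, se)
  structure(list(est = est, se = se), class = "result")
print.result <- function(x, ...)
  cat("estimate:", x$est, " se:", x$se, "\n")
r <- new_result(1.2, 0.3)
r   # auto-printing dispatches to print.result()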

I believe R is used more and more in production, so the need for reusable code is greater than before. I find the interpreter much more robust than before. There is no doubt that R is 100-300x slower than C, but usually the bottleneck is concentrated around a few lines of code, which can be delegated to C/C++. I think it would be a mistake to delegate the strengths of R in data manipulation and statistical analysis to another language. In these instances, the performance penalty is low, and in any case well worth the savings in development effort. If execution time alone were what mattered, we'd all be writing assembler.
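
As one concrete way to delegate such a bottleneck (using the Rcpp package, which is my suggestion rather than the answer's), a hot loop can move to C++ in a few lines:

library(Rcpp)
cppFunction('
double sumsq(NumericVector x) {
  double s = 0;
  for (int i = 0; i < x.size(); i++) s += x[i] * x[i];
  return s;
}')
sumsq(rnorm(1e6))   # the loop now runs at compiled speed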


I've been meaning to figure out how to write packages but haven't invested the time. For each of my mini-projects I keep all of my low-level functions in a folder called 'functions/', and source them into a separate namespace that I explicitly create.

The following lines of code will create an environment named "myfuncs" on the search path if it doesn't already exist (using attach), and populate it with the functions contained in the .r files in my 'functions/' directory (using sys.source). I usually put these lines at the top of my main script meant for the "user interface" from which high-level functions (invoking the low-level functions) are called.

if( length(grep("^myfuncs$",search()))==0 )
  attach(NULL,name="myfuncs",pos=2)   # attach an empty environment named "myfuncs"
for( f in list.files("functions","\\.r$",full.names=TRUE) )
  sys.source(f,pos.to.env(grep("^myfuncs$",search())))

When you make changes you can always re-source it with the same lines, or use something like

evalq(f <- function(x) x * 2, pos.to.env(grep("^myfuncs$",search())))

to evaluate additions/modifications in the environment you created.

It's kludgy, I know, but it avoids having to be too formal about it (though if you get the chance I do encourage the package system -- hopefully I will migrate that way in the future).

As for coding conventions, this is the only thing I've seen regarding aesthetics (I like them and follow them loosely, though I don't use too many curly braces in R):

http://www1.maths.lth.se/help/R/RCC/

There are other "conventions", such as using [,drop=FALSE] and <- as the assignment operator, suggested in various presentations (usually keynotes) at the useR! conferences, but I don't think any of these are strict (though [,drop=FALSE] is useful in programs where you are not sure of the input you expect).
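
A quick illustration of why [,drop=FALSE] matters when you are not sure of the input's shape:

m <- matrix(1:6, nrow = 2)
m[1, ]                # silently drops to a plain vector
m[1, , drop = FALSE]  # stays a 1-row matrix, so downstream matrix code keeps working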


Count me as another person in favor of packages. I'll admit to being pretty poor on writing man pages and vignettes until if/when I have to (i.e., when releasing), but packages make for a real handy way to bundle source code. Plus, if you get serious about maintaining your code, the points that Dirk brings up all come into play.


I also agree. Use the package.skeleton() function to get started. Even if you think your code may never be run again, it may help motivate you to create more general code that could save you time later.
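
For instance, a minimal sketch (the function and package names are made up): define a couple of functions and let package.skeleton() lay out the directory structure.

bgrep  <- function(pattern, x) grep(pattern, x, value = TRUE)
timeit <- function(expr) system.time(expr)
package.skeleton(name = "mytools", list = c("bgrep", "timeit"))
# creates mytools/ with R/, man/, DESCRIPTION, ... ready for R CMD check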

As for assigning to the global environment, the <<- operator makes that easy, though its use is discouraged.
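
For illustration, <<- assigns in an enclosing environment rather than the local one; from the top level that means the global environment. A more defensible use is inside a closure (a minimal sketch):

make_counter <- function() {
  n <- 0
  function() {
    n <<- n + 1   # updates n in the enclosing frame, not a local copy
    n
  }
}
count <- make_counter()
count()   # 1
count()   # 2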