What core packages should a professional R developer have, and why? [closed]

I have written way too many packages, so to keep things manageable I've invested a lot of time in infrastructure packages: packages that help me make my code more robust and help make it easier for others to use. These include:

  • roxygen2 (with Manuel Eugster and Peter Danenberg), which allows you to keep documentation next to the function it documents, which makes it much more likely that I'll keep it up to date. roxygen2 also has a number of new features designed to minimise documentation duplication: templates (@template), parameter inheritance (@inheritParams), and function families (@family), to name a few.

  • testthat automates the testing of my code. This is becoming more and more important as I have less and less time to code: automated tests remember how the function should work, even when I don't.

  • devtools automates many common development tasks (as Andrie mentioned). The eventual goal for devtools is for it to act like an R CMD check that runs continuously in the background and notifies you the instant that something goes wrong.

  • profr, particularly the unreleased interactive explorer, makes it easy for me to find bottlenecks in my code.

  • helpr (with Barret Schloerke), which will soon power http://had.co.nz/ggplot2, provides an elegant html interface to R documentation.
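As a rough illustration of the roxygen2 and testthat points above, here is a minimal sketch. The functions rescale01() and rescale_range() are hypothetical, made up for this example; the tags are the ones mentioned above (@inheritParams, @family) plus the standard @param, @return and @export. The test at the end only runs if testthat is installed.

```r
#' Rescale a numeric vector to the unit interval
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be ignored when finding the range?
#' @return A numeric vector the same length as `x`, with values in [0, 1].
#' @family rescaling functions
#' @export
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

#' Rescale a numeric vector to an arbitrary interval
#'
#' @inheritParams rescale01
#' @param to A numeric vector of length two giving the target range.
#' @family rescaling functions
#' @export
rescale_range <- function(x, to = c(0, 1), na.rm = TRUE) {
  # x and na.rm are documented once, in rescale01(), and inherited here.
  to[1] + rescale01(x, na.rm = na.rm) * (to[2] - to[1])
}

# An automated test remembers how the functions should behave.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("rescaling maps the range onto the target interval", {
    testthat::expect_equal(rescale01(c(2, 4, 6)), c(0, 0.5, 1))
    testthat::expect_equal(rescale_range(c(2, 4, 6), to = c(10, 20)), c(10, 15, 20))
  })
}
```

In a real package the roxygen comments would live in a file under R/ and the test in its own file under tests/; they are shown together here only to keep the sketch self-contained.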

Useful R functions:

  • apropos: I'm always forgetting the names of useful functions, and apropos helps me find them, even if I only remember a fragment
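A quick sketch of what that looks like in practice. The search fragments are arbitrary examples; the first argument is treated as a regular expression, and the mode argument restricts the results to a given kind of object.

```r
# Find anything on the search path whose name contains a remembered fragment.
apropos("read.csv")

# Anchor the pattern and keep only functions.
readers <- apropos("^read\\.", mode = "function")
head(readers)
```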

Outside of R:

  • I use TextMate to edit R (and other) files, but I don't think it's really that important. Pick one and learn all its nooks and crannies.

  • Spend some time to learn the command line. Anything you can do to automate any part of your workflow will pay off in the long run. Running R from the command line leads to a natural process where each project has its own instance of R; I often have 2-5 instances of R running at a time.

  • Use version control. I like Git and GitHub. Again, it doesn't matter exactly which system you use, but master it!

Things I wish R had:

  • code coverage tools
  • a dependency management framework like rake or jake
  • better memory profiling tools
  • a metadata standard for describing data frames (and other data sources)
  • better tools for describing and rendering tables in a variety of output formats
  • a package for markdown rendering

As I recall this has been asked before and my answer remains the same: Emacs.

Emacs

  • can do just about anything you want to do with R thanks to ESS, including
    • code execution of various snippets (line, region, function, buffer, ...)
    • inspection of workspaces,
    • display of variables,
    • multiple R sessions and easy switching between them
    • transcript mode for re-running (parts of) previous sessions
    • access to the help system
    • and much more
  • handles LaTeX with similar ease via the AUCTeX mode, which helps with Sweave for R
  • has modes for whichever other programming languages you combine with R, be it C/C++, Python, shell, SQL, ... covering automatic indentation and colour highlighting
  • can access databases with sql-* mode
  • can work remotely with tramp mode: access remote files as if they were local (uses ssh/scp)
  • can be run as a daemon, which makes it stateful so you can reconnect to the same Emacs session, be it on the workstation under X11 (or equivalent) or remotely via ssh (with or without X11) or screen
  • has org-mode, which together with babel provides a powerful Sweave alternative, as discussed in this paper on workflow apps for (social) scientists
  • can run a shell via M-x shell and/or M-x eshell, has nice directory access functionality with dired mode, has ssh mode for remote access
  • interfaces with all source code repositories with ease via specific modes (e.g. psvn for svn)
  • is cross-platform just like R so you have similar user-interface experiences on all relevant operating systems
  • is widely used, widely available and under active development for both code and extensions, see the emacswiki.org site for the latter
  • <tongueInCheek>is not Eclipse and does not require Java</tongueInCheek>

You can of course combine it with whichever CRAN packages you like: RUnit or testthat, the different profiling support packages, the debug package, ...

Additional tools that are useful:

  • R CMD check really is your friend as this is what CRAN uses to decide whether you are "in or out"; use it and trust it
  • the tests/ directory can offer a simplified version of unit tests: save reference output (from a prior R CMD check run) and later runs are compared against it. This is useful, but proper unit tests are better
  • particularly for packages with object code, I prefer to launch fresh R sessions and littler makes that easy: r -lfoo -e'bar(1, "ab")' starts an R session, loads the foo package and evaluates the given expression (here a function bar() with two arguments). This, combined with R CMD INSTALL, provides a full test cycle.
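To make the tests/ point concrete, here is a minimal sketch of a plain test script. The file name tests/basic.R and the body of bar() are hypothetical; in a real package the script would begin with library(foo) rather than defining bar() itself.

```r
# Sketch of a file such as tests/basic.R (hypothetical name).
# bar() is defined inline only so that this snippet runs on its own.
bar <- function(n, s) paste(rep(s, n), collapse = "")

# R CMD check runs the .R files in tests/ and reports a failure if one of
# them stops with an error, so stopifnot() is enough for a simple check.
stopifnot(
  identical(bar(1, "ab"), "ab"),
  identical(bar(2, "ab"), "abab")
)

# Printed output can be compared against a saved tests/basic.Rout.save file.
print(bar(3, "ab"))
```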

Knowledge of, and ability to use, the basic R debugging tools is an essential first step in learning to quickly debug R code. If you know how to use the basic tools, you can debug code anywhere without needing the extra tools provided in add-on packages.

traceback() allows you to see the call stack leading to an error:

foo <- function(x) {
    d <- bar(x)
    x[1]
}
bar <- function(x) {
    stopifnot(is.matrix(x))
    dim(x)
}
foo(1:10)
traceback()

yields:

> foo(1:10)
Error: is.matrix(x) is not TRUE
> traceback()
4: stop(paste(ch, " is not ", if (length(r) > 1L) "all ", "TRUE", 
       sep = ""), call. = FALSE)
3: stopifnot(is.matrix(x))
2: bar(x)
1: foo(1:10)

So we can clearly see that the error happened in function bar(); we've narrowed down the scope of the bug hunt. But what if the code generates warnings, not errors? That can be handled by turning warnings into errors via the warn option:

options(warn = 2)

will turn warnings into errors. You can then use traceback() to track them down.
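A small non-interactive sketch of the same idea: with warn = 2 a call that would normally only warn now signals an error, which tryCatch() can catch. The coercion call is just an arbitrary example of something that warns.

```r
old <- options(warn = 2)   # promote warnings to errors, remembering the old setting

res <- tryCatch(
  as.integer("not a number"),            # normally returns NA with a warning
  error = function(e) conditionMessage(e)
)
res   # the message of the error that the warning was converted into

options(old)   # restore the previous warning level
```

Saving the return value of options() and restoring it afterwards avoids leaving the session in the stricter mode by accident.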

Linked to this is getting R to recover from an error in the code so you can debug what went wrong. options(error = recover) will drop us into a debugger frame whenever an error is raised:

> options(error = recover)
> foo(1:10)
Error: is.matrix(x) is not TRUE

Enter a frame number, or 0 to exit   

1: foo(1:10)
2: bar(x)
3: stopifnot(is.matrix(x))

Selection: 2
Called from: bar(x)
Browse[1]> x
 [1]  1  2  3  4  5  6  7  8  9 10
Browse[1]> is.matrix(x)
[1] FALSE

You see we can drop into each frame on the call stack and see how the functions were called, what the arguments are etc. In the above example, we see that bar() was passed a vector not a matrix, hence the error. options(error = NULL) resets this behaviour to normal.

Another key function is trace(), which allows you to insert debugging calls into an existing function. The benefit of this is that you can tell R to start debugging at a particular step in the function body (the at argument indexes the expressions of the body, as listed by as.list(body(lm)), rather than lines of the source file):

> x <- 1:10; y <- rnorm(10)
> trace(lm, tracer = browser, at = 10) ## debug from step 10 of the function body
Tracing function "lm" in package "stats"
[1] "lm"
> lm(y ~ x)
Tracing lm(y ~ x) step 10 
Called from: eval(expr, envir, enclos)
Browse[1]> n ## must press n <return> to get the next line step
debug: mf <- eval(mf, parent.frame())
Browse[2]> 
debug: if (method == "model.frame") return(mf) else if (method != "qr") warning(gettextf("method = '%s' is not supported. Using 'qr'", 
    method), domain = NA)
Browse[2]> 
debug: if (method != "qr") warning(gettextf("method = '%s' is not supported. Using 'qr'", 
    method), domain = NA)
Browse[2]> 
debug: NULL
Browse[2]> Q
> untrace(lm)
Untracing function "lm" in package "stats"

This allows you to insert the debugging calls at the right point in the code without having to step through the preceding function calls.
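To pick a value for at, it helps to list the steps of the function body first. A minimal sketch, using a small hypothetical function foo() rather than lm() so the step numbers are easy to follow:

```r
foo <- function(x) {
  d <- dim(x)
  n <- length(x)
  n / 2
}

# Each element of the body is one step; element 1 is the opening `{`.
steps <- as.list(body(foo))
length(steps)
steps[[3]]   # the expression that at = 3 would stop in front of

# In an interactive session, break just before that step:
# trace(foo, tracer = browser, at = 3)
# foo(1:10)
# untrace(foo)
```

The trace() calls are left commented out because browser() needs an interactive session.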

If you want to step through a function as it is executing, then debug(foo) will turn on the debugger for function foo(), whilst undebug(foo) will turn off the debugger.
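A short sketch of switching the debugger on and off, with isdebugged() used to confirm the state. The function foo() here is a hypothetical stand-in.

```r
foo <- function(x) {
  y <- x^2
  sum(y)
}

debug(foo)                     # every call to foo() would now enter the browser
was_on <- isdebugged(foo)      # TRUE
undebug(foo)                   # back to normal execution
still_on <- isdebugged(foo)    # FALSE

foo(1:3)                       # runs normally again

# debugonce(foo) is a convenient variant: it debugs only the next call.
```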

A key point about these options is that I haven't needed to modify or edit any source code to insert debugging calls. I can try things out and see what the problem is directly from the session where the error has occurred.

For a different take on debugging in R, see Mark Bravington's debug package on CRAN.