
Safe method for updating R packages - is "hot-swapping" possible?

I have encountered this problem a few times and have not been able to figure out any solution but the trivial one (see below).

Suppose a computer is running 2+ instances of R, due to either 2+ users or 1 user running multiple processes, and one instance executes update.packages(). Several times I've seen this foul up the other instance badly. The packages being updated don't change functionality in any way that affects the computation, but somehow a big problem arises.

The trivial solution (Solution 0) is to terminate all instances of R while update.packages() executes. This has at least two problems. First, one has to terminate the R instances. Second, one may not even be able to identify where those instances are running (see Update 1).

Assuming that the behavior of the code being executed won't change (e.g. the package updates are all beneficial - they only fix bugs, improve speed, reduce RAM usage, and grant unicorns), is there some way to hot-swap in a new version of a package with less impact on other processes?

I have two more candidate solutions, outside of R:

Solution 1 is to use a temporary library path, then delete the old library and move the new one into its place. The drawback is that the delete and move take some time, during which nothing is available.
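
To make the downtime concrete, here is a minimal sketch of Solution 1. The paths are hypothetical, and it assumes a CRAN mirror is already configured:

    ## Hypothetical paths; "staging" is a scratch copy of the live library.
    lib     <- "/opt/R/library"
    staging <- "/opt/R/library-staging"

    dir.create(staging)
    file.copy(list.files(lib, full.names = TRUE), staging, recursive = TRUE)

    ## Update packages inside the staging copy; the live library is untouched.
    update.packages(lib.loc = staging, ask = FALSE)

    ## The swap: nothing is available between these two calls.
    unlink(lib, recursive = TRUE)
    file.rename(staging, lib)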

Solution 2 is to use symlinks to point to a library (or library hierarchy) and simply overwrite the symlink with a pointer to a new library containing the updated packages. That seems to incur even less package downtime - just the time it takes the OS to overwrite a symlink. The downside is that it requires much more care in managing symlinks, and it is platform-specific.

I suspect that solution #1 could be modified to be like #2 by clever use of .libPaths(), but it seems this would require not calling update.packages() and instead writing a new updater that finds the outdated packages, installs them to a temporary library, and then updates the library paths; a sketch of such an updater follows. The upside is that one could constrain an existing process to the .libPaths() it had when it started (i.e. changing the library paths R knows about might not be propagated to instances that are already running, without some explicit intervention within each instance).
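
Here is a rough sketch of such an updater, assuming every session's .libPaths() points at a symlink; the function name and paths are hypothetical, and the rename trick relies on rename(2) being atomic, so it is POSIX-only:

    ## "link" is the path sessions use in .libPaths(); it is a symlink into
    ## "base", where immutable, timestamped libraries live.
    update_via_symlink <- function(base = "/opt/R/libs",
                                   link = "/opt/R/library") {
      cur <- Sys.readlink(link)                 # library currently live
      new <- file.path(base, format(Sys.time(), "lib-%Y%m%d-%H%M%S"))
      dir.create(new, recursive = TRUE)
      file.copy(list.files(cur, full.names = TRUE), new, recursive = TRUE)

      ## Install fresh versions of outdated packages into the new library only.
      out <- old.packages(lib.loc = new)
      if (!is.null(out)) install.packages(rownames(out), lib = new)

      ## Repoint the symlink: create it under a temporary name, then rename
      ## it over the old one, so there is no moment without a library.
      tmp <- paste0(link, ".new")
      file.symlink(new, tmp)
      file.rename(tmp, link)
      invisible(new)
    }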


Update 1. In the example scenario, the two competing R instances are on the same machine. This is not a requirement: as far as I understand the update process, if the two instances share the same libraries, i.e. the same directories on a shared drive, then the update can still cause problems even if the other instance of R is on another machine. So one could accidentally kill an R process and not even see it.

asked Jan 26 '12 by Iterator


3 Answers

In a production environment, you probably want to keep at least two versions, the current and the previous one, in order to be able to quickly switch back to the old one in case of a problem. Nothing would be overwritten or deleted. It is easier to do that for the whole R ecosystem: you would have several directories, say "R-2.14.1-2011-12-22", "R-2.14.1-2012-01-27", etc., each containing everything (the R executables and all packages). Those directories would never be updated: if an update is needed, a new directory would be created. (Some file systems provide "snapshots" that would allow you to have many very similar directories without undue disk space usage.)

Switching from one version to the other could be done on the user side, when users launch R, either by replacing the R executable with a script that would use the correct version, or by setting their PATH environment variable to point to the desired version. This ensures that a given session always sees the same version of everything.
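
For the library half of that switch, a session can pin itself at startup. Here is a hypothetical ~/.Rprofile; the R_SNAPSHOT variable and the directory layout are assumptions for illustration:

    ## Pin this session to one immutable snapshot for its whole lifetime.
    local({
      snap <- Sys.getenv("R_SNAPSHOT", "/opt/R/R-2.14.1-2012-01-27/library")
      .libPaths(snap)  # every library() call in this session resolves here
    })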

answered by Vincent Zoonekynd


My strong guess is that there's no way around this.

Especially when a package includes compiled code, you can't remove and replace the DLL while it's in use and expect it to still work. All of the pointers into the DLL used by R calls to those functions will ask for a particular memory location and find it inexplicably gone. (Note -- while I use the term "DLL" here, I mean it in a non-Windows-specific sense, as it is used, e.g., in the help file for ?getLoadedDLLs. "Shared library" is perhaps the better generic term.)

(Some confirmation of my suspicions comes from the R for Windows FAQ, which reports that 'Windows locks [a] package's DLL while it is loaded' which can cause update.packages() to fail.)
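
One can at least see which shared libraries a session currently has mapped, via the getLoadedDLLs() function mentioned above; replacing any of these on disk mid-session is what risks the failures described:

    ## Paths of all shared libraries loaded by this session.
    dlls <- getLoadedDLLs()
    vapply(dlls, function(d) d[["path"]], character(1))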

I'm not sure exactly how R's lazy-load mechanism is implemented, but I imagine that it, too, could be messed up by the removal of objects that it expects to find at particular addresses in memory.

Someone else who knows more about the internals of computers will surely give a better answer than this, but those are my thoughts.

answered by Josh O'Brien


Here's a scenario I encountered yesterday on Windows 7.

  1. I am running an R session.
  2. I open the PDF of a package manual.
  3. I close all R sessions, but forget to close the package manual PDF.
  4. I open a new instance of R and run update.packages().

The install fails, of course, because Windows still has the PDF open and can't overwrite it...

answered by Kevin Wright