 

R - Connect Scripts via Pipes

Tags: unix, r, pipe

I have a number of R scripts that I would like to chain together using a UNIX-style pipeline. Each script would take as input a data frame and provide a data frame as output. For example, I am imagining something like this that would run in R's batch mode.

  cat raw-input.Rds | step1.R | step2.R | step3.R | step4.R > result.Rds

Any thoughts on how this could be done?

asked Dec 09 '25 by Nick Allen

2 Answers

Writing executable scripts is not the hard part; what is tricky is making the scripts read from files and/or pipes. I wrote a somewhat general function for this here: https://stackoverflow.com/a/15785789/1201032

Here is an example where the I/O takes the form of csv files:

Your step?.R files should look like this:

#!/usr/bin/Rscript

# Open a connection for reading: stdin when the argument is "-" or
# "/dev/stdin", a fifo for process substitution (/dev/fd/*), and a
# regular file otherwise.
OpenRead <- function(arg) {
   if (arg %in% c("-", "/dev/stdin")) {
      file("stdin", open = "r")
   } else if (grepl("^/dev/fd/", arg)) {
      fifo(arg, open = "r")
   } else {
      file(arg, open = "r")
   }
}

args  <- commandArgs(TRUE)
file  <- args[1]
fh.in <- OpenRead(file)

df.in <- read.csv(fh.in)
close(fh.in)

# do something
df.out <- df.in

# print output
write.csv(df.out, file = stdout(), row.names = FALSE, quote = FALSE)

and your csv input file should look like:

col1,col2
a,1
b,2

Now this should work:

cat in.csv | ./step1.R - | ./step2.R -

The - arguments are annoying but necessary. Also make sure to run something like chmod +x ./step?.R to make your scripts executable. Finally, you could store them (without the extension) in a directory that is on your PATH, so you can run the pipeline like this:

cat in.csv | step1 - | step2 -
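
One small variation (my assumption, not part of the original answer): have each script fall back to "-" when no argument is given, so the trailing dashes can be dropped.

# Assumed tweak to the argument handling above: default to stdin
# when no file argument is supplied.
args  <- commandArgs(TRUE)
file  <- if (length(args) >= 1) args[1] else "-"
fh.in <- OpenRead(file)

# the pipeline then no longer needs the dashes:
# cat in.csv | ./step1.R | ./step2.R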
answered Dec 12 '25 by flodel


Why on earth you would want to cram your workflow into pipes when you have the whole R environment available is beyond me.

Make a main.r containing the following:

source("step1.r")
source("step2.r")
source("step3.r")
source("step4.r")

That's it. You don't have to convert the output of each step into a serialised format; instead you can just leave all your R objects (datasets, fitted models, predicted values, lattice/ggplot graphics, etc) as they are, ready for the next step to process. If memory is a problem, you can rm any unneeded objects at the end of each step; alternatively, each step can work with an environment which it deletes when done, first exporting any required objects to the global environment.
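For instance, here is a minimal sketch (the file and object names are illustrative, not from the answer) of running one step inside its own environment and exporting only what later steps need:

# Run step2.r in a throwaway environment (illustrative names).
step_env <- new.env()
sys.source("step2.r", envir = step_env)   # suppose step2.r creates df.step2, among other objects
assign("df.step2", get("df.step2", envir = step_env), envir = globalenv())
rm(step_env)   # the remaining step objects can now be garbage-collected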


If modular code is desired, you can recast your workflow as follows. Encapsulate the work done by each file into one or more functions. Then call these functions in your main.r with the appropriate arguments.

source("step1.r")  # defines step1_read_input, step1_f2
source("step2.r")  # defines step2_f2
source("step3.r")  # defines step3_f1, step3_f2, step3_f3
source("step4.r")  # defines step4_write_output

step1_read_input(...)
step1_f2(...)
....
step4_write_output(...)
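
As a further variation (my assumption, not part of the original answer): if each step is refactored into a function that takes a data frame and returns one, the shell pipeline from the question maps directly onto R's native pipe (R 4.1 and later). The step names below are illustrative:

# Chain the steps with the native pipe instead of a shell pipeline.
result <- readRDS("raw-input.Rds") |>
  step1() |>
  step2() |>
  step3() |>
  step4()
saveRDS(result, "result.Rds")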
answered Dec 11 '25 by Hong Ooi


