Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to execute R inside snakemake

Tags:

r

snakemake

I find the snakemake documentation quite terse. Since I've been asked this a few times I thought posting the question and my own answer.

How do you execute or integrate this R script in a snakemake rule?

my-script.R

CUTOFF <- 25

dat <- read.table('input-table.txt')

out <- dat[dat$V1 > CUTOFF, ]

write.table(out, 'output-table.txt')

Snakefile:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    params:
        cutoff= 25,
    # now what?        
like image 729
dariober Avatar asked May 10 '26 07:05

dariober


2 Answers

I'm going to propose three solutions, roughly in order from most canonical.

OPTION 1: Use the script directory in combination with the snakemake R object

Replace the actual filenames inside the R script with the S4 object snakemake which holds input, output, params etc:

my-script.R

CUTOFF <- snakemake@params[['cutoff']]

dat <- read.table(snakemake@input[['txt']])

out <- dat[dat$V1 > CUTOFF, ]

write.table(out, snakemake@output[['out']])

Similarly, wildcards values would be accessible via snakemake@wildcards[['some-wildcard']]

The rule would look like:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    params:
        cutoff= 25,
    script:
        '/path/to/my-script.R'

Note that the script filename must end in .R or .r to instruct snakemake that this is an R script.

Pros

  • This is the simplest option, just replace the actual filenames and parameters with snakemake@... in the R script

Cons

  • You may have many short scripts that make the pipeline difficult to follow. What does my-script.R actually do?

  • --printshellcmds option is not helpful here

  • Debugging the R script may be cluncky because of the embedded snakemake object


OPTION 2: Write a standalone R script executable via shell directive

For example using the argparse library for R:

my-script.R

#!/usr/bin/env Rscript

library(argparse)

parser <- ArgumentParser(description= 'This progrom does stuff')

parser$add_argument('--input', '-i', help= 'I am the input file')
parser$add_argument('--output', '-o', help= 'I am the output file')
parser$add_argument('--cutoff', '-c', help= 'Some filtering cutoff', type= 'double')

xargs<- parser$parse_args()

CUTOFF <- xargs$cutoff
dat <- read.table(xargs$input)
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, xargs$output)

The snakemake rule is like any other one executing shell commands:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    params:
        cutoff= 25,
    shell:
        r"""
        /path/to/my-script.R --input {input.txt} --output {output.out} --cutoff {params.cutoff}
        """

Pros

  • You get a standalone, nicely documented script that you can use elsewhere

  • --printshellcmds tells you what is being executed and you can re-run it outside snakemake

Cons

  • Some setting up to do via argparse

  • Not so easy to debug by opening the R interpreter and running the individual R commands


OPTION 3 Create a temporary R script as an heredoc that you run via Rscript:

This is all self-contained in the rule:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    params:
        cutoff= 25,
    shell:
        r"""
cat <<'EOF' > {rule}.$$.tmp.R

CUTOFF <- {params.cutoff}
dat <- read.table('{input.txt}')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, '{output.out}')

EOF
Rscript {rule}.$$.tmp.R
rm {rule}.$$.tmp.R
        """

Explanantion: The syntax cat <<'EOF' tells Bash that everything until the EOF marker is a string to be written to file {rule}.$$.tmp.R. Before running this shell script, snakemake will replace the placeholders {...} as usual. So file {rule}.$$.tmp.R will be one.$$.tmp.R where bash will replace $$ with the current process ID. The combination of {rule} and $$ reasonably ensures that each temporary R script has a distinct filename. NB: EOF is not a special keyword, you can use any marker string to delimit the heredoc.

The content of {rule}.$$.tmp.R will be:

CUTOFF <- 25
dat <- read.table('input-table.txt')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, 'output-table.txt')

after execution via Rscript, {rule}.$$.tmp.R will be deleted by rm provided that Rscripts exited clean.

Pros

  • Especially for short scripts, you see what the rule actually does. No need to look into other script files.

  • --printshellcmds option shows the exact R code. You can copy & paste it to the R interpreter as it is which is very useful for debugging and developing

Cons

  • Clunky, quite a bit of boiler plate code

  • If input/output/params are lists, you need to split them with e.g. strsplit('{params}', sep= ' ') inside the R script. This is not great especially if you have space inside the list items.

  • If Rscript fails the temp R script is not deleted and it litters your working dir.

like image 71
dariober Avatar answered May 12 '26 04:05

dariober


I've created a simple template for running snakemake with R - here's the Github link:

https://github.com/fritzbayer/snakemake-with-R

It shows two simple options for passing variables between snakemake and R.

like image 40
Fritz Avatar answered May 12 '26 02:05

Fritz