I find the snakemake documentation quite terse. Since I've been asked this a few times I thought posting the question and my own answer.
How do you execute or integrate this R script in a snakemake rule?
my-script.R
CUTOFF <- 25
dat <- read.table('input-table.txt')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, 'output-table.txt')
Snakefile:
rule one:
input:
txt= 'input-table.txt',
output:
out= 'output-table.txt',
params:
cutoff= 25,
# now what?
I'm going to propose three solutions, roughly in order from most canonical.
OPTION 1: Use the script directory in combination with the snakemake R object
Replace the actual filenames inside the R script with the S4 object snakemake
which holds input, output, params etc:
my-script.R
CUTOFF <- snakemake@params[['cutoff']]
dat <- read.table(snakemake@input[['txt']])
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, snakemake@output[['out']])
Similarly, wildcards values would be accessible via snakemake@wildcards[['some-wildcard']]
The rule would look like:
rule one:
input:
txt= 'input-table.txt',
output:
out= 'output-table.txt',
params:
cutoff= 25,
script:
'/path/to/my-script.R'
Note that the script filename must end in .R or .r to
instruct snakemake that this is an R script.
Pros
snakemake@... in the R scriptCons
You may have many short scripts that make the pipeline difficult to follow.
What does my-script.R actually do?
--printshellcmds option is not helpful here
Debugging the R script may be cluncky because of the embedded snakemake object
OPTION 2: Write a standalone R script executable via shell directive
For example using the argparse library for R:
my-script.R
#!/usr/bin/env Rscript
library(argparse)
parser <- ArgumentParser(description= 'This progrom does stuff')
parser$add_argument('--input', '-i', help= 'I am the input file')
parser$add_argument('--output', '-o', help= 'I am the output file')
parser$add_argument('--cutoff', '-c', help= 'Some filtering cutoff', type= 'double')
xargs<- parser$parse_args()
CUTOFF <- xargs$cutoff
dat <- read.table(xargs$input)
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, xargs$output)
The snakemake rule is like any other one executing shell commands:
rule one:
input:
txt= 'input-table.txt',
output:
out= 'output-table.txt',
params:
cutoff= 25,
shell:
r"""
/path/to/my-script.R --input {input.txt} --output {output.out} --cutoff {params.cutoff}
"""
Pros
You get a standalone, nicely documented script that you can use elsewhere
--printshellcmds tells you what is being executed and you can re-run it outside snakemake
Cons
Some setting up to do via argparse
Not so easy to debug by opening the R interpreter and running the individual R commands
OPTION 3 Create a temporary R script as an heredoc that you run via Rscript:
This is all self-contained in the rule:
rule one:
input:
txt= 'input-table.txt',
output:
out= 'output-table.txt',
params:
cutoff= 25,
shell:
r"""
cat <<'EOF' > {rule}.$$.tmp.R
CUTOFF <- {params.cutoff}
dat <- read.table('{input.txt}')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, '{output.out}')
EOF
Rscript {rule}.$$.tmp.R
rm {rule}.$$.tmp.R
"""
Explanantion: The syntax cat <<'EOF' tells Bash that everything until the EOF
marker is a string to be written to file {rule}.$$.tmp.R. Before running this
shell script, snakemake will replace the placeholders {...} as usual. So file
{rule}.$$.tmp.R will be one.$$.tmp.R where bash will replace $$ with the
current process ID. The combination of {rule} and $$ reasonably ensures
that each temporary R script has a distinct filename. NB: EOF is not a special
keyword, you can use any marker string to delimit the heredoc.
The content of {rule}.$$.tmp.R will be:
CUTOFF <- 25
dat <- read.table('input-table.txt')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, 'output-table.txt')
after execution via Rscript, {rule}.$$.tmp.R will be deleted by rm provided
that Rscripts exited clean.
Pros
Especially for short scripts, you see what the rule actually does. No need to look into other script files.
--printshellcmds option shows the exact R code. You can copy & paste it
to the R interpreter as it is which is very useful for debugging and developing
Cons
Clunky, quite a bit of boiler plate code
If input/output/params are lists, you need to split them with e.g.
strsplit('{params}', sep= ' ') inside the R script. This is not great
especially if you have space inside the list items.
If Rscript fails the temp R script is not deleted and it litters your working dir.
I've created a simple template for running snakemake with R - here's the Github link:
https://github.com/fritzbayer/snakemake-with-R
It shows two simple options for passing variables between snakemake and R.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With