
Can SnakeMake be forced to rerun rules when files are missing?

When a file that was made earlier in the pipeline is removed, SnakeMake does not seem to consider that a problem, as long as later files are there:

rule All:
    input: "testC1.txt", "testC2.txt"

rule A:
    input: "{X}{Y}.txt"
    output: "{X}A{Y}.txt"
    shell: "cp {input} {output}"

rule B:
    input: "{X}A{Y}.txt"
    output: "{X}B{Y}.txt"
    shell: "cp {input} {output}"

rule C:
    input: "{X}B{Y}.txt"
    output: "{X}C{Y}.txt"
    shell: "cp {input} {output}"

Save this SnakeFile in test.sf and do this:

rm testA*.txt testB*.txt testC*.txt
echo "test1" >test1.txt
echo "test2" >test2.txt
snakemake -s test.sf
# Rerun:
snakemake -s test.sf
# SnakeMake says all is up to date, which it is.
# Remove intermediate results:
rm testA1.txt
# Rerun:
snakemake -s test.sf

SnakeMake says all is up to date. It does not detect missing testA1.txt.

I seem to recall something in the online SnakeMake manual about this, but I can no longer find it.

I assume this is expected SnakeMake behavior. It can sometimes be desired behavior, but sometimes you may want it to detect and rebuild the missing file. How can this be done?

tedtoal asked Aug 31 '17 20:08



2 Answers

As mentioned in this other answer, the -R parameter can help, but there are more options:

Force a rebuild of the whole workflow

When you call

snakemake -F

this will trigger a rebuild of the whole pipeline. Snakemake ignores all existing intermediate files and starts anew, so every intermediate file is regenerated along the way. The downside is that it may take a long time.

Force a specific rule

This is the realm of the -R <rule> parameter (long form --forcerun). It reruns the given rule and all rules that depend on it. So in your case

snakemake -R A -s test.sf

would rerun rule A (to build testA1.txt from test1.txt) and the rules B, C and All, since they depend on A. Mind that this runs all instances of rule A that are required, so in your example testA2.txt and everything that follows from it is also rebuilt.

If, in your example, you had removed testB1.txt instead, forcing rule B would have rerun only the rules B and C.

Why does this happen?

If I remember correctly, snakemake decides whether a file needs to be rebuilt by comparing modification times (mtime). So if you have a version of testA1.txt that is younger (as in more recently modified) than testB1.txt, testB1.txt has to be rebuilt using rule B to ensure everything is up to date. Hence, you cannot easily rebuild only testA1.txt without also rebuilding all following files, unless you somehow change the files' modification times.
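To make the timestamp reasoning concrete, here is a minimal sketch of a make-style staleness check in Python. It is a simplified model for illustration only, not Snakemake's actual implementation; the `is_stale` and `demo` names and the fixed timestamps are assumptions. It shows why a missing testA1.txt does not make testB1.txt look out of date, while a freshly rebuilt testA1.txt does.

```python
import os
import tempfile


def is_stale(output, inputs):
    """Simplified check: stale if the output is missing or older than an existing input."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(
        os.path.getmtime(path) > out_mtime
        for path in inputs
        if os.path.exists(path)  # a missing input is skipped, it never triggers a rebuild
    )


def write_with_mtime(path, mtime):
    """Create a file and pin its modification time (hypothetical helper for the demo)."""
    with open(path, "w") as handle:
        handle.write("test1\n")
    os.utime(path, (mtime, mtime))


def demo():
    """Replay the question's scenario for testA1.txt -> testB1.txt in a temp directory."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        file_a = os.path.join(tmp, "testA1.txt")
        file_b = os.path.join(tmp, "testB1.txt")
        write_with_mtime(file_a, 1_000_000_000)
        write_with_mtime(file_b, 1_000_000_100)  # B was built after A
        results["untouched"] = is_stale(file_b, [file_a])

        os.remove(file_a)  # the intermediate file disappears
        results["a_removed"] = is_stale(file_b, [file_a])

        write_with_mtime(file_a, 1_000_000_200)  # A is rebuilt, now newer than B
        results["a_rebuilt"] = is_stale(file_b, [file_a])
    return results


if __name__ == "__main__":
    print(demo())
```

In this model only the third case reports testB1.txt as stale, which matches the behavior described in the question and in the paragraph above.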

I have not tried this out, but this can be done with snakemake's --touch parameter. If you manage to run only rule A and then run snakemake -R B -t, which touches all output files of rule B and the rules that follow, you could get a valid workflow state without actually rerunning all steps in between.

m00am answered Oct 15 '22 12:10


I found this thread a while ago about the --forcerun/-R parameter that might be informative.

Ultimately, unless you have a separate rule for that intermediate file or include it as a target in rule All, regenerating it means letting snakemake force execution of the entire pipeline.
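One way to follow the "include it as a target" idea without editing the Snakefile is to pass the missing files to snakemake as explicit targets on the command line. Below is a rough sketch, assuming the file naming from the question (testA1.txt, testB1.txt, ...) and the test.sf Snakefile; the helper names are hypothetical and the script only prints the command rather than running it.

```python
import os


def expected_files(samples, stages=("A", "B", "C"), prefix="test"):
    """List every intermediate and final file the question's rules A, B and C produce."""
    return [f"{prefix}{stage}{sample}.txt" for sample in samples for stage in stages]


def missing_files(paths, exists=os.path.exists):
    """Return the paths that are not present on disk."""
    return [path for path in paths if not exists(path)]


def snakemake_command(targets, snakefile="test.sf"):
    """Build a snakemake call that requests the given files as explicit targets."""
    return ["snakemake", "-s", snakefile, *targets]


if __name__ == "__main__":
    missing = missing_files(expected_files(["1", "2"]))
    if missing:
        # Print the command instead of running it, so it can be inspected first.
        print(" ".join(snakemake_command(missing)))
    else:
        print("No expected files are missing.")
```

After removing testA1.txt, this would print snakemake -s test.sf testA1.txt. Because the rebuilt testA1.txt is then newer than testB1.txt, a following plain snakemake -s test.sf run should also rerun B and C for that sample.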

Jon Chung answered Oct 15 '22 13:10