When a file that was made earlier in the pipeline is removed, Snakemake does not seem to consider that a problem, as long as later files are there:
rule All:
    input: "testC1.txt", "testC2.txt"

rule A:
    input: "{X}{Y}.txt"
    output: "{X}A{Y}.txt"
    shell: "cp {input} {output}"

rule B:
    input: "{X}A{Y}.txt"
    output: "{X}B{Y}.txt"
    shell: "cp {input} {output}"

rule C:
    input: "{X}B{Y}.txt"
    output: "{X}C{Y}.txt"
    shell: "cp {input} {output}"
Save this Snakefile as test.sf and do this:
rm testA*.txt testB*.txt testC*.txt
echo "test1" >test1.txt
echo "test2" >test2.txt
snakemake -s test.sf
# Rerun:
snakemake -s test.sf
# Snakemake says all is up to date, which it is.
# Remove intermediate results:
rm testA1.txt
# Rerun:
snakemake -s test.sf
Snakemake says all is up to date. It does not detect the missing testA1.txt.
I seem to recall something about this in the online Snakemake manual, but I can no longer find it.
I assume this is expected Snakemake behavior. It can sometimes be desired behavior, but sometimes you may want Snakemake to detect and rebuild the missing file. How can this be done?
As mentioned in this other answer, the -R parameter can help, but there are more options:
When you call snakemake -F, this will trigger a rebuild of the whole pipeline. This basically means: forget all intermediate files and start anew. It will definitely (re)generate all intermediate files along the way. The downside: it might take some time.
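For the running example, that is (a sketch; the comments describe the expected effect):
snakemake -F -s test.sf
# Expected: testA*.txt, testB*.txt and testC*.txt are all regenerated
# from test1.txt and test2.txt, regardless of what already exists.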
This is the realm of the -R <rule> parameter, which re-runs the given rule and all rules that depend on it. So in your case
snakemake -R A -s test.sf
would rerun rule A (to build testA1.txt from test1.txt) and the rules B, C and All, since they depend on A. Mind that this runs all instances of rule A that are required, so in your example testA2.txt and everything that follows from it is also rebuilt.
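You can preview what -R would schedule before committing to it (a sketch; -n is Snakemake's dry-run flag):
snakemake -n -R A -s test.sf
# Dry run: lists the jobs for rules A, B, C and All that would be executed,
# without actually building anything.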
If, in your example, you had removed testB1.txt instead, only rules B and C would have been rerun.
If I remember correctly, Snakemake detects whether a file needs to be rebuilt by its modification time (mtime). So if you have a version of testA1.txt that is newer (as in more recently modified) than testB1.txt, then testB1.txt has to be rebuilt using rule B to ensure everything is up to date. Hence, you cannot easily rebuild only testA1.txt without also rebuilding all following files, unless you somehow change the files' mtimes.
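This mtime behaviour is easy to observe with a dry run (a sketch; nothing is rebuilt):
touch testA1.txt          # make testA1.txt newer than testB1.txt
snakemake -n -s test.sf   # dry run: jobs for B, C and All should now be listed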
I have not tried this out, but this can be done with Snakemake's --touch parameter. If you manage to run only rule A and then run snakemake -R B -t, which touches all output files of rule B and the following rules, you could get a valid workflow state without actually rerunning all the steps in between.
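A possible sequence for the example, equally untested, is to request the missing file directly as a target (file targets on the command line are standard Snakemake usage) and then touch everything downstream:
snakemake -s test.sf testA1.txt   # rebuild only the missing intermediate
snakemake -s test.sf -R B -t      # touch the outputs of rule B and onwards
snakemake -s test.sf              # should now report everything up to date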
I found this thread a while ago about the --forcerun/-R parameter that might be informative.
Ultimately, Snakemake will force execution of the entire pipeline if you want to regenerate that intermediate file, unless you have a separate rule for it or include it as a target in rule All.
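A concrete way to do the latter (my sketch, not part of the original answer) is to list the intermediates as inputs of rule All, so that a missing testA1.txt makes the default target out of date:
rule All:
    input:
        "testC1.txt", "testC2.txt",
        # listing intermediates makes their absence trigger a rebuild:
        "testA1.txt", "testA2.txt",
        "testB1.txt", "testB2.txt"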