I recently decided to start using Snakemake. I can't find anything that fits my needs, either on Stack Overflow or in the Snakemake docs. I feel like I'm misunderstanding something and could use some explanation.
I am trying to build a simple Snakemake workflow that takes a fastq file and a sequencing-summary file (which contains information about the reads) as input, and splits the reads in the fastq into several files (low.fastq and high.fastq).
My input data and the Snakefile I'm trying to execute are laid out like this:
.
├── data
│ ├── sequencing-summary-example.txt
│ └── tiny-example.fastq
├── Snakefile
└── split_fastq
And this is what I've tried so far:
*imports*

rule targets:
    input:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"
    run:
        * do the thing *
I expected to get a "split_fastq" directory containing a "low" and a "high" fastq. Instead I got this error:
Building DAG of jobs...
WildcardError in line 10 of /work/sbsuser/test/roxane/alignement-ont/Snakefile:
Wildcards in input files cannot be determined from output files:
'reads'
Even though this seems to be a very common error, I'm not sure whether I'm misunderstanding how to use wildcards or whether there is another problem. Am I using "input" and "output" correctly?
The problem is that you have a wildcard in the input but not in the output. Every wildcard used in the input must also appear in the output. Think about it this way: by putting the wildcard in the input, you're creating a rule that is meant to be run individually on many different fastq files. But the output files of that rule would be exactly the same for each of those fastq files. They'd overwrite each other! You want to incorporate the wildcard into your output files so that you get a unique file for each possible input, for example:
rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        * do the thing *
Now, with tiny-example.fastq as your input, you'll get tiny-example.low.fastq and tiny-example.high.fastq as output. And if you add a second fastq file, you'll get different high and low output files for that one. But this rule still won't work, because the "seqsum" wildcard is also not part of the output. What you'll probably want to do in this case is have the name of the summary file (sequencing-summary-example.txt in your data directory) incorporate the name of the fastq file, for example by renaming it to sequence-summary-tiny-example.txt. Now you can write your rule like this:
rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/sequence-summary-{reads}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        * do the thing *
And if you then add an other-example.fastq and a sequence-summary-other-example.txt, your Snakemake pipeline should be able to create other-example.low.fastq and other-example.high.fastq.
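One thing the rule above does not cover is the first rule, targets, which still asks for split_fastq/low.fastq and split_fastq/high.fastq. Once the outputs contain {reads}, the targets have to name concrete files such as split_fastq/tiny-example.low.fastq. Since a Snakefile accepts ordinary Python, one way to build that list is sketched below. The helper name split_targets and the variable TARGETS are made up for illustration, and the file naming follows the sequence-summary-{reads}.txt convention suggested above; treat it as a starting point rather than the one correct way.

```python
from pathlib import Path


def split_targets(data_dir="data", out_dir="split_fastq"):
    """List the low/high outputs for every fastq that has a matching summary file.

    split_targets is a hypothetical helper, not part of Snakemake.
    """
    targets = []
    for fastq in sorted(Path(data_dir).glob("*.fastq")):
        reads = fastq.stem  # "tiny-example" for data/tiny-example.fastq
        summary = Path(data_dir) / f"sequence-summary-{reads}.txt"
        if not summary.exists():
            continue  # no summary for this fastq, so the rule could not run
        for level in ("low", "high"):
            targets.append(f"{out_dir}/{reads}.{level}.fastq")
    return targets


TARGETS = split_targets()

# In the Snakefile, the first rule would then request these files:
#
# rule targets:
#     input:
#         TARGETS
```

Snakemake also ships its own helpers for this kind of thing (glob_wildcards and expand), which are worth a look in the docs if you prefer not to hand-roll the list.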
Snakemake always thinks backwards from how we tend to think. We first think about the input, and then about what output it creates. But Snakemake knows what file it needs to make, and it tries to figure out what input it needs to make it. So in your original rule, it knew it needed to make low.fastq, and it saw that the split_fastq rule could make that, but then it didn't know what the wildcard "reads" in the input should be. Now, with the new rule, it knows it needs to make tiny-example.low.fastq and sees that split_fastq can create output files matching the template {reads}.low.fastq, so it says "Hey, if I make reads = tiny-example, then I can use this rule!" and then it looks at the input and says "Ok, since for input I need {reads}.fastq and I know reads = tiny-example, that means for input I need tiny-example.fastq, and I have that!"
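To make that backwards reasoning concrete, here is a small plain-Python sketch of the idea. It is not Snakemake's actual implementation: the function names template_to_regex and resolve_inputs are invented for this example, and the only Snakemake behaviour it borrows is that a wildcard matches ".+" by default. It matches a requested file against an output template, reads off the wildcard values, and fills them into the input templates.

```python
import re


def template_to_regex(template):
    """Turn "split_fastq/{reads}.low.fastq" into a regex with one named group per wildcard."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    seen = set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)       # literal text
        elif part in seen:
            pattern += f"(?P={part})"        # repeated wildcard must match the same value
        else:
            seen.add(part)
            pattern += f"(?P<{part}>.+)"     # first occurrence of a wildcard
    return re.compile(pattern)


def resolve_inputs(target, output_template, input_templates):
    """Return the concrete input files needed to build target, or None if the rule cannot make it."""
    match = template_to_regex(output_template).fullmatch(target)
    if match is None:
        return None
    wildcards = match.groupdict()
    try:
        return [t.format(**wildcards) for t in input_templates]
    except KeyError as missing:
        # An input wildcard that never appears in the output has no value.
        raise ValueError(
            f"Wildcards in input files cannot be determined from output files: {missing}"
        )


new_rule_inputs = resolve_inputs(
    "split_fastq/tiny-example.low.fastq",
    "split_fastq/{reads}.low.fastq",
    ["data/{reads}.fastq", "data/sequence-summary-{reads}.txt"],
)
print(new_rule_inputs)
```

Calling resolve_inputs with the original output template "split_fastq/low.fastq" and the input "data/{reads}.fastq" raises the ValueError, which mirrors the WildcardError from the question: nothing in the output tells it what "reads" should be.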