 

How to properly use wildcards in input and output

Tags: snakemake

I recently started using Snakemake. I can't find anything that fits my needs, either on Stack Overflow or in the Snakemake docs. I feel like I'm misunderstanding something and could use an explanation.

I am trying to build a simple Snakemake workflow that takes a fastq file and a sequencing-summary file (which contains information about the reads) as input and splits the reads from the fastq file into several files (low.fastq and high.fastq).

My input data and the Snakefile I'm trying to execute are laid out like this:

.
├── data
│   ├── sequencing-summary-example.txt 
│   └── tiny-example.fastq 
├── Snakefile
└── split_fastq

And this is what I've tried so far:

*imports*
rule targets:
    input:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"
    run:
        * do the thing *

I expected the "split_fastq" directory to be filled with a "low" and a "high" fastq. Instead I got this error:

Building DAG of jobs...
WildcardError in line 10 of /work/sbsuser/test/roxane/alignement-ont/Snakefile:
Wildcards in input files cannot be determined from output files:
'reads'

Even though this seems to be a very common error, I'm not sure whether I misunderstand how to use wildcards or whether there is another problem. Am I using "input" and "output" correctly?

asked Aug 27 '19 13:08 by Roxane


1 Answer

The problem is that you have a wildcard in the input but not in the output. Every wildcard used in a rule's input must also appear in its output. Think about it this way: by putting the wildcard in the input, you're creating a rule that you intend to run individually on many different fastq files. But the output files for that rule would be exactly the same for each of those fastq files. They'd overwrite each other! You want to incorporate the wildcard into your output files so you get a unique file for each possible input, for example:

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        * do the thing *

Now with tiny-example.fastq as your input, you'll get tiny-example.low.fastq and tiny-example.high.fastq as output. And if you add a second fastq file, you'll get separate high and low output files for that one. But this rule still won't work, because the "seqsum" wildcard is not part of the output either. What you'll probably want to do in this case is have the name of the sequencing-summary file incorporate the name of the fastq file, for example by renaming sequencing-summary-example.txt to sequencing-summary-tiny-example.txt. Now you can write your rule like this:

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/sequencing-summary-{reads}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        * do the thing *

And if you then add an other-example.fastq and a sequencing-summary-other-example.txt, your Snakemake pipeline should be able to create other-example.low.fastq and other-example.high.fastq.
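One thing to watch: the targets rule from the question still asks for split_fastq/low.fastq, which the rewritten rule no longer produces, so the targets need the sample name too. A minimal sketch of one way to do that, using Snakemake's expand() helper. The SAMPLES list is a hypothetical name introduced here for illustration, filled in by hand with the one example file from the question:

```python
# Snakefile fragment (Snakemake syntax, a superset of Python).
# SAMPLES is a hypothetical, hand-maintained list of fastq basenames.
SAMPLES = ["tiny-example"]

rule targets:
    input:
        # expand() fills the template with every combination of the values,
        # here: tiny-example.low.fastq and tiny-example.high.fastq
        expand("split_fastq/{reads}.{level}.fastq",
               reads=SAMPLES, level=["low", "high"])
```

Adding "other-example" to SAMPLES would then request the low and high files for that sample as well.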

Snakemake always thinks backwards from how we tend to think. We first think about the input, and then what output it creates. But Snakemake knows what file it needs to make, and it's trying to figure out what input it needs to make it. So in your original rule, it knew it needed to make low.fastq, and it saw that the split_fastq rule could make that, but then it didn't know what the wildcard "reads" in the input should be. Now, in the new rule, it knows it needs to make tiny-example.low.fastq and sees that split_fastq can create output files of the template {reads}.low.fastq, so it says "Hey, if I make reads = tiny-example, then I can use this rule!" and then it looks at the input and says "Ok, since for input I need {reads}.fastq and I know reads = tiny-example then that means for input I need tiny-example.fastq, and I have that!"
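The backwards reasoning described above can be imitated in a few lines of plain Python. This is only a toy sketch of the idea, not Snakemake's actual implementation: match_template and resolve_inputs are hypothetical helper names, and it assumes a wildcard matches any non-empty string.

```python
import re


def match_template(template, path):
    """Return the wildcard values that make `template` equal `path`, or None."""
    pattern = ""
    pos = 0
    seen = set()
    for m in re.finditer(r"\{(\w+)\}", template):
        name = m.group(1)
        # Text between wildcards must match literally.
        pattern += re.escape(template[pos:m.start()])
        if name in seen:
            # A repeated wildcard must take the same value everywhere.
            pattern += f"(?P={name})"
        else:
            pattern += f"(?P<{name}>.+)"
            seen.add(name)
        pos = m.end()
    pattern += re.escape(template[pos:])
    found = re.fullmatch(pattern, path)
    return found.groupdict() if found else None


def resolve_inputs(output_templates, input_templates, target):
    """Work backwards from a requested file to the inputs a rule would need."""
    for out in output_templates:
        wildcards = match_template(out, target)
        if wildcards is not None:
            # Raises KeyError if an input uses a wildcard the output never set.
            return [inp.format(**wildcards) for inp in input_templates]
    return None  # this rule cannot produce the target


new_outputs = ["split_fastq/{reads}.low.fastq", "split_fastq/{reads}.high.fastq"]
print(resolve_inputs(new_outputs, ["data/{reads}.fastq"],
                     "split_fastq/tiny-example.low.fastq"))
# prints ['data/tiny-example.fastq']

try:
    # The original rule: fixed output names, wildcard only in the input.
    resolve_inputs(["split_fastq/low.fastq"], ["data/{reads}.fastq"],
                   "split_fastq/low.fastq")
except KeyError as missing:
    print("cannot determine wildcard:", missing)
# prints cannot determine wildcard: 'reads'
```

The second call fails for the same reason the original Snakefile did: the output matches, but nothing in it says what "reads" should be.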

answered Nov 08 '22 23:11 by Colin