This question follows on from a question I asked previously and it regards understanding how to access config files correctly using Snakemake. I have a specific problem I need to address which I'll ask first and a general problem understanding how indexing works which I'll ask second.
I'm using snakemake to run and ATAC-seq pipeline from Alignment/QC through to motif analysis.
A: Specific Question
I'm trying to add a rule called trim_galore_pe
to trim adapters from my fastq files before alignment and an error statement is thrown from snakemake as the names of the output files generated by trim galore
do not match what is expected by snakemake. This is because I cannot work out how to write the output file statement correctly in my snakemake file to make the names match.
An example of the names generated by TRIM GALORE
contain SRA numbers, for example:
trimmed_fastq_files/SRR2920475_1_val_1.fq.gz
Whereas the file expected by snakemake contain sample references and should read:
trimmed_fastq_files/Corces2016_4983.7B_Mono_1_val_1.fq.gz
This also affects the subsequent rules after the trim_galore_pe
rule. I need to work out a way to use the info in my config file to generate the output files required.
For all rules after those shown in the Snakefile I need files to be named by sample name i.e. Corces2016_4983.7A_Mono
. It would also be useful for all the FAST_QC
and MULTIQC
rules shown in the Snakefile below to have the sample names in the output file name structure, which they all already do in the current Snakefile.
However, inputs for Bowtie2, the FASTQC rules and input and output of the trim_galore_pe
rules need to contain the SRA numbers. The problem starts at the output of trim_galore
and influences all downstream rules.
Although I have extracted SRA numbers in previous rules, I'm not sure how to do this when not using the fastq_files
folder which is explicitly stated in the config file. By introducing the trim_galore_pe
rule I have effectively moved a new set of SRA files into the new trimmed_fastq_files
folder. How to extract only the SRA number from the list of SRA files config file containing the old folder names whilst referencing the new trimmed_fastq_files
folder in the Snakefile is the crux of my issue.
I hope this is clear.
Here is my config file:
samples:
Corces2016_4983.7A_Mono: fastq_files/SRR2920475
Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
Mono:
- Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19
Here is my Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/PRETRIM/fastq_multiqc.html",
"FastQC/POSTTRIM/fastq_multiqc.html"
rule fastqc_pretrim:
input:
sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
wrapper:
"0.23.1/bio/fastqc"
rule multiqc_fastq_pretrim:
input:
expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/PRETRIM/fastq_multiqc.html"
wrapper:
"0.23.1/bio/multiqc"
rule trim_galore_pe:
input:
sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
"trimmed_fastq_files/{sample}_1_val_1.fq.gz",
"trimmed_fastq_files/{sample}_1.fastq.gz_trimming_report.txt",
"trimmed_fastq_files/{sample}_2_val_2.fq.gz",
"trimmed_fastq_files/{sample}_2.fastq.gz_trimming_report.txt"
params:
extra="--illumina -q 20"
log:
"logs/trim_galore/{sample}.log"
wrapper:
"0.23.1/bio/trim_galore/pe"
rule fastqc_posttrim:
input:
"trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
wrapper:
"0.23.1/bio/fastqc"
rule multiqc_fastq_posttrim:
input:
expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/POSTTRIM/fastq_multiqc.html"
wrapper:
"0.23.1/bio/multiqc"
rule bowtie2:
input:
"trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.23.1/bio/bowtie2/align"
This currently runs, and give a full job list using snakemake -np
, but throws the error mentioned above.
B: General question
Is there an online resource that explains succinctly how to reference a config file using python, particularly with reference to snakemake? The online docs are pretty insufficient and assume prior knowledge of python.
My programming experience is mainly in bash and R but I like Snakemake and do generally understand how dictionaries and lists work in python and how to reference items stored within them. However I find the complex use of bracketing, wildcards and inverted commas in some of the Snakemake rules above confusing so tend to struggle when trying to reference different parts of file names in the config file. I want to understand fully how to utilise these elements.
For example, in a rule such as this from the Snakefile posted above:
sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
What is actually happening in this command? My understanding is that we are accessing the config file using config['samples']
and we are using the [wildcards.sample]
part to explicitly access the fastq_files/SRR2920475
part of the config file. The expand allows us to iterate through each item in the config file that fit the parameters in the command, i.e all the SRA files, and the lambda wildcards is needed to use the wildcards
call in the command. What I'm uncertain about is:
f
do just after the expand and why is it needed?config['samples']
contain inverted commas within the square brackets but but inverted commas are not needed around [wildcards.sample]
?num
, but again these numbers are sometimes enclosed around inverted commas and sometimes not...why?Any advice, tips, pointers would be greatly appreciated.
C: Clarification on suggestions made below by @bli
I have edited my config file as you suggested in your comment and omitted the folder names leaving only the SRA numbers. This make sense to me, but I have a couple of other issues preventing me getting this Snakefile running.
New config file:
samples:
Corces2016_4983.7A_Mono: SRR2920475
Corces2016_4983.7B_Mono: SRR2920476
cell_types:
Mono:
- Corces2016_4983.7A
index: /home/c1477909/genomes_and_index_files/hg19
New Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/PRETRIM/fastq_multiqc.html",
"FastQC/POSTTRIM/fastq_multiqc.html",
rule fastqc_pretrim:
input:
lambda wildcards: f"fastq_files/{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
wrapper:
"0.23.1/bio/fastqc"
rule multiqc_fastq_pretrim:
input:
expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/PRETRIM/fastq_multiqc.html"
wrapper:
"0.23.1/bio/multiqc"
rule trim_galore_pe:
input:
lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
"trimmed_fastq_files/{wildcards.sample}_1_val_1.fq.gz",
"trimmed_fastq_files/{wildcards.sample}_1.fastq.gz_trimming_report.txt",
"trimmed_fastq_files/{wildcards.sample}_2_val_2.fq.gz",
"trimmed_fastq_files/{wildcards.sample}_2.fastq.gz_trimming_report.txt"
params:
extra="--illumina -q 20"
log:
"logs/trim_galore/{sample}.log"
wrapper:
"0.23.1/bio/trim_galore/pe"
rule fastqc_posttrim:
input:
lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
wrapper:
"0.23.1/bio/fastqc"
rule multiqc_fastq_posttrim:
input:
expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/POSTTRIM/fastq_multiqc.html"
wrapper:
"0.23.1/bio/multiqc"
rule bowtie2:
input:
lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.23.1/bio/bowtie2/align"
Using these new files everything worked fine initially, a partial job list was created by snakemake -np
. However, this is because half of the complete job list had already been run; that is the trimmed_fastq_files
folder was generated and the correctly named trimmed fastq files were in place within it. When I deleted all the previously created files to see if the entire new version of the Snakefile would work properly, snakemake -np
failed, stating that there were missing input files for the rules downstream of the trim_galore_pe
rule.
As you can see I'm trying to call the {wildcard.sample}
variable set in the input section of the trim_galore_pe
rule in the output section, but snakemake doesn't like this. Is is possible to do this?
I also tried this using the tips from the answers below but this didn't work either:
rule trim_galore_pe:
input:
sample=lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2]),
expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz_trimming_report.txt", num=[1,2])
params:
extra="--illumina -q 20"
log:
"logs/trim_galore/{sample}.log"
wrapper:
"0.23.1/bio/trim_galore/pe"
The error then stated wildcards not defined
. So, logically I tried putting lambda wildcards:
in front of the two expand statements of the output section in an attempt to define the wildcards, but this threw a syntax error, Only input files can be specified as functions
. I also tried using some of the indexing suggestions below but couldn't get the right combination.
This is probably caused by another thing I'm unsure about regarding Snakefiles and that is how scoping works.
rule all
can all other rules access it? {wildcard.sample}
variable if I defined it in the input section? Is that because that variable is enclosed within a 'closed' scope lambda function?Any (further) suggestions would be greatly appreciated.
I'll try to answer your question B, and give extra details that I hope can be useful for you and others.
Edit: I added some attempts at answering question C at the end.
First, regarding what you call "inverted" commas, they are usually called "single quotes", and they are used in python to build strings. Double quotes can also be used for the same purpose. The main difference is when you try to create strings that contain quotes. Using double quotes allows you to create strings containing single quotes, and vice-versa. Otherwise, you need to "escape" the quote using backslashes ("\"):
s1 = 'Contains "double quotes"'
s1_bis = "Contains \"double quotes\""
s2 = "Contains 'single quotes'"
s2_bis = 'Contains \'single quotes\''
(I tend to prefer double quotes, that's just a personal taste.)
rule trim_galore_pe:
input:
sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
You are assigning a function (lambda wildcards: ...
) to a variable (sample
), which happens to belong to the input section of a rule.
This will cause snakemake to use this function when it comes to determine the input of a particular instance of the rule, based on the current values of the wildcards (as inferred from the current value of the output it wants to generate).
For clarity, one could very likely rewrite this by separating the function definition from the rule declaration, without using the lambda
construct, and it would work identically:
def determine_sample(wildcards):
return expand(
f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz",
num=[1,2])
rule trim_galore_pe:
input:
sample = determine_sample
expand
is a snakemake-specific function (but you can import it in any python program or interactive interpreter with from snakemake.io import expand
), that makes it easier to generate lists of strings. In the following interactive python3.6 session we will try to reproduce what happens when you use it, using different native python constructs.
# We'll try to see how `expand` works, we can import it from snakemake
from snakemake.io import expand
# We want to see how it works using the following example
# expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
# To make the example work, we will first simulate the reading
# of a configuration file
import yaml
config_text = """
samples:
Corces2016_4983.7A_Mono: fastq_files/SRR2920475
Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
Mono:
- Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19
"""
# Here we used triple quotes, to have a readable multi-line string.
# The following is equivalent to what snakemake does with the configuration file:
config = yaml.load(config_text)
config
Output:
{'cell_types': {'Mono': ['Corces2016_4983.7A']},
'index': '/home/genomes_and_index_files/hg19',
'samples': {'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}}
We obtained a dictionary in which the key "samples" is associated with a nested dictionary.
# We can access the nested dictionary as follows
config["samples"]
# Note that single quotes could be used instead of double quotes
# Python interactive interpreter uses single quotes when it displays strings
Output:
{'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}
# We can access the value corresponding to one of the keys
# again using square brackets
config["samples"]["Corces2016_4983.7A_Mono"]
Output:
'fastq_files/SRR2920475'
# Now, we will simulate a `wildcards` object that has a `sample` attribute
# We'll use a namedtuple for that
# https://docs.python.org/3/library/collections.html#collections.namedtuple
from collections import namedtuple
Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
wildcards.sample
Output:
'Corces2016_4983.7A_Mono'
Edit (15/11/2018): I found out a better way of creating wildcards:
from snakemake.io import Wildcards
wildcards = Wildcards(fromdict={"sample": "Corces2016_4983.7A_Mono"})
# We can use this attribute as a key in the nested dictionary
# instead of using directly the string
config["samples"][wildcards.sample]
# No quotes here: `wildcards.sample` is a string variable
Output:
'fastq_files/SRR2920475'
expand
# Now, the expand of the example works, and it results in a list with two strings
expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
# Note: here, single quotes are used for the string "sample",
# in order not to close the opening double quote of the whole string
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
# Internally, I think what happens is something similar to the following:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"
# This template is then used for each element of this "list comprehension"
[filename_template.format(num=num) for num in [1, 2]]
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
# This is equivalent to building the list using a for loop:
filenames = []
for num in [1, 2]:
filename = filename_template.format(num=num)
filenames.append(filename)
filenames
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
# It is interesting to have a look at `filename_template`
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
# The part between curly braces can be substituted
# during a string formatting operation:
"fastq_files/SRR2920475_{num}.fastq.gz".format(num=1)
Output:
'fastq_files/SRR2920475_1.fastq.gz'
Now let's further show how string formatting can be used.
# In python 3.6 and above, one can create formatted strings
# in which the values of variables are interpreted inside the string
# if the string is prefixed with `f`.
# That's what happens when we create `filename_template`:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
Two substitutions happened during the formatting of the string:
The value of config['samples'][wildcards.sample]
was used to make the first part of the string. (Single quotes were used around sample
because this python expression was inside a string built with double quotes.)
The double brackets around num
were reduced to single ones as part of the formatting operation. That's why we can then use this again in further formatting operations involving num
.
# Equivalently, without using 3.6 syntax:
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
filename_prefix = config["samples"][wildcards.sample])
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
# We could achieve the same by first extracting the value
# from the `config` dictionary
filename_prefix = config["samples"][wildcards.sample]
filename_template = f"{filename_prefix}_{{num}}.fastq.gz"
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
# Or, equivalently:
filename_prefix = config["samples"][wildcards.sample]
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
filename_prefix=filename_prefix)
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
# We can actually perform string formatting on several variables
# at the same time:
filename_prefix = config["samples"][wildcards.sample]
num = 1
"{filename_prefix}_{num}.fastq.gz".format(
filename_prefix=filename_prefix, num=num)
Output:
'fastq_files/SRR2920475_1.fastq.gz'
# Or, using 3.6 formatted strings
filename_prefix = config["samples"][wildcards.sample]
num = 1
f"{filename_prefix}_{num}.fastq.gz"
Output:
'fastq_files/SRR2920475_1.fastq.gz'
# We could therefore build the result of the expand in a single step:
[f"{config['samples'][wildcards.sample]}_{num}.fastq.gz" for num in [1, 2]]
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
The following is a bit complex, in terms of how Python will build the string:
input:
lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
But it should work, as we can see in the following simulation:
from collections import namedtuple
from snakemake.io import expand
Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
config = {"samples": {
"Corces2016_4983.7A_Mono": "SRR2920475",
"Corces2016_4983.7B_Mono": "SRR2920476"}}
expand(
f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz",
num=[1,2])
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
The problem in the trim_galore_pe
rule is actually in its output
section: You shouldn't use {wildcards.sample}
there, but just {sample}
.
The output
section of a rule is where you inform snakemake of what the wildcards attributes will be for a given instance of the rule, by matching the file it wants to obtain with the patterns given. The parts matching the curly braces will be used to set the values of the corresponding attribute name.
For instance, if snakemake wants a file called "trimmed_fastq_files/Corces2016_4983.7A_Mono_1_val_1.fq.gz"
, it will try to match this against all patterns present in all rule's output sections and eventually find this one: "trimmed_fastq_files/{sample}_1_val_1.fq.gz"
Luckily, it will be able to match the filename with the pattern by establishing a correspondence between Corces2016_4983.7A_Mono
and the {sample}
part. It will then put a sample
attribute in the local wildcards instance, a bit like if I was manually doing the following:
Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
I don't know what happens exactly in snakemake if you use {wildcards.sample}
instead of {wildcards}
, but let's try with my simulation framework:
Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(wildcards.sample="Corces2016_4983.7A_Mono")
File "<ipython-input-12-c02ce12bff85>", line 1
wildcards = Wildcards(wildcards.sample="Corces2016_4983.7A_Mono")
^
SyntaxError: keyword can't be an expression
What about your following attempt?
output:
expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2]),
Here, my understanding is that Python first tries to apply the f
string formatting on f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz"
.
To do that, it will need to be able to evaluate config['samples'][wildcards.sample]
, but the wildcards
object does not exist yet. Hence the wildcards not defined
.
wildcards
would be generated only after matching the name of a file needed by a "downstream" rule with a string containing {attribute_name}
patterns. But this is the string that snakemake is currently trying to build.
Here are some important points to remember:
wildcards
actually exist only locally, in an instance of a rule, after having matched its output with a file required by another "downstream" rule instance.wildcards
object. By default, the {attribute_name}
placeholders will be substituted by the attributes of the local wildcards
object ("{sample}"
becomes "Corces2016_4983.7A_Mono"
), but if you want to do more complicated stuff to build the file names, you need to do this via a function that will have to explicitly handle this wildcards
object (lambda wildcards: f"{wildcards.sample}"
becomes "Corces2016_4983.7A_Mono"
).If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With