Is Snakemake params function evaluated before input file existence?

Consider this snakefile:

def rdf(fn):
    f = open(fn, "rt")
    t = f.readlines()
    f.close()
    return t

rule a:
    output: "test.txt"
    input: "test.dat"
    params: X=lambda wildcards, input, output, threads, resources: rdf(input[0])
    message: "X is {params.X}"
    shell: "cp {input} {output}"

rule b:
    output: "test.dat"
    shell: "echo 'hello world' >{output}"

When this is run and neither test.txt nor test.dat exists, it fails with this error:

InputFunctionException in line 7 of /Users/tedtoal/Documents/BioinformaticsConsulting/Mars/Cacao/Pipeline/SnakeMake/t2:
FileNotFoundError: [Errno 2] No such file or directory: 'test.dat'

However, if test.dat exists, it runs fine. Why?

I would have expected params not to be evaluated until Snakemake was ready to run rule 'a'. Instead, it apparently calls the params function (and therefore rdf() above) during the DAG phase, before rule 'a' runs. And yet the following works, even when test.dat does not exist initially:

import os

def rdf(fn):
    if not os.path.exists(fn): return ""
    f = open(fn, "rt")
    t = f.readlines()
    f.close()
    return t

rule a:
    output: "test.txt"
    input: "test.dat"
    params: X=lambda wildcards, input, output, threads, resources: rdf(input[0])
    message: "X is {params.X}"
    shell: "cp {input} {output}"

rule b:
    output: "test.dat"
    shell: "echo 'hello world' >{output}"

This implies that the params are evaluated twice, once during DAG phase and once during rule execution phase. Why?
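To make that inference concrete, here is a minimal plain-Python sketch of the two evaluations. It is only a simulation of the behaviour I am inferring, not Snakemake's actual scheduler: `simulate_two_phase` is a hypothetical helper, and the file write stands in for rule 'b'.

```python
import os
import tempfile


def rdf(fn):
    """Return the lines of fn, or "" while the file does not exist yet."""
    if not os.path.exists(fn):
        return ""
    with open(fn, "rt") as f:
        return f.readlines()


def simulate_two_phase(params_func, path):
    """Hypothetical stand-in for the two evaluations; not Snakemake code."""
    planned = params_func(path)      # "DAG phase": input has not been created yet
    with open(path, "wt") as f:      # stands in for rule 'b' producing test.dat
        f.write("hello world\n")
    executed = params_func(path)     # "execution phase": input now exists
    return planned, executed


with tempfile.TemporaryDirectory() as tmp:
    planned, executed = simulate_two_phase(rdf, os.path.join(tmp, "test.dat"))

print(repr(planned), executed)
```

If params really are evaluated in both phases, the first call sees a missing file and only the second call sees the real contents, which would match what I observe.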

This is a problem for me. I need to read data from one of the rule's input files in order to build the arguments for the program being executed. The command does not receive the input filename itself; instead, it gets arguments derived from the contents of the input file. I can handle it as above, but this seems kludgy, and I wonder whether this is a bug or I'm missing something.

asked Oct 08 '17 22:10 by tedtoal


1 Answer

I had the same issue. In my case, I could circumvent the problem by letting the function return a placeholder default when running on non-existing files.

For example, I have a rule which needs to know the number of lines in one of its input files ahead of time. Therefore, I used:

from pathlib import Path

def count_lines(bed):
    # This is necessary because, in a dry-run, snakemake will evaluate the 'params'
    # directive on the (potentially non-existent) input files.
    if not Path(bed).exists():
        return -1

    total = 0
    with open(bed) as f:
        for line in f:
            total += 1
    return total

rule subsample_background:
    input:
        one = "raw/{A}/file.txt",
        two = "raw/{B}/file.txt"
    output:
        "processed/some_output.txt"
    params:
        n = lambda wildcards, input: count_lines(input.one)
    shell:
        "run.sh -n {params.n} {input.two} > {output}"

In a dry-run, the placeholder -1 is substituted, which lets the dry-run "complete" successfully. In a real run, the input file exists by the time the rule executes, so the function returns the actual line count.
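If several params functions need the same guard, the check can be factored out. The following is a minimal sketch in plain Python, not part of Snakemake; `value_or_placeholder` is a hypothetical helper name, and the default of -1 is just the placeholder I happened to use above.

```python
from pathlib import Path
import tempfile


def value_or_placeholder(path, compute, placeholder=-1):
    """Return compute(path) if path exists, otherwise the placeholder."""
    p = Path(path)
    if not p.exists():
        # The input has not been produced yet (e.g. during a dry-run).
        return placeholder
    return compute(p)


def count_lines(path):
    """Count the lines of an existing text file."""
    with open(path) as f:
        return sum(1 for _ in f)


with tempfile.TemporaryDirectory() as tmp:
    bed = Path(tmp) / "file.txt"
    before = value_or_placeholder(bed, count_lines)  # file is still missing
    bed.write_text("a\nb\nc\n")
    after = value_or_placeholder(bed, count_lines)   # file now exists

print(before, after)
```

In a rule this would presumably be used as `n = lambda wildcards, input: value_or_placeholder(input.one, count_lines)`. It is worth picking a placeholder the real command would reject, so that a placeholder leaking into an actual run fails loudly rather than silently.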

answered Oct 05 '22 04:10 by Scholar