I am running a snakemake pipeline on an HPC that uses slurm. The pipeline is rather long, consisting of ~22 steps. Periodically, snakemake will encounter a problem when attempting to submit a job, which results in the error
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
Error submitting jobscript (exit code 1):
I run the pipeline via an sbatch file with the following snakemake call:
snakemake -j 999 -p --cluster-config cluster.json --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'
This produces output not only for the snakemake sbatch job itself, but also for each of the jobs that snakemake submits. The above error appears in the slurm.out of the sbatch file.
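For context, cluster.json is laid out roughly like this (the values below are placeholders rather than my actual settings):

{
    "__default__" :
    {
        "account"         : "my_account",
        "job-name"        : "pipeline_job",
        "ntasks-per-node" : 1,
        "mem"             : "4G",
        "partition"       : "normal",
        "time"            : "01:00:00",
        "mail-user"       : "user@example.com",
        "mail-type"       : "FAIL",
        "error"           : "logs/job.err",
        "output"          : "logs/job.out"
    }
}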
The specific job step indicated by the error actually runs successfully and produces its output, but the pipeline still fails. The logs for that job step show that the job ID ran without a problem. I have googled this error, and it appears to be common with slurm, especially when the scheduler is under heavy load, which suggests it will be a regular and more or less unavoidable occurrence. I was hoping someone has encountered this problem and could suggest a workaround, so that the entire pipeline doesn't fail.
snakemake has the options --max-jobs-per-second and --max-status-checks-per-second, both with a default value of 10. Maybe try decreasing them to reduce the strain on the scheduler? Also, maybe try reducing -j 999.
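For example, something like the call below, where the specific values (1 per second, -j 100) are just a starting point to experiment with rather than recommended settings:

snakemake -j 100 -p \
    --max-jobs-per-second 1 \
    --max-status-checks-per-second 1 \
    --cluster-config cluster.json \
    --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'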