 

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation when running Snakemake


I am running a Snakemake pipeline on an HPC cluster that uses Slurm. The pipeline is rather long, consisting of ~22 steps. Periodically, Snakemake encounters a problem when attempting to submit a job. This results in the error

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
Error submitting jobscript (exit code 1):

I run the pipeline via an sbatch file with the following Snakemake call:

snakemake -j 999 -p --cluster-config cluster.json --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}' 
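For context, the `cluster.json` file read by `--cluster-config` is a plain JSON file whose keys are referenced by the `{cluster.xxx}` placeholders in the command above. A minimal sketch, assuming the conventional `__default__` section; all values here are illustrative placeholders, not from the original post:

```json
{
    "__default__": {
        "account": "my_account",
        "job-name": "smk_{rule}",
        "ntasks-per-node": 1,
        "mem": "8G",
        "partition": "normal",
        "time": "01:00:00",
        "mail-user": "user@example.com",
        "mail-type": "FAIL",
        "error": "logs/{rule}.err",
        "output": "logs/{rule}.out"
    }
}
```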

This produces output not only for the Snakemake sbatch job itself, but also for each of the jobs Snakemake submits. The above error appears in the slurm.out of the top-level sbatch file.

The specific job step the error points to actually runs successfully and produces its output, but the pipeline still fails. The logs for that job id show it ran without a problem. I have googled this error, and it appears to happen often with Slurm, especially when the scheduler is under heavy I/O load, which suggests it will be a regular and unavoidable occurrence. I was hoping someone has encountered this problem and could suggest a workaround, so that the entire pipeline doesn't fail.

asked Oct 23 '19 by Manninm

1 Answer

Snakemake has the options `--max-jobs-per-second` and `--max-status-checks-per-second`, each with a default of 10. Maybe try decreasing them to reduce strain on the scheduler? Also, maybe try reducing `-j 999`?
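Beyond throttling, since the timeouts described in the question are transient (the job actually gets submitted and runs), another workaround sometimes used at Slurm sites is to wrap `sbatch` in a small retry script and pass that wrapper to `--cluster` instead of calling `sbatch` directly. A minimal sketch, assuming bash; the function and variable names (`retry_submit`, `RETRY_DELAY`, `MAX_RETRIES`) are illustrative, not Snakemake or Slurm API:

```shell
# Hypothetical retry wrapper around sbatch: retries a submission that
# fails (e.g. "Socket timed out on send/recv operation") instead of
# letting the whole pipeline die on one transient scheduler hiccup.

RETRY_DELAY="${RETRY_DELAY:-30}"    # seconds to wait between attempts
MAX_RETRIES="${MAX_RETRIES:-5}"     # give up after this many attempts

retry_submit() {
    local attempt output
    for attempt in $(seq 1 "$MAX_RETRIES"); do
        # Capture stdout+stderr; the assignment's exit status is sbatch's.
        if output=$(sbatch "$@" 2>&1); then
            echo "$output"          # Snakemake reads the job id from stdout
            return 0
        fi
        echo "sbatch attempt $attempt failed: $output" >&2
        sleep "$RETRY_DELAY"
    done
    echo "giving up after $MAX_RETRIES attempts" >&2
    return 1
}
```

Saving this as an executable script and substituting it for `sbatch` in the `--cluster` string would retry a timed-out submission a few times before failing. One caveat: if the submission actually succeeded on the scheduler side but only the reply timed out, a retry could submit the job twice, so this is a sketch to adapt rather than a drop-in fix.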

answered Sep 28 '22 by dariober