I have a cluster of 4 nodes with 64 cores each. I installed Slurm and it seems to be working: if I call sbatch, I get the proper allocation and the job is queued. However, if I use more than 64 cores (so more than one node), Slurm allocates the correct number of nodes, but when I ssh into them I only see actual work on one of them. The rest just sit there doing nothing.
My code is complex and uses multiprocessing. I call pools with around 300 workers, so I guess that should not be the problem.
What I would like to achieve is to call sbatch myscript.py on, say, 200 cores and have Slurm distribute my run across those 200 cores, instead of just allocating the correct number of nodes while actually using only one.
The header of my Python script looks like this:
#!/usr/bin/python3
#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200
and I call the script with sbatch myscript.py.
Slurm is a job scheduler that manages cluster resources. It is what allows you to run a job on the cluster without worrying about finding a free node. It also tracks resource usage so nodes aren't overloaded by having too many jobs running on them at once.
You can get the status of the running slurmd daemon by executing the command "scontrol show slurmd" on the node of interest. Check the value of "Last slurmctld msg time" to determine if the slurmctld is able to communicate with the slurmd.
Slurm processes are not run under a shell, but directly exec'ed by the slurmd daemon (assuming srun is used to launch the processes).
--ntasks-per-node=<ntasks> - Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option.
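For example, with the four 64-core nodes from the question, a header along the following lines (a sketch that reuses the values from the question) would request 200 tasks laid out as 50 per node:
#!/usr/bin/python3
#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=50
This only controls how the allocation is laid out; by itself it does not make a multiprocessing-based script use the extra nodes.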
Unfortunately, multiprocessing does not allow working on several nodes. From the documentation:
the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine
One option, often used with Slurm, is to use MPI (with the MPI4PY package), but MPI is considered to be 'the assembly language of parallel programming' and you will need to modify your code extensively.
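As a rough sketch of what the MPI route looks like (assuming mpi4py is installed and the script is started with srun so that one copy runs per task; work_items and the squaring step are placeholders for your own data and processing):
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # index of this task, 0 .. size-1
size = comm.Get_size()   # total number of tasks started by srun

# Placeholder work list; replace with your own inputs.
work_items = list(range(1000))

# Each rank processes a disjoint slice of the work.
my_items = work_items[rank::size]
my_results = [item * item for item in my_items]   # stand-in for real processing

# Collect the per-rank results on rank 0.
all_results = comm.gather(my_results, root=0)
if rank == 0:
    print("processed", sum(len(r) for r in all_results), "items")
Launched from the batch script with something like srun python3 myscript.py, Slurm starts one copy of the script per task, across all allocated nodes.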
Another option is to look into the Parallel Processing packages for one that suits your needs and requires minimal changes to your code. See also this other question for more insights.
A final note: it is perfectly fine to put the #SBATCH directives in the Python script and use the Python shebang. But as Slurm executes a copy of the script rather than the script itself, you must add a line such as
sys.path.append(os.getcwd())
at the beginning of the script (but after the #SBATCH lines) to make sure Python finds any module located in your directory.
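Put together, the top of such a script could look like this (a sketch based on the header from the question):
#!/usr/bin/python3
#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200

import os
import sys

# Slurm runs a copy of this file from its spool directory, so put the
# submission directory back on the module search path.
sys.path.append(os.getcwd())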
I think your sbatch script should not be inside the Python script. Rather, it should be a normal bash script that includes the #SBATCH options, followed by the actual script to run with srun, like the following:
#!/usr/bin/bash
#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200
srun python3 myscript.py
I suggest testing this with a simple Python script like this:
import multiprocessing as mp

def main():
    print("cpus =", mp.cpu_count())

if __name__ == "__main__":
    main()
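Note that mp.cpu_count() reports the number of cores physically present on the node, not the number of cores Slurm granted to the job. A slightly extended test (a sketch; it assumes the script is launched with srun so that Slurm's environment variables are set) also prints which task each copy is and how many cores it may actually use:
import multiprocessing as mp
import os

def main():
    # Rank of this task among all tasks started by srun (0, 1, ...).
    proc_id = os.environ.get("SLURM_PROCID", "not set")
    # Total number of tasks requested with -n / --ntasks.
    n_tasks = os.environ.get("SLURM_NTASKS", "not set")
    print("task", proc_id, "of", n_tasks, "on", os.uname().nodename,
          "- cores on node:", mp.cpu_count(),
          "- cores usable by this task:", len(os.sched_getaffinity(0)))

if __name__ == "__main__":
    main()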
I tried to get around using different Python libraries by using srun on the following bash script. srun should run the script on each node allocated to the job. The basic idea is that the script determines which node it is running on and assigns it a node ID of 0, 1, ..., nnodes-1. It then passes that information to the Python program along with a thread ID; in the program I combine these two numbers to make a distinct ID for each CPU on each node (a sketch of that Python side follows after the script). This code assumes that there are 16 cores on each node and that 10 nodes are going to be used.
#!/bin/bash
# Hostnames of the nodes allocated to this job, one entry per node.
nnames=(`scontrol show hostnames`)
nnodes=${#nnames[@]}
nIDs=`seq 0 $(($nnodes-1))`

# Work out which allocated node this copy of the script is running on.
nID=0
for i in $nIDs
do
  hname=`hostname`
  if [ "${nnames[$i]}" == "$hname" ]
  then nID=$i
  fi
done

# Start one Python worker per core (16 cores per node assumed), passing the
# node ID, the thread ID and the total worker count (160) to each process.
tIDs=`seq 0 15`
for tID in $tIDs
do
  python testDataFitting2.py $nID $tID 160 &
done
wait
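The Python program itself is not shown above; a minimal sketch of how the two arguments can be combined into a distinct worker ID (keeping the assumption of 16 cores per node and the hypothetical file name testDataFitting2.py) could look like this:
import sys

def main():
    # Arguments passed in by the bash script above.
    node_id = int(sys.argv[1])    # 0 .. nnodes-1
    thread_id = int(sys.argv[2])  # 0 .. 15
    total = int(sys.argv[3])      # total number of workers, e.g. 160

    # One distinct ID per core across all nodes, assuming 16 cores per node.
    worker_id = node_id * 16 + thread_id

    # Each worker would then handle the part of the job that matches its ID,
    # e.g. items worker_id, worker_id + total, worker_id + 2*total, ...
    print("worker", worker_id, "of", total)

if __name__ == "__main__":
    main()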