Suppose I have 2 machines with 4 GPUs each, and each instance of the training algorithm requires 2 GPUs. I would like to run 4 processes, 2 on each machine, each using 2 GPUs.
How can I make each process retrieve the number of local processes running on the same machine?
I can detect the world size with torch.distributed.get_world_size() and the global rank with torch.distributed.get_rank().
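For context, this is roughly what I can already do; a minimal sketch, assuming the process group is initialised with the usual environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE):

import torch.distributed as dist

# assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in the environment
dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()  # 4 in my setup (2 machines x 2 processes)
rank = dist.get_rank()              # global rank, 0..3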
But since I would prefer not to hard-code parameters, is there a way to detect that 2 processes are running on each node? This would let me assign GPUs to each process evenly.
Example: suppose I know that a machine has 4 GPUs and that 2 processes are running on it. I would assign GPUs [0, 1] to the process with local rank 0 and GPUs [2, 3] to the process with local rank 1 (see the sketch below). I know the total number of processes, but I cannot tell which of them share a machine, so I cannot decide how many GPUs each is allowed to use.
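To make the goal concrete, this is roughly the assignment logic I have in mind, where local_rank and local_world_size are the two values I do not know how to obtain:

import torch

def assign_gpus(local_rank, local_world_size):
    # split the GPUs of this machine evenly among the local processes
    ngpus = torch.cuda.device_count()          # 4 on each of my machines
    gpus_per_proc = ngpus // local_world_size  # 2 in my setup
    start = local_rank * gpus_per_proc
    return list(range(start, start + gpus_per_proc))

# with 4 GPUs and 2 local processes:
# assign_gpus(0, 2) -> [0, 1]
# assign_gpus(1, 2) -> [2, 3]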
What I am missing is something like a torch.distributed.get_local_world_size() function.
torch.cuda.device_count() tells you how many GPUs are available on each machine, which is most of what you need to split them between processes. If you also need to know how many processes share a machine (and which local rank each one has), plain MPI can help:
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # global rank, 0..3 in your example
# processes on the same machine end up in the same shared-memory sub-communicator
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
local_rank = local_comm.Get_rank()        # rank within this machine
local_world_size = local_comm.Get_size()  # number of processes on this machine
ngpus = torch.cuda.device_count()         # GPUs visible on this machine
print(ngpus, "GPUs and", local_world_size, "processes on this machine")
but depending on your setup it may be enough just to call torch.cuda.device_count() and avoid the extra dependency. I am pretty new here, so please let me know how this answer can be improved.
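Putting the two together, here is a rough sketch of the assignment the question describes; local_rank and local_world_size come from the MPI sub-communicator above, and the variable names are just for illustration:

import torch
from mpi4py import MPI

local_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
local_rank = local_comm.Get_rank()
local_world_size = local_comm.Get_size()

ngpus = torch.cuda.device_count()          # 4 per machine in the question
gpus_per_proc = ngpus // local_world_size  # 2
my_gpus = list(range(local_rank * gpus_per_proc, (local_rank + 1) * gpus_per_proc))
torch.cuda.set_device(my_gpus[0])          # make the first assigned GPU the default
print("local rank", local_rank, "-> GPUs", my_gpus)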