
In distributed computing, what are world size and rank?

I've been reading through some documentation and example code with the end goal of writing scripts for distributed computing (running PyTorch), but the concepts confuse me.

Let's assume that we have a single node with 4 GPUs, and we want to run our script on those 4 GPUs (i.e. one process per GPU). In such a scenario, what are the world size and rank? I often find this explanation for world size: the total number of processes involved in the job, so I assume that it is four in our example, but what about rank?

To explain it further, another example with multiple nodes and multiple GPUs could be useful, too.

asked Oct 07 '19 by Bram Vanroy

2 Answers

These concepts are related to parallel computing. It would be helpful to learn a little about parallel computing, e.g., MPI.

You can think of the world as a group containing all the processes for your distributed training. Usually, each GPU corresponds to one process. Processes in the world can communicate with each other, which is why you can train your model in a distributed fashion and still get the correct gradient update. So world size is the number of processes in your training, which is usually the number of GPUs you are using for distributed training.

Rank is the unique ID given to a process, so that other processes know how to identify a particular process. Local rank is a unique local ID for processes running on a single node; this is where my view differs from @zihaozhihao's.

Let's take a concrete example. Suppose we run our training in 2 servers or nodes and each with 4 GPUs. The world size is 4*2=8. The ranks for the processes will be [0, 1, 2, 3, 4, 5, 6, 7]. In each node, the local rank will be [0, 1, 2, 3].
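The mapping in this example can be sketched in plain Python (no PyTorch needed; the function name and constants are illustrative, not part of any API):

```python
# Illustrative setup: 2 nodes, each with 4 GPUs (one process per GPU).
NODES = 2
GPUS_PER_NODE = 4

world_size = NODES * GPUS_PER_NODE  # 8 processes in total

def global_rank(node_idx, local_rank):
    """Global rank of the process with the given local rank on a node."""
    return node_idx * GPUS_PER_NODE + local_rank

ranks = [global_rank(n, l) for n in range(NODES) for l in range(GPUS_PER_NODE)]
local_ranks = [r % GPUS_PER_NODE for r in ranks]

print(world_size)   # 8
print(ranks)        # [0, 1, 2, 3, 4, 5, 6, 7]
print(local_ranks)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

So the global rank is unique across the whole job, while the local rank repeats on every node.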

I have also written a post about MPI collectives and basic concepts. The link is here.

answered Nov 15 '22 by jdhao

When I was learning torch.distributed, I was also confused by those terms. The following is based on my own understanding and the API documents, so please correct me if I'm wrong.

I think group should be understood correctly first. It can be thought of as a "group of processes" or "world", and one job usually corresponds to one group. world_size is the number of processes in this group, which is also the number of processes participating in the job. rank is a unique id for each process in the group.

So in your example, world_size is 4 and rank for the processes is [0,1,2,3].

Sometimes we may also have a local_rank argument; it means the GPU id inside one process. For example, rank=1 and local_rank=1 means the second GPU in the second process.
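In practice, launchers such as `torchrun` pass these values to every spawned process through environment variables. A minimal sketch of reading them, with fallbacks so the script also runs as a single process (the fallback values are an assumption for local runs, not something the launcher guarantees):

```python
import os

# torchrun and similar launchers set RANK, LOCAL_RANK, and WORLD_SIZE
# for each process; fall back to a single-process default otherwise.
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

print(f"process {rank} of {world_size}, local GPU index {local_rank}")
```

With the 2-node, 4-GPU example above, the process on the second node using its third GPU would see RANK=6, LOCAL_RANK=2, and WORLD_SIZE=8.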

answered Nov 15 '22 by zihaozhihao