ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

Tags: python, pytorch

I am not able to initialize the process group in PyTorch for a BERT model. I tried to initialize it with the following code:

import torch
import datetime

torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(0, 1800),
    world_size=0,
    rank=0,
    store=None,
    group_name=''
)

and then tried to call get_world_size():

num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
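To see what that division does, here is a small sketch with made-up numbers (none of these values come from the question): gradient accumulation reduces the optimizer steps per epoch, and distributed training then splits the remaining steps across processes.

```python
# Hypothetical values, for illustration only.
num_examples = 1000
train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 3
world_size = 4  # would come from torch.distributed.get_world_size()

# Same arithmetic as the training script:
num_train_optimization_steps = int(
    num_examples / train_batch_size / gradient_accumulation_steps) * num_train_epochs
num_train_optimization_steps //= world_size
print(num_train_optimization_steps)  # 46
```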

Full code:

train_examples = None
num_train_optimization_steps = None
if do_train:
    train_examples = processor.get_train_examples(data_dir)
    num_train_optimization_steps = int(
        len(train_examples) / train_batch_size / gradient_accumulation_steps) * num_train_epochs
    if local_rank != -1:
        import datetime
        torch.distributed.init_process_group(
            backend='nccl',
            init_method='env://',
            timeout=datetime.timedelta(0, 1800),
            world_size=0,
            rank=0,
            store=None,
            group_name=''
        )
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
        print(num_train_optimization_steps)
Asked Jan 24 '26 by Suraj Ghale

1 Answer

I solved the problem by referring to https://github.com/NVIDIA/apex/issues/99. Specifically, run the script through the distributed launcher:

python -m torch.distributed.launch xxx.py
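The launcher fixes the error because the env:// rendezvous reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment, and `torch.distributed.launch` sets them for each worker process. As an alternative sketch (not from the answer), you can set those variables yourself before calling init_process_group; this single-process example uses the gloo backend so it runs without a GPU, and uses world_size=1 since a world size of 0 is invalid:

```python
import os
import datetime
import torch.distributed as dist

# env:// rendezvous reads these variables; a launcher normally sets them.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# gloo works on CPU; nccl requires CUDA devices.
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=datetime.timedelta(seconds=1800),
)

ws = dist.get_world_size()
print(ws)  # 1 in this single-process sketch

dist.destroy_process_group()
```

With multiple processes, each one would get a distinct RANK while MASTER_ADDR and MASTER_PORT stay the same, which is exactly what the launcher automates.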
Answered Jan 25 '26 by Dandelion


