I am training the same model on two different machines, but the trained models are not identical. I have taken the following measures to ensure reproducibility:
# set random number seeds
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

random.seed(0)
torch.cuda.manual_seed(0)
np.random.seed(0)
# set cuDNN to deterministic, non-benchmarking mode
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
# set the number of DataLoader worker processes to 0
DataLoader(dataset, num_workers=0)
When I train the same model multiple times on the same machine, the trained model is always the same. However, the trained models on two different machines are not the same. Is this normal? Are there any other tricks I can employ?
There are a number of other areas that could additionally introduce randomness, e.g.:
PyTorch random number generator
You can use torch.manual_seed() to seed the RNG for all devices (both CPU and CUDA):
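For example, a single call is enough to seed the RNG on the CPU and all CUDA devices, which your snippet above does not include (it only calls torch.cuda.manual_seed):

import torch
torch.manual_seed(0)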
CUDA convolution determinism
While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run, that algorithm itself may be nondeterministic, unless either torch.use_deterministic_algorithms(True) or torch.backends.cudnn.deterministic = True is set. The latter setting controls only this behavior, unlike torch.use_deterministic_algorithms(), which will make other PyTorch operations behave deterministically, too.
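A minimal sketch of the two options; note that torch.use_deterministic_algorithms(True) will raise an error for operations that have no deterministic implementation, and for some CUDA operations it additionally requires the CUBLAS_WORKSPACE_CONFIG environment variable to be set:

import torch

# Option 1: make only cuDNN convolutions deterministic
torch.backends.cudnn.deterministic = True

# Option 2: ask PyTorch to use deterministic algorithms everywhere it can
# (errors out on ops that only have nondeterministic implementations)
torch.use_deterministic_algorithms(True)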
CUDA RNN and LSTM
In some versions of CUDA, RNNs and LSTM networks may have non-deterministic behavior. See torch.nn.RNN() and torch.nn.LSTM() for details and workarounds.
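As an illustration of the workaround described in those docs (the exact environment variable and value depend on your CUDA version; the values below are the ones documented for CUDA 10.2 and later), the cuBLAS workspace can be pinned before any CUDA work starts:

import os
# Must be set before CUDA is initialized in the process.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"  # or ":16:8"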
DataLoader
DataLoader will reseed workers following the Randomness in multi-process data loading algorithm. Use worker_init_fn() and a seeded generator to preserve reproducibility:
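A sketch along the lines of the official reproducibility example: seed_worker and the generator g are the essential pieces, while dataset, batch_size, and num_workers are placeholders for your own values.

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive a per-worker seed from PyTorch's seed so that numpy and
    # Python's random module are seeded consistently in every worker.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

loader = DataLoader(
    dataset,                     # your dataset
    batch_size=16,
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=g,
)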