I've seen multiple issue about the:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
but none seem to fix it for me:
I've tried to do torch.cuda.set_device(device)
manually at the beginning of every script. That didn't seem to work for me. I've tried different GPUS. I've tried downgrading pytorch version and cuda version. Different combinations of 1.6.0, 1.7.1, 1.8.0 and cuda 10.2, 11.0, 11.1. I am unsure what else to do. What did people do to solve this issue?
very related perhaps?
More complete error message:
('jobid', 4852)
('slurm_jobid', -1)
('slurm_array_task_id', -1)
('condor_jobid', 4852)
('current_time', 'Mar25_16-27-35')
('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb'))
('gpu_name', 'GeForce GTX TITAN X')
('PID', '30688')
torch.cuda.device_count()=2
opts.world_size=2
ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
rank=0
mp.current_process()=<SpawnProcess name='SpawnProcess-1' parent=30688 started>
os.getpid()=30704
setting up rank=0 (with world_size=2)
MASTER_ADDR='127.0.0.1'
59264
backend='nccl'
--> done setting up rank=0
setup process done for rank=0
Traceback (most recent call last):
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
main_distributed()
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 212, in train
tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
I still have errors:
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1423, in <module>
main()
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1365, in main
train(args=args)
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1385, in train
args.opt = move_opt_to_cherry_opt_and_sync_params(args) if is_running_parallel(args.rank) else args.opt
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/distributed.py", line 456, in move_opt_to_cherry_opt_and_sync_params
args.opt = cherry.optim.Distributed(args.model.parameters(), opt=args.opt, sync=syn)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 62, in __init__
self.sync_parameters()
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 78, in sync_parameters
dist.broadcast(p.data, src=root)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
one of the answers suggested to have nvcca & pytorch.version.cuda to match but they do not:
(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)"
11.1
(meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
How do I match them?
I had the right cuda installed meaning:
python -c "import torch;print(torch.version.cuda)"
#was equal to
nvcc -V
and
ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
was giving out some version of nccl (e.g., 2.10.3 )
The fix was to remove nccl:
sudo apt remove libnccl2 libnccl-dev
then the libnccl version check was not giving any version, but ddp training was working fine!
This is not a very satisfactory answer but this seems to be what ended up working for me. I simply used pytorch 1.7.1 and it's cuda version 10.2. As long as cuda 11.0 is loaded it seems to be working. To install that version do:
conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=10.2 -c pytorch -c conda-forge
if your are in an HPC do module avail
to make sure the right cuda version is loaded. Perhaps you need to source bash and other things for the submission job to work. My setup looks as follows:
#!/bin/bash
echo JOB STARTED
# a submission job is usually empty and has the root of the submission so you probably need your HOME env var
export HOME=/home/miranda9
# to have modules work and the conda command work
source /etc/bashrc
source /etc/profile
source /etc/profile.d/modules.sh
source ~/.bashrc
source ~/.bash_profile
conda activate metalearningpy1.7.1c10.2
#conda activate metalearning1.7.1c11.1
#conda activate metalearning11.1
#module load cuda-toolkit/10.2
module load cuda-toolkit/11.1
#nvidia-smi
nvcc --version
#conda list
hostname
echo $PATH
which python
# - run script
python -u ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py
I also echo other useful things like the nvcc version to make sure load worked (note the top of nvidia-smi doesn't show the right cuda version).
Note I think this is probably just a bug since cuda 11.1 + pytorch 1.8.1 are new as of this writing. I did try
torch.cuda.set_device(opts.gpu) # https://github.com/pytorch/pytorch/issues/54550
but I can't say that it always works or why it doesn't. I do have it in my current code but I think I still get error with pytorch 1.8.x + cuda 11.x.
see my conda list in case it helps:
$ conda list
# packages in environment at /home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
absl-py 0.12.0 py38h06a4308_0
aioconsole 0.3.1 pypi_0 pypi
aiohttp 3.7.4 py38h27cfd23_1
anatome 0.0.1 pypi_0 pypi
argcomplete 1.12.2 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
async-timeout 3.0.1 py38h06a4308_0
attrs 20.3.0 pyhd3eb1b0_0
beautifulsoup4 4.9.3 pyha847dfd_0
blas 1.0 mkl
blinker 1.4 py38h06a4308_0
boto 2.49.0 pypi_0 pypi
brotlipy 0.7.0 py38h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
c-ares 1.17.1 h27cfd23_0
ca-certificates 2021.1.19 h06a4308_1
cachetools 4.2.1 pyhd3eb1b0_0
cairo 1.14.12 h8948797_3
certifi 2020.12.5 py38h06a4308_0
cffi 1.14.0 py38h2e261b9_0
chardet 3.0.4 py38h06a4308_1003
click 7.1.2 pyhd3eb1b0_0
cloudpickle 1.6.0 pypi_0 pypi
conda 4.9.2 py38h06a4308_0
conda-build 3.21.4 py38h06a4308_0
conda-package-handling 1.7.2 py38h03888b9_0
coverage 5.5 py38h27cfd23_2
crcmod 1.7 pypi_0 pypi
cryptography 3.4.7 py38hd23ed53_0
cudatoolkit 10.2.89 hfd86e86_1
cycler 0.10.0 py38_0
cython 0.29.22 py38h2531618_0
dbus 1.13.18 hb2f20db_0
decorator 5.0.3 pyhd3eb1b0_0
dgl-cuda10.2 0.6.0post1 py38_0 dglteam
dill 0.3.3 pyhd3eb1b0_0
expat 2.3.0 h2531618_2
fasteners 0.16 pypi_0 pypi
filelock 3.0.12 pyhd3eb1b0_1
flatbuffers 1.12 pypi_0 pypi
fontconfig 2.13.1 h6c09931_0
freetype 2.10.4 h7ca028e_0 conda-forge
fribidi 1.0.10 h7b6447c_0
future 0.18.2 pypi_0 pypi
gast 0.3.3 pypi_0 pypi
gcs-oauth2-boto-plugin 2.7 pypi_0 pypi
glib 2.63.1 h5a9c865_0
glob2 0.7 pyhd3eb1b0_0
google-apitools 0.5.31 pypi_0 pypi
google-auth 1.28.0 pyhd3eb1b0_0
google-auth-oauthlib 0.4.3 pyhd3eb1b0_0
google-pasta 0.2.0 pypi_0 pypi
google-reauth 0.1.1 pypi_0 pypi
graphite2 1.3.14 h23475e2_0
graphviz 2.40.1 h21bd128_2
grpcio 1.32.0 pypi_0 pypi
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb453b48_1
gsutil 4.60 pypi_0 pypi
gym 0.18.0 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
harfbuzz 1.8.8 hffaf4a1_0
higher 0.2.1 pypi_0 pypi
httplib2 0.19.0 pypi_0 pypi
icu 58.2 he6710b0_3
idna 2.10 pyhd3eb1b0_0
importlib-metadata 3.7.3 py38h06a4308_1
intel-openmp 2020.2 254
jinja2 2.11.3 pyhd3eb1b0_0
joblib 1.0.1 pyhd3eb1b0_0
jpeg 9b h024ee3a_2
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.3.1 py38h2531618_0
lark-parser 0.6.5 pypi_0 pypi
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
learn2learn 0.1.5 pypi_0 pypi
libarchive 3.4.2 h62408e4_0
libffi 3.2.1 hf484d3e_1007
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
liblief 0.10.1 he6710b0_0
libpng 1.6.37 h21135ba_2 conda-forge
libprotobuf 3.14.0 h8c45485_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libuuid 1.0.3 h1bed415_2
libuv 1.40.0 h7b6447c_0
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 hb55368b_3
lmdb 0.94 pypi_0 pypi
lz4-c 1.9.2 he1b5a44_3 conda-forge
markdown 3.3.4 py38h06a4308_0
markupsafe 1.1.1 py38h7b6447c_0
matplotlib 3.3.4 py38h06a4308_0
matplotlib-base 3.3.4 py38h62a2d02_0
memory-profiler 0.58.0 pypi_0 pypi
mkl 2020.2 256
mkl-service 2.3.0 py38h1e0a361_2 conda-forge
mkl_fft 1.3.0 py38h54f3939_0
mkl_random 1.2.0 py38hc5bc63f_1 conda-forge
mock 2.0.0 pypi_0 pypi
monotonic 1.5 pypi_0 pypi
multidict 5.1.0 py38h27cfd23_2
ncurses 6.2 he6710b0_1
networkx 2.5 py_0
ninja 1.10.2 py38hff7bd54_0
numpy 1.19.2 py38h54aff64_0
numpy-base 1.19.2 py38hfa32c7d_0
oauth2client 4.1.3 pypi_0 pypi
oauthlib 3.1.0 py_0
olefile 0.46 pyh9f0ad1d_1 conda-forge
openssl 1.1.1k h27cfd23_0
opt-einsum 3.3.0 pypi_0 pypi
ordered-set 4.0.2 pypi_0 pypi
pandas 1.2.3 py38ha9443f7_0
pango 1.42.4 h049681c_0
patchelf 0.12 h2531618_1
pbr 5.5.1 pypi_0 pypi
pcre 8.44 he6710b0_0
pexpect 4.6.0 pypi_0 pypi
pillow 7.2.0 pypi_0 pypi
pip 21.0.1 py38h06a4308_0
pixman 0.40.0 h7b6447c_0
pkginfo 1.7.0 py38h06a4308_0
progressbar2 3.39.3 pypi_0 pypi
protobuf 3.14.0 py38h2531618_1
psutil 5.8.0 py38h27cfd23_1
ptyprocess 0.7.0 pypi_0 pypi
py-lief 0.10.1 py38h403a769_0
pyasn1 0.4.8 py_0
pyasn1-modules 0.2.8 py_0
pycapnp 1.0.0 pypi_0 pypi
pycosat 0.6.3 py38h7b6447c_1
pycparser 2.20 py_2
pyglet 1.5.0 pypi_0 pypi
pyjwt 1.7.1 py38_0
pyopenssl 20.0.1 pyhd3eb1b0_1
pyparsing 2.4.7 pyhd3eb1b0_0
pyqt 5.9.2 py38h05f1152_4
pysocks 1.7.1 py38h06a4308_0
python 3.8.2 hcf32534_0
python-dateutil 2.8.1 pyhd3eb1b0_0
python-libarchive-c 2.9 pyhd3eb1b0_0
python-utils 2.5.6 pypi_0 pypi
python_abi 3.8 1_cp38 conda-forge
pytorch 1.7.1 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch
pytz 2021.1 pyhd3eb1b0_0
pyu2f 0.1.5 pypi_0 pypi
pyyaml 5.4.1 py38h27cfd23_1
qt 5.9.7 h5867ecd_1
readline 8.1 h27cfd23_0
requests 2.25.1 pyhd3eb1b0_0
requests-oauthlib 1.3.0 py_0
retry-decorator 1.1.1 pypi_0 pypi
ripgrep 12.1.1 0
rsa 4.7.2 pyhd3eb1b0_1
ruamel_yaml 0.15.100 py38h27cfd23_0
scikit-learn 0.24.1 py38ha9443f7_0
scipy 1.6.2 py38h91f5cce_0
setuptools 52.0.0 py38h06a4308_0
sexpdata 0.0.3 pypi_0 pypi
sip 4.19.13 py38he6710b0_0
six 1.15.0 pyh9f0ad1d_0 conda-forge
soupsieve 2.2.1 pyhd3eb1b0_0
sqlite 3.35.2 hdfb4753_0
tensorboard 2.4.0 pyhc547734_0
tensorboard-plugin-wit 1.6.0 py_0
tensorflow 2.4.1 pypi_0 pypi
tensorflow-estimator 2.4.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
threadpoolctl 2.1.0 pyh5ca1d4c_0
tk 8.6.10 hbc83047_0
torchaudio 0.7.2 py38 pytorch
torchmeta 1.7.0 pypi_0 pypi
torchtext 0.8.1 py38 pytorch
torchvision 0.8.2 py38_cu102 pytorch
tornado 6.1 py38h27cfd23_0
tqdm 4.56.0 pypi_0 pypi
typing-extensions 3.7.4.3 0
typing_extensions 3.7.4.3 py_0 conda-forge
urllib3 1.26.4 pyhd3eb1b0_0
werkzeug 1.0.1 pyhd3eb1b0_0
wheel 0.36.2 pyhd3eb1b0_0
wrapt 1.12.1 pypi_0 pypi
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
yarl 1.6.3 py38h27cfd23_0
zipp 3.4.1 pyhd3eb1b0_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h9ceee32_0
For a100s this seemed to work at some point:
pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With