
How to solve the famous `unhandled cuda error, NCCL version 2.7.8` error?

I've seen multiple issues about the following error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

but none seem to fix it for me:

  • https://github.com/pytorch/pytorch/issues/54550
  • https://github.com/pytorch/pytorch/issues/47885
  • https://github.com/pytorch/pytorch/issues/50921
  • https://github.com/pytorch/pytorch/issues/54823

I've tried calling torch.cuda.set_device(device) manually at the beginning of every script, but that didn't work for me. I've tried different GPUs. I've also tried downgrading the PyTorch and CUDA versions, using different combinations of PyTorch 1.6.0, 1.7.1, 1.8.0 and CUDA 10.2, 11.0, 11.1. I am unsure what else to do. What did people do to solve this issue?


Possibly related:

  • Pytorch "NCCL error": unhandled system error, NCCL version 2.4.8

More complete error message:

('jobid', 4852)
('slurm_jobid', -1)
('slurm_array_task_id', -1)
('condor_jobid', 4852)
('current_time', 'Mar25_16-27-35')
('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb'))
('gpu_name', 'GeForce GTX TITAN X')
('PID', '30688')
torch.cuda.device_count()=2

opts.world_size=2

ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
rank=0
mp.current_process()=<SpawnProcess name='SpawnProcess-1' parent=30688 started>
os.getpid()=30704
setting up rank=0 (with world_size=2)
MASTER_ADDR='127.0.0.1'
59264
backend='nccl'
--> done setting up rank=0
setup process done for rank=0
Traceback (most recent call last):
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
    main_distributed()
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 212, in train
    tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
  File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
    model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

Bonus 1:

I still have errors:

ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1423, in <module>
    main()
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1365, in main
    train(args=args)
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1385, in train
    args.opt = move_opt_to_cherry_opt_and_sync_params(args) if is_running_parallel(args.rank) else args.opt
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/distributed.py", line 456, in move_opt_to_cherry_opt_and_sync_params
    args.opt = cherry.optim.Distributed(args.model.parameters(), opt=args.opt, sync=syn)
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 62, in __init__
    self.sync_parameters()
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 78, in sync_parameters
    dist.broadcast(p.data, src=root)
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8

One of the answers suggested that the CUDA versions reported by nvcc and torch.version.cuda should match, but they do not:

(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)"

11.1
(meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

How do I match them?
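To make the comparison above repeatable, here is a minimal sketch that extracts the release number from the nvcc -V output and compares it with the string printed by torch.version.cuda. The function names are hypothetical, and the sketch only reports a mismatch; it does not establish that a mismatch is what causes the NCCL error.

```python
import re
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Return the 'major.minor' release from `nvcc -V` output, or None if absent."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(nvcc_output: str, torch_cuda: Optional[str]) -> bool:
    """Compare nvcc's release with the string printed by torch.version.cuda."""
    release = parse_nvcc_release(nvcc_output)
    return release is not None and release == torch_cuda


# Output copied from the nvcc -V session shown above.
NVCC_OUTPUT = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
"""

print(parse_nvcc_release(NVCC_OUTPUT))           # 11.0
print(cuda_versions_match(NVCC_OUTPUT, "11.1"))  # False
```

In a real session the second argument would come from torch.version.cuda and the first from running nvcc -V, for example through subprocess.run.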

Asked by Charlie Parker on Mar 25 '21 at 20:03


2 Answers

I had the right CUDA version installed, meaning that the version printed by

python -c "import torch;print(torch.version.cuda)"

was equal to the release reported by

nvcc -V

and

ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//' 

was printing some version of NCCL (e.g., 2.10.3).

The fix was to remove nccl:

sudo apt remove libnccl2 libnccl-dev

After that, the libnccl version check no longer reported any version, but DDP training worked fine!
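The ldconfig check above can also be sketched in Python, which is handy inside a job script. This is only an illustration: nccl_versions is a hypothetical helper, and the sample text is made up to resemble ldconfig -v output rather than copied from a real machine.

```python
from typing import List


def nccl_versions(ldconfig_output: str) -> List[str]:
    """Collect the version suffix of every libnccl.so entry in `ldconfig -v` output.

    Mirrors the sed expression above: keep whatever follows the last ".so.".
    """
    versions = []
    for line in ldconfig_output.splitlines():
        if "libnccl.so" not in line:
            continue
        parts = line.rsplit(".so.", 1)
        if len(parts) == 2 and parts[1].strip():
            versions.append(parts[1].strip())
    return versions


# Illustrative sample only (assumed format, not real output from the answer).
SAMPLE = "\tlibnccl.so.2 -> libnccl.so.2.10.3\n\tlibc.so.6 -> libc-2.31.so\n"

found = nccl_versions(SAMPLE)
print(found[-1] if found else "no system-wide libnccl found")  # 2.10.3
```

An empty list corresponds to the state described after removing libnccl2 and libnccl-dev, where the check printed no version.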

Answered by Sadra Naddaf on Nov 10 '22 at 11:11


This is not a very satisfactory answer, but it is what ended up working for me. I simply used PyTorch 1.7.1 with its CUDA 10.2 build. It seems to work as long as CUDA 11.0 is loaded. To install that version, run:

conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=10.2 -c pytorch -c conda-forge

If you are on an HPC cluster, run module avail to make sure the right CUDA version is loaded. You may also need to source your bash configuration files and set up a few other things for the submission job to work. My setup looks as follows:

#!/bin/bash

echo JOB STARTED

# a submission job is usually empty and has the root of the submission so you probably need your HOME env var
export HOME=/home/miranda9
# to have modules work and the conda command work
source /etc/bashrc
source /etc/profile
source /etc/profile.d/modules.sh
source ~/.bashrc
source ~/.bash_profile

conda activate metalearningpy1.7.1c10.2
#conda activate metalearning1.7.1c11.1
#conda activate metalearning11.1

#module load cuda-toolkit/10.2
module load cuda-toolkit/11.1

#nvidia-smi
nvcc --version
#conda list
hostname
echo $PATH
which python

# - run script
python -u ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py

I also echo other useful things, such as the nvcc version, to make sure the module load worked (note that the header of nvidia-smi doesn't show the right CUDA version).
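The same diagnostics can be gathered from inside the Python process, so they end up in the same log as the training output. This is a rough sketch using only the standard library; run_or_none and collect_env_report are hypothetical names, and nvcc is simply reported as None when it is not on PATH.

```python
import os
import shutil
import socket
import subprocess
import sys
from typing import Dict, List, Optional


def run_or_none(cmd: List[str]) -> Optional[str]:
    """Run a command and return its stdout, or None if the tool is missing or silent."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.strip() or None


def collect_env_report() -> Dict[str, Optional[str]]:
    """Gather the values the job script echoes: hostname, python, PATH, nvcc."""
    return {
        "hostname": socket.gethostname(),
        "python": sys.executable,
        "path": os.environ.get("PATH"),
        "nvcc": run_or_none(["nvcc", "--version"]),
    }


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Calling collect_env_report() at the top of the training script, before spawning workers, shows which interpreter and toolkit the job actually picked up.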

Note that I think this is probably just a bug, since CUDA 11.1 and PyTorch 1.8.1 are new as of this writing. I did try

            torch.cuda.set_device(opts.gpu)  # https://github.com/pytorch/pytorch/issues/54550

but I can't say that it always works, or why it sometimes doesn't. I do have it in my current code, but I think I still get the error with PyTorch 1.8.x + CUDA 11.x.
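For reference, this is roughly where the set_device call sits in a per-rank setup. It is a sketch under assumptions, not a confirmed fix: local_device_index and setup_rank are hypothetical helpers, the address and port are taken from the log in the question, and setting NCCL_DEBUG=INFO is an optional addition that only makes NCCL log more detail.

```python
import os


def local_device_index(rank: int, n_gpus: int) -> int:
    """Map a process rank to a GPU index on a single node."""
    if n_gpus <= 0:
        raise ValueError("no CUDA devices visible")
    return rank % n_gpus


def setup_rank(rank: int, world_size: int, port: str = "59264") -> int:
    """Pin this process to one GPU, then join the NCCL process group."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", port)
    os.environ.setdefault("NCCL_DEBUG", "INFO")  # optional: more verbose NCCL logs

    gpu = local_device_index(rank, torch.cuda.device_count())
    torch.cuda.set_device(gpu)  # select the device before any NCCL communication
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return gpu
```

The returned index would then be passed on, for example as DistributedDataParallel(model.to(gpu), device_ids=[gpu]), matching the device_ids argument seen in the traceback.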

See my conda list in case it helps:

$ conda list


# packages in environment at /home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
absl-py                   0.12.0           py38h06a4308_0  
aioconsole                0.3.1                    pypi_0    pypi
aiohttp                   3.7.4            py38h27cfd23_1  
anatome                   0.0.1                    pypi_0    pypi
argcomplete               1.12.2                   pypi_0    pypi
astunparse                1.6.3                    pypi_0    pypi
async-timeout             3.0.1            py38h06a4308_0  
attrs                     20.3.0             pyhd3eb1b0_0  
beautifulsoup4            4.9.3              pyha847dfd_0  
blas                      1.0                         mkl  
blinker                   1.4              py38h06a4308_0  
boto                      2.49.0                   pypi_0    pypi
brotlipy                  0.7.0           py38h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.17.1               h27cfd23_0  
ca-certificates           2021.1.19            h06a4308_1  
cachetools                4.2.1              pyhd3eb1b0_0  
cairo                     1.14.12              h8948797_3  
certifi                   2020.12.5        py38h06a4308_0  
cffi                      1.14.0           py38h2e261b9_0  
chardet                   3.0.4           py38h06a4308_1003  
click                     7.1.2              pyhd3eb1b0_0  
cloudpickle               1.6.0                    pypi_0    pypi
conda                     4.9.2            py38h06a4308_0  
conda-build               3.21.4           py38h06a4308_0  
conda-package-handling    1.7.2            py38h03888b9_0  
coverage                  5.5              py38h27cfd23_2  
crcmod                    1.7                      pypi_0    pypi
cryptography              3.4.7            py38hd23ed53_0  
cudatoolkit               10.2.89              hfd86e86_1  
cycler                    0.10.0                   py38_0  
cython                    0.29.22          py38h2531618_0  
dbus                      1.13.18              hb2f20db_0  
decorator                 5.0.3              pyhd3eb1b0_0  
dgl-cuda10.2              0.6.0post1               py38_0    dglteam
dill                      0.3.3              pyhd3eb1b0_0  
expat                     2.3.0                h2531618_2  
fasteners                 0.16                     pypi_0    pypi
filelock                  3.0.12             pyhd3eb1b0_1  
flatbuffers               1.12                     pypi_0    pypi
fontconfig                2.13.1               h6c09931_0  
freetype                  2.10.4               h7ca028e_0    conda-forge
fribidi                   1.0.10               h7b6447c_0  
future                    0.18.2                   pypi_0    pypi
gast                      0.3.3                    pypi_0    pypi
gcs-oauth2-boto-plugin    2.7                      pypi_0    pypi
glib                      2.63.1               h5a9c865_0  
glob2                     0.7                pyhd3eb1b0_0  
google-apitools           0.5.31                   pypi_0    pypi
google-auth               1.28.0             pyhd3eb1b0_0  
google-auth-oauthlib      0.4.3              pyhd3eb1b0_0  
google-pasta              0.2.0                    pypi_0    pypi
google-reauth             0.1.1                    pypi_0    pypi
graphite2                 1.3.14               h23475e2_0  
graphviz                  2.40.1               h21bd128_2  
grpcio                    1.32.0                   pypi_0    pypi
gst-plugins-base          1.14.0               hbbd80ab_1  
gstreamer                 1.14.0               hb453b48_1  
gsutil                    4.60                     pypi_0    pypi
gym                       0.18.0                   pypi_0    pypi
h5py                      2.10.0                   pypi_0    pypi
harfbuzz                  1.8.8                hffaf4a1_0  
higher                    0.2.1                    pypi_0    pypi
httplib2                  0.19.0                   pypi_0    pypi
icu                       58.2                 he6710b0_3  
idna                      2.10               pyhd3eb1b0_0  
importlib-metadata        3.7.3            py38h06a4308_1  
intel-openmp              2020.2                      254  
jinja2                    2.11.3             pyhd3eb1b0_0  
joblib                    1.0.1              pyhd3eb1b0_0  
jpeg                      9b                   h024ee3a_2  
keras-preprocessing       1.1.2                    pypi_0    pypi
kiwisolver                1.3.1            py38h2531618_0  
lark-parser               0.6.5                    pypi_0    pypi
lcms2                     2.11                 h396b838_0  
ld_impl_linux-64          2.33.1               h53a641e_7  
learn2learn               0.1.5                    pypi_0    pypi
libarchive                3.4.2                h62408e4_0  
libffi                    3.2.1             hf484d3e_1007  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
liblief                   0.10.1               he6710b0_0  
libpng                    1.6.37               h21135ba_2    conda-forge
libprotobuf               3.14.0               h8c45485_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.1.0                h2733197_1  
libuuid                   1.0.3                h1bed415_2  
libuv                     1.40.0               h7b6447c_0  
libxcb                    1.14                 h7b6447c_0  
libxml2                   2.9.10               hb55368b_3  
lmdb                      0.94                     pypi_0    pypi
lz4-c                     1.9.2                he1b5a44_3    conda-forge
markdown                  3.3.4            py38h06a4308_0  
markupsafe                1.1.1            py38h7b6447c_0  
matplotlib                3.3.4            py38h06a4308_0  
matplotlib-base           3.3.4            py38h62a2d02_0  
memory-profiler           0.58.0                   pypi_0    pypi
mkl                       2020.2                      256  
mkl-service               2.3.0            py38h1e0a361_2    conda-forge
mkl_fft                   1.3.0            py38h54f3939_0  
mkl_random                1.2.0            py38hc5bc63f_1    conda-forge
mock                      2.0.0                    pypi_0    pypi
monotonic                 1.5                      pypi_0    pypi
multidict                 5.1.0            py38h27cfd23_2  
ncurses                   6.2                  he6710b0_1  
networkx                  2.5                        py_0  
ninja                     1.10.2           py38hff7bd54_0  
numpy                     1.19.2           py38h54aff64_0  
numpy-base                1.19.2           py38hfa32c7d_0  
oauth2client              4.1.3                    pypi_0    pypi
oauthlib                  3.1.0                      py_0  
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openssl                   1.1.1k               h27cfd23_0  
opt-einsum                3.3.0                    pypi_0    pypi
ordered-set               4.0.2                    pypi_0    pypi
pandas                    1.2.3            py38ha9443f7_0  
pango                     1.42.4               h049681c_0  
patchelf                  0.12                 h2531618_1  
pbr                       5.5.1                    pypi_0    pypi
pcre                      8.44                 he6710b0_0  
pexpect                   4.6.0                    pypi_0    pypi
pillow                    7.2.0                    pypi_0    pypi
pip                       21.0.1           py38h06a4308_0  
pixman                    0.40.0               h7b6447c_0  
pkginfo                   1.7.0            py38h06a4308_0  
progressbar2              3.39.3                   pypi_0    pypi
protobuf                  3.14.0           py38h2531618_1  
psutil                    5.8.0            py38h27cfd23_1  
ptyprocess                0.7.0                    pypi_0    pypi
py-lief                   0.10.1           py38h403a769_0  
pyasn1                    0.4.8                      py_0  
pyasn1-modules            0.2.8                      py_0  
pycapnp                   1.0.0                    pypi_0    pypi
pycosat                   0.6.3            py38h7b6447c_1  
pycparser                 2.20                       py_2  
pyglet                    1.5.0                    pypi_0    pypi
pyjwt                     1.7.1                    py38_0  
pyopenssl                 20.0.1             pyhd3eb1b0_1  
pyparsing                 2.4.7              pyhd3eb1b0_0  
pyqt                      5.9.2            py38h05f1152_4  
pysocks                   1.7.1            py38h06a4308_0  
python                    3.8.2                hcf32534_0  
python-dateutil           2.8.1              pyhd3eb1b0_0  
python-libarchive-c       2.9                pyhd3eb1b0_0  
python-utils              2.5.6                    pypi_0    pypi
python_abi                3.8                      1_cp38    conda-forge
pytorch                   1.7.1           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
pytz                      2021.1             pyhd3eb1b0_0  
pyu2f                     0.1.5                    pypi_0    pypi
pyyaml                    5.4.1            py38h27cfd23_1  
qt                        5.9.7                h5867ecd_1  
readline                  8.1                  h27cfd23_0  
requests                  2.25.1             pyhd3eb1b0_0  
requests-oauthlib         1.3.0                      py_0  
retry-decorator           1.1.1                    pypi_0    pypi
ripgrep                   12.1.1                        0  
rsa                       4.7.2              pyhd3eb1b0_1  
ruamel_yaml               0.15.100         py38h27cfd23_0  
scikit-learn              0.24.1           py38ha9443f7_0  
scipy                     1.6.2            py38h91f5cce_0  
setuptools                52.0.0           py38h06a4308_0  
sexpdata                  0.0.3                    pypi_0    pypi
sip                       4.19.13          py38he6710b0_0  
six                       1.15.0             pyh9f0ad1d_0    conda-forge
soupsieve                 2.2.1              pyhd3eb1b0_0  
sqlite                    3.35.2               hdfb4753_0  
tensorboard               2.4.0              pyhc547734_0  
tensorboard-plugin-wit    1.6.0                      py_0  
tensorflow                2.4.1                    pypi_0    pypi
tensorflow-estimator      2.4.0                    pypi_0    pypi
termcolor                 1.1.0                    pypi_0    pypi
threadpoolctl             2.1.0              pyh5ca1d4c_0  
tk                        8.6.10               hbc83047_0  
torchaudio                0.7.2                      py38    pytorch
torchmeta                 1.7.0                    pypi_0    pypi
torchtext                 0.8.1                      py38    pytorch
torchvision               0.8.2                py38_cu102    pytorch
tornado                   6.1              py38h27cfd23_0  
tqdm                      4.56.0                   pypi_0    pypi
typing-extensions         3.7.4.3                       0  
typing_extensions         3.7.4.3                    py_0    conda-forge
urllib3                   1.26.4             pyhd3eb1b0_0  
werkzeug                  1.0.1              pyhd3eb1b0_0  
wheel                     0.36.2             pyhd3eb1b0_0  
wrapt                     1.12.1                   pypi_0    pypi
xz                        5.2.5                h7b6447c_0  
yaml                      0.2.5                h7b6447c_0  
yarl                      1.6.3            py38h27cfd23_0  
zipp                      3.4.1              pyhd3eb1b0_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.5                h9ceee32_0 

For A100s, this seemed to work at some point:

pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Answered by Charlie Parker on Nov 10 '22 at 12:11