I'm a newbie when it comes to AWS and Tensorflow and I've been learning about CNNs over the last week via Udacity's Machine Learning course. Now I've a need to use an AWS instance of a GPU. I launched a p2.xlarge instance of Deep Learning AMI with Source Code (CUDA 8, Ubuntu) (that's what they recommended)
But now, it seems that tensorflow is not using the GPU at all. It's still training using the CPU. I did some searching and I found some answers to this problem and none of them seemed to work.
When I run the Jupyter notebook, it still uses the CPU
What do I do to get it to run on the GPU and not the CPU?
In Ubuntu 18.04 LTS, the latest conda works well in resolving dependency issues of packages for the newest version of python. Thus, all you have to do is run conda create --name tf_gpu and then conda activate tf_gpu to activate it. Then conda install tensorflow-gpu , which should do it.
If TensorFlow doesn't detect your GPU, it will default to the CPU, which means when doing heavy training jobs, these will take a really long time to complete. This is most likely because the CUDA and CuDNN drivers are not being correctly detected in your system.
TensorFlow supports running computations on a variety of types of devices, including CPU and GPU.
The problem of tensorflow not detecting GPU can possibly be due to one of the following reasons.
Before proceeding to solve the issue, we assume that the installed environment is an AWS Deep Learning AMI having CUDA 8.0 and tensorflow version 1.4.1 installed. This assumption is derived from the discussion in comments.
To solve the problem, we proceed as follows:
pip freeze | grep tensorflow
pip uninstall tensorflow
pip install tensorflow-gpu==1.4.1
pip uninstall tensorflow
pip uninstall tensorflow-gpu
pip install tensorflow-gpu==1.4.1
At this point, if all the dependencies of tensorflow are installed correctly, tensorflow GPU version should work fine. A common error at this stage (as encountered by OP) is the missing cuDNN library which can result in following error while importing tensorflow into a python module
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
It can be fixed by installing the correct version of NVIDIA's cuDNN library. Tensorflow version 1.4.1 depends upon cuDNN version 6.0 and CUDA 8, so we download the corresponding version from cuDNN archive page (Download Link). We have to login to the NVIDIA developer account to be able to download the file, therefore it is not possible to download it using command line tools such as wget
or curl
. A possible solution is to download the file on host system and use scp
to copy it onto AWS.
Once copied to AWS, extract the file using the following command:
tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz
The extracted directory should have structure similar to the CUDA toolkit installation directory. Assuming that CUDA toolkit is installed in the directory /usr/local/cuda
, we can install cuDNN by copying the files from the downloaded archive into corresponding folders of CUDA Toolkit installation directory followed by linker update command ldconfig
as follows:
cp cuda/include/* /usr/local/cuda/include
cp cuda/lib64/* /usr/local/cuda/lib64
ldconfig
After this, we should be able to import tensorflow GPU version into our python modules.
A few considerations:
pip
should be replaced with pip3
.pip
, cp
and ldconfig
may require to be run as sudo
.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With