
CUDNN_STATUS_NOT_INITIALIZED when trying to run TensorFlow

I have installed TensorFlow 1.7 on Ubuntu 16.04 with CUDA 9.0, cuDNN 7.0.5, and vanilla Python 2.7. The samples for both CUDA and cuDNN run fine, and TensorFlow sees the GPU (so some TensorFlow examples run), but those that use cuDNN (like most CNN examples) do not. They fail with these log messages:

2018-04-10 16:14:17.013026: I tensorflow/stream_executor/plugin_registry.cc:243] Selecting default DNN plugin, cuDNN
2018-04-10 16:14:17.013100: E tensorflow/stream_executor/cuda/cuda_dnn.cc:403] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2018-04-10 16:14:17.013119: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.130  Wed Mar 21 03:37:26 PDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
"""
2018-04-10 16:14:17.013131: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:112] version string "384.130" made value 384.130.0
2018-04-10 16:14:17.013135: E tensorflow/stream_executor/cuda/cuda_dnn.cc:411] possibly insufficient driver version: 384.130.0
2018-04-10 16:14:17.013139: E tensorflow/stream_executor/cuda/cuda_dnn.cc:370] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2018-04-10 16:14:17.013143: F tensorflow/core/kernels/conv_ops.cc:712] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)

Turning on a flood of VLOG messages (see my link below for how to do this) did not produce any additional relevant messages.

The key message here might be "Selecting default DNN plugin, cuDNN": looking at the code, I suspect it means that TensorFlow cannot load the cuDNN library modules. For all I know, though, that message is normal (it is informational, not a warning) and the problem is something else.

For example, in an earlier version the "CUDNN_STATUS_NOT_INITIALIZED" message seems to have been caused by TF allocating memory too aggressively ahead of time, so that cuDNN could not initialize (I found this in the TF GitHub issues list). I tried those remedies (including resetting the GPU and rebooting), but they did not help.
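The post does not name the remedies that were tried. One commonly suggested one for TF 1.x, shown here only as an assumption about what such a remedy looks like, is to let TensorFlow grow its GPU memory pool on demand instead of reserving it up front. The helper name make_growth_session is hypothetical, and the sketch assumes the TF 1.x API (tf.ConfigProto and tf.Session):

```python
def make_growth_session():
    """Create a TF 1.x session whose GPU memory pool grows on demand."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 1.x

    config = tf.ConfigProto()
    # Allocate GPU memory as needed instead of reserving nearly all of it
    # when the session starts.
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)
```

As the answer below shows, this kind of setting would not have helped in this case, because the cause was a mismatched cuDNN build and not memory pressure.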

Any ideas as to what I should try next?

Mike Wise asked Apr 11 '18 08:04


1 Answer

OK, I found it. The problem was that I had the wrong version of cuDNN installed, so my suspicion that TensorFlow was not actually finding the correct shared library was true.

Basically, I had installed cuDNN v7.1.2 for CUDA 9.1 instead of cuDNN v7.1.2 for CUDA 9.0, which seems to have caused it to fail silently, although I would have expected an error message at this point. Note that I had detailed VLOGs running (see my answer on the post "Turning on TF Logs" for more information on how to do that).
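A mismatch like this can be made visible by asking the cuDNN shared library itself which version it is and which CUDA toolkit it was built against. The following is a minimal sketch, not something from the original post. It assumes the library exports cudnnGetVersion and cudnnGetCudartVersion (as cuDNN 7 does), and that version integers use the cuDNN 7.x encoding (major*1000 + minor*100 + patch) and the CUDA runtime encoding (major*1000 + minor*10). The helper names are hypothetical.

```python
import ctypes
import ctypes.util


def decode_cudnn_version(v):
    """Split a cuDNN 7.x version integer (e.g. 7005) into (major, minor, patch)."""
    return v // 1000, (v % 1000) // 100, v % 100


def decode_cudart_version(v):
    """Split a CUDA runtime version integer (e.g. 9000) into (major, minor)."""
    return v // 1000, (v % 1000) // 10


def query_cudnn(libname=None):
    """Return (cudnn_version, cuda_version) tuples, or None if cuDNN cannot be loaded."""
    name = libname or ctypes.util.find_library("cudnn")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        get_version = lib.cudnnGetVersion
        get_cudart = lib.cudnnGetCudartVersion
    except (OSError, AttributeError):
        # Library missing, not loadable, or lacking the expected symbols.
        return None
    get_version.restype = ctypes.c_size_t
    get_cudart.restype = ctypes.c_size_t
    return decode_cudnn_version(get_version()), decode_cudart_version(get_cudart())


if __name__ == "__main__":
    print(query_cudnn())
```

If the library lives only on LD_LIBRARY_PATH, find_library may not see it; in that case pass an explicit name or path, for example query_cudnn("libcudnn.so.7"). A second element of (9, 1) on a CUDA 9.0 system would point to the kind of mismatch described above.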

When I installed cuDNN v7.1.2 for CUDA 9.0, TensorFlow did in fact find it and complained that the version was not new enough. In fact the real problem was that it was not old enough, but at least I had some real data to work with.

In the end cuDNN v7.0.5 for Cuda 9.0 was what I needed and that worked.
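To confirm which cuDNN version is installed before running TensorFlow, the version defines in cudnn.h can be read directly. This is a sketch under assumptions: cuDNN 7 headers define CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL, while the header path below is only a typical location and may differ on your system. The function name is hypothetical.

```python
import re

# Matches lines such as "#define CUDNN_MAJOR 7" in cudnn.h.
_DEFINE = re.compile(
    r"^\s*#\s*define\s+CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", re.M
)


def cudnn_header_version(header_text):
    """Return (major, minor, patch) from cudnn.h text, or None if incomplete."""
    found = {key: int(value) for key, value in _DEFINE.findall(header_text)}
    try:
        return found["MAJOR"], found["MINOR"], found["PATCHLEVEL"]
    except KeyError:
        return None


if __name__ == "__main__":
    path = "/usr/local/cuda/include/cudnn.h"  # assumed location, adjust as needed
    try:
        with open(path) as f:
            print(cudnn_header_version(f.read()))
    except IOError:
        print("cudnn.h not found at " + path)
```

For the working setup described here, the result would be expected to equal (7, 0, 5). Note that the header only tells you what was installed alongside it, not which shared library the loader picks up at run time.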

Mike Wise answered Sep 21 '22 01:09