I am trying to run the Keras MINST example using tensorflow-gpu with a Geforce 2080. My environment is Anaconda on a Linux system.
I am running the unmodified example from a command line python session. I get the following output:
Using TensorFlow backend.
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
conv2d_1/random_uniform/RandomUniform: (RandomUniform):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/sub: (Sub):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/mul: (Mul):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform: (Add):
/job:localhost/replica:0/task:0/device:GPU:0
[...]
The last lines I receive are:
training/Adadelta/Const_31: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_46/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_47/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Segmentation fault (core dumped)
From reading around I assumed this might be a memory problem and added these lines to prevent the GPU from running out of memory:
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction=0.3
K.tensorflow_backend.set_session(tf.Session(config=config))
Checking with the nvidia-smi
tool that the GPU is actually used (watch -n1 nvidia-smi
)I can confirm from the following output (in this run no per_process_gpu_memory_fraction
was set to 1):
I suspect a version incompatibility somewhere between CUDA, Keras and Tensorflow to be the issue, but I don't know, how to debug this.
What debugging measures are available to get to the bottom of this? What other issues might be the reason for this segfault?
EDIT: I experimented further and replacing the model with this code works fine:
model = keras.Sequential([
keras.layers.Flatten(input_shape=input_shape),
keras.layers.Dense(128, activation=tf.nn.relu),
keras.layers.Dense(10, activation=tf.nn.softmax)
])
However once I introduce a convolution layer like so
model = keras.Sequential([
keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape),
# keras.layers.Flatten(input_shape=input_shape),
keras.layers.Flatten(),
keras.layers.Dense(128, activation=tf.nn.relu),
keras.layers.Dense(10, activation=tf.nn.softmax)
])
then I again get the aforementioned segfault.
All packets have been installed through Anaconda. I have installed
EDIT: I tried the same code in a non anaconda environment and it works flawlessly. I would prefer to use anaconda though to avoid system updates breaking things.
In this post, I want to talk about debugging in TensorFlow. It is well known, that program debugging is an integral part of software development, and that the time that is spent debugging, often eclipses the time that it takes to write the original program.
Applied to a TensorFlow training loop, this would imply the ability to test different subsets of the training pipeline, such as the dataset, the loss function, different model layers, and callbacks, separately. This is not always easy to do, as some of the training modules (such as the loss function) are pretty dependent on the other modules.
Class 1: Pixel belonging to the pet. Class 2: Pixel bordering the pet. Class 3: None of the above/a surrounding pixel. The dataset is available from TensorFlow Datasets. The segmentation masks are included in version 3+. In addition, the image color values are normalized to the [0,1] range.
You may also want to see the Tensorflow Object Detection API for another model you can retrain on your own data. Pretrained models are available on TensorFlow Hub Was this helpful?
Build the tensorflow from source(r1.13) .Conv2D segmentation fault fixed.
follow Build from Source
my GPU : RTX 2070 Ubuntu 16.04 Python 3.5.2 Nvidia Driver 410.78 CUDA - 10.0.130 cuDNN-10.0 - 7.4.2.24 TensorRT-5.0.0 Compute Capability: 7.5
Build : tensorflow-1.13.0rc0-cp35-cp35m-linux_x86_64
Download prebuilt from https://github.com/tensorflow/tensorflow/issues/22706
I had the exact same problem on a very similar system as Francois but using a RTX2070 on which I could reliably reproduce the segmentation fault error when using the conv2d function executed on the GPU. My setting:
I finally solved it by building tensorflow from source into a new conda environment. For a fantastic guide see e.g. the following link: https://gist.github.com/Brainiarc7/6d6c3f23ea057775b72c52817759b25c
This is basically like any other build-tensorflow-from-source guide and consisted in my case of the following steps:
./configure
bazel build
command (see link for details)Some minor issues came up during the build, one of which was solved by installing 3 packages manually, using:
pip install keras_applications==1.0.4 --no-deps
pip install keras_preprocessing==1.0.2 --no-deps
pip install h5py==2.8.0
which I found out using this answer here: Error Compiling Tensorflow From Source - No module named 'keras_applications'
conv2d now works like a charm when using the gpu!
However, since all this took a fairly long time (building from source takes over an hour, not counting the search for the solution on the internet) I recommend to make a backup of the system after you get it working, e.g. using timeshift or any other program that you like.
I had the same Conv2D problem with:
Best advice was from this link: https://github.com/tensorflow/tensorflow/issues/24383
So a fix should come with Tensorflow 1.13. In the meantime, using Tensorflow 1.13 nightly build (Dec 26, 2018) + using tensorflow.keras instead of keras solved the issue.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With