
How to debug Tensorflow segmentation fault in model.fit()?

I am trying to run the Keras MNIST example using tensorflow-gpu with a GeForce RTX 2080. My environment is Anaconda on a Linux system.

I am running the unmodified example from a command-line Python session. I get the following output:

Using TensorFlow backend.
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
conv2d_1/random_uniform/RandomUniform: (RandomUniform): 
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/sub: (Sub): 
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/mul: (Mul): 
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform: (Add): 
/job:localhost/replica:0/task:0/device:GPU:0
[...]

The last lines I receive are:

training/Adadelta/Const_31: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_46/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_47/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Segmentation fault (core dumped)

From reading around I assumed this might be a memory problem and added these lines to prevent the GPU from running out of memory:

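import tensorflow as tf
from keras import backend as K
# Log device placement and cap this process at 30% of the GPU memory.
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3
K.tensorflow_backend.set_session(tf.Session(config=config))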

Checking with nvidia-smi (watch -n1 nvidia-smi), I can confirm that the GPU is actually used, as the following output shows (in this run the GPU memory was not capped via per_process_gpu_memory_fraction):

[screenshot of nvidia-smi output]

I suspect a version incompatibility somewhere between CUDA, Keras and TensorFlow to be the issue, but I don't know how to debug this.

What debugging measures are available to get to the bottom of this? What other issues might be the reason for this segfault?
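
One generic first step, independent of TensorFlow, is Python's standard-library faulthandler module, which prints the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. It will not explain a crash inside native CUDA or cuDNN code, but it shows which Python call (for example model.fit) triggered it. A minimal sketch, assuming these lines are placed at the very top of the example script:

```python
import faulthandler

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL that
# dump the Python traceback of all threads to stderr before the process dies.
faulthandler.enable()

# ... the unchanged Keras example (model definition, model.fit) follows here ...
```

The same handlers can be enabled without editing the script by starting the interpreter with python -X faulthandler followed by the script name.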

EDIT: I experimented further and replacing the model with this code works fine:

model = keras.Sequential([
    keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

However, once I introduce a convolution layer like so:

model = keras.Sequential([
    keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape),
    # keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

then I again get the aforementioned segfault.

All packages were installed through Anaconda. I have installed:

  • conda 4.5.11
  • python 3.6.6
  • keras-gpu 2.2.4
  • tensorflow 1.12.0
  • tensorflow-gpu 1.12.0
  • cudnn 7.2.1
  • cudatoolkit 9.2

EDIT: I tried the same code in a non-Anaconda environment and it works flawlessly. I would prefer to use Anaconda, though, to avoid system updates breaking things.
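
Since the same code works in one environment and crashes in the other, a cheap first check is to compare the package versions each interpreter actually sees. A minimal sketch, assuming Python 3.8 or newer for importlib.metadata (the Python 3.6 listed above would need the importlib-metadata backport instead); installed_versions is a hypothetical helper name. Note that cudnn and cudatoolkit are conda packages rather than Python distributions, so they appear in conda list but not here:

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Python package names taken from the list above; extend as needed.
    packages = ["tensorflow", "tensorflow-gpu", "keras"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

Running the script once in each environment and comparing the two outputs shows at a glance whether the environments differ in their TensorFlow or Keras versions.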

Herr von Wurst asked Nov 14 '18 14:11



3 Answers

Build TensorFlow from source (r1.13). This fixed the Conv2D segmentation fault for me.

Follow the "Build from source" guide.

My setup:

  • GPU: RTX 2070 (compute capability 7.5)
  • Ubuntu 16.04
  • Python 3.5.2
  • NVIDIA driver 410.78
  • CUDA 10.0.130
  • cuDNN 7.4.2.24 (for CUDA 10.0)
  • TensorRT 5.0.0

Build: tensorflow-1.13.0rc0-cp35-cp35m-linux_x86_64

Alternatively, download a prebuilt package from https://github.com/tensorflow/tensorflow/issues/22706

bhagath answered Oct 18 '22 19:10


I had the exact same problem on a system very similar to Francois's, but with an RTX 2070, on which I could reliably reproduce the segmentation fault when running conv2d on the GPU. My setup:

  • Ubuntu: 18.04
  • GPU: RTX 2070
  • CUDA: 10
  • cudnn: 7
  • conda with python 3.6

I finally solved it by building tensorflow from source into a new conda environment. For a fantastic guide see e.g. the following link: https://gist.github.com/Brainiarc7/6d6c3f23ea057775b72c52817759b25c

This is basically like any other build-tensorflow-from-source guide and consisted in my case of the following steps:

  1. installing Bazel
  2. cloning tensorflow from git and running ./configure
  3. running the appropriate bazel build command (see link for details)

Some minor issues came up during the build, one of which was solved by installing 3 packages manually, using:

pip install keras_applications==1.0.4 --no-deps
pip install keras_preprocessing==1.0.2 --no-deps
pip install h5py==2.8.0

which I found out using this answer here: Error Compiling Tensorflow From Source - No module named 'keras_applications'

conv2d now works like a charm when using the GPU!

However, since all this took a fairly long time (building from source takes over an hour, not counting the time spent searching the internet for a solution), I recommend making a backup of the system once it works, e.g. using timeshift or any other program you like.

Laurin Herbsthofer answered Oct 18 '22 19:10


I had the same Conv2D problem with:

  • Ubuntu 18.04
  • Graphics card: GeForce RTX 2080
  • CUDA: cuda_10.0.130_410
  • CUDNN: cudnn-10.0-linux-x64-v7.4.2
  • conda with Python 3.6

Best advice was from this link: https://github.com/tensorflow/tensorflow/issues/24383

So a fix should come with TensorFlow 1.13. In the meantime, using the TensorFlow 1.13 nightly build (Dec 26, 2018) together with tensorflow.keras instead of the standalone keras package solved the issue.
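
Switching to tensorflow.keras mostly means changing the imports; the model from the question can stay as it is. A minimal sketch, assuming the MNIST input shape (28, 28, 1) and 10 classes from the question. build_model is a hypothetical helper name, and the loss and optimizer are chosen to mirror the Adadelta setup visible in the question's log, not taken from the linked issue:

```python
def build_model(input_shape=(28, 28, 1), num_classes=10):
    """Build the question's small CNN with the Keras bundled in TensorFlow."""
    # Import from tensorflow, not from the standalone keras package.
    # Done inside the function so TensorFlow is only loaded when a model is built.
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu",
                            input_shape=input_shape),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # categorical_crossentropy expects one-hot encoded labels.
    model.compile(loss="categorical_crossentropy",
                  optimizer="adadelta",
                  metrics=["accuracy"])
    return model
```

Calling build_model().summary() lists the layers, which is a quick way to confirm that the tensorflow.keras import path is in use before starting a training run.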

Francois Robert answered Oct 18 '22 21:10