How to debug Tensorflow segmentation fault in model.fit()?

Tags:

python

tensorflow

keras

I am trying to run the Keras MNIST example using tensorflow-gpu with a GeForce RTX 2080. My environment is Anaconda on a Linux system.

I am running the unmodified example from a command line python session. I get the following output:

Using TensorFlow backend.
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
conv2d_1/random_uniform/RandomUniform: (RandomUniform): /job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/sub: (Sub): /job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform: (Add): /job:localhost/replica:0/task:0/device:GPU:0
[...]

The last lines I receive are:

training/Adadelta/Const_31: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_46/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_47/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Segmentation fault (core dumped)

From reading around I assumed this might be a memory problem and added these lines to prevent the GPU from running out of memory:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3
K.tensorflow_backend.set_session(tf.Session(config=config))

Checking with the nvidia-smi tool (watch -n1 nvidia-smi), I can confirm from the following output that the GPU is actually used (in this run per_process_gpu_memory_fraction was set to 1):

[Screenshot: nvidia-smi output]

I suspect a version incompatibility somewhere between CUDA, Keras and TensorFlow, but I don't know how to debug this.

What debugging measures are available to get to the bottom of this? What other issues might be the reason for this segfault?
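
One general-purpose measure for this kind of crash is Python's built-in faulthandler module, which prints the Python traceback when the interpreter receives SIGSEGV. The sketch below is only a starting point, not a fix: it shows which Python call was active at the time of the crash, not what went wrong inside the native library. The log file name is a hypothetical choice; a separate file is used here so the traceback is not buried in the device placement log.

```python
import faulthandler

# Hypothetical file name. The file object must stay open for as long as
# faulthandler is enabled, so it is deliberately not closed here.
crash_log = open("segfault_traceback.log", "w")

# Dump the Python traceback of all threads into crash_log if the process
# receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
faulthandler.enable(file=crash_log, all_threads=True)

# ... build the model and call model.fit() after this point ...
```

The same handler can be switched on without editing the script by running it as python -X faulthandler script.py, in which case the traceback goes to stderr.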

EDIT: I experimented further and replacing the model with this code works fine:

model = keras.Sequential([
    keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

However, once I introduce a convolution layer like so:

model = keras.Sequential([
    keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape),
    # keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

then I again get the aforementioned segfault.
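
To narrow down which variant crashes without losing the interactive session each time, each variant can be run in its own child process and the exit status inspected. This is a minimal sketch assuming a POSIX system, where a child killed by a signal is reported with a negative return code. The function name run_isolated and the two placeholder snippets are hypothetical; in practice they would be replaced by the Dense-only and Conv2D scripts.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=600):
    """Run a Python snippet in a child process.

    Returns (returncode, crashed), where crashed is True if the child
    was killed by SIGSEGV (POSIX reports this as a negative return code).
    """
    proc = subprocess.run([sys.executable, "-c", snippet], timeout=timeout)
    return proc.returncode, proc.returncode == -signal.SIGSEGV


if __name__ == "__main__":
    # Placeholder snippets: replace them with the two model scripts.
    candidates = [
        ("dense", "print('dense model ran')"),
        ("conv", "print('conv model ran')"),
    ]
    for name, code in candidates:
        returncode, crashed = run_isolated(code)
        print(name, "return code:", returncode, "segfault:", crashed)
```

Because the crash happens in the child, the parent keeps running and can try the next variant, for example the same Conv2D model with the GPU hidden from the child process.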

All packages have been installed through Anaconda. The installed versions are:

conda 4.5.11
python 3.6.6
keras-gpu 2.2.4
tensorflow 1.12.0
tensorflow-gpu 1.12.0
cudnn 7.2.1
cudatoolkit 9.2
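
When comparing a working and a failing environment, it helps to record the versions from inside the interpreter that actually runs the script. The sketch below assumes Python 3.8 or newer (importlib.metadata is not available in the Python 3.6 listed above), and the helper name installed_versions and the list of package names are only illustrative. It only sees Python distributions: conda packages such as cudnn and cudatoolkit are not visible to it and have to be checked with conda list.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Illustrative selection of packages; adjust to the environment.
    wanted = ["tensorflow", "tensorflow-gpu", "keras", "h5py", "numpy"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this once in the Anaconda environment and once in the non-Anaconda one gives two lists that can be diffed directly.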

EDIT: I tried the same code in a non-Anaconda environment and it works flawlessly. I would still prefer to use Anaconda, though, to avoid system updates breaking things.

asked Nov 14 '18 14:11

Herr von Wurst

3 Answers

Building TensorFlow from source (branch r1.13) fixed the Conv2D segmentation fault for me.

Follow the official "Build from source" instructions.

My setup:

GPU: RTX 2070
Ubuntu 16.04
Python 3.5.2
Nvidia driver 410.78
CUDA 10.0.130
cuDNN 7.4.2.24 (for CUDA 10.0)
TensorRT 5.0.0
Compute capability: 7.5

Build: tensorflow-1.13.0rc0-cp35-cp35m-linux_x86_64

A prebuilt package can be downloaded from https://github.com/tensorflow/tensorflow/issues/22706

answered Oct 18 '22 19:10

bhagath

I had exactly the same problem on a system very similar to Francois's, but with an RTX 2070, on which I could reliably reproduce the segmentation fault whenever the conv2d function was executed on the GPU. My setup:

Ubuntu: 18.04
GPU: RTX 2070
CUDA: 10
cudnn: 7
conda with python 3.6

I finally solved it by building tensorflow from source into a new conda environment. For a fantastic guide see e.g. the following link: https://gist.github.com/Brainiarc7/6d6c3f23ea057775b72c52817759b25c

This is basically like any other build-tensorflow-from-source guide and consisted in my case of the following steps:

1. installing bazel
2. cloning tensorflow from git and running ./configure
3. running the appropriate bazel build command (see link for details)

Some minor issues came up during the build, one of which was solved by installing 3 packages manually, using:

pip install keras_applications==1.0.4 --no-deps
pip install keras_preprocessing==1.0.2 --no-deps
pip install h5py==2.8.0

I found this fix in the answer to the question "Error Compiling Tensorflow From Source - No module named 'keras_applications'".

conv2d now works like a charm when using the gpu!

However, since all this took a fairly long time (building from source takes over an hour, not counting the search for the solution on the internet), I recommend making a backup of the system once you get it working, e.g. using timeshift or any other program you like.

answered Oct 18 '22 19:10

Laurin Herbsthofer

I had the same Conv2D problem with:

Ubuntu 18.04
Graphics card: GeForce RTX 2080
CUDA: cuda_10.0.130_410
CUDNN: cudnn-10.0-linux-x64-v7.4.2
conda with Python 3.6
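
Every setup reported as working in this thread pairs an RTX 20-series card (compute capability 7.5) with CUDA 10, whereas the question lists cudatoolkit 9.2. Whether that mismatch is the actual cause of the segfault is not confirmed anywhere here, but it is cheap to check. The sketch below encodes the first CUDA toolkit release that can target a few GPU architectures. The table is deliberately incomplete, and the names MIN_CUDA_FOR_COMPUTE_CAPABILITY and toolkit_supports are hypothetical.

```python
# First CUDA toolkit release that can target each compute capability.
# Only a few architectures are listed; extend the table as needed.
MIN_CUDA_FOR_COMPUTE_CAPABILITY = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing (RTX 20 series)
}


def toolkit_supports(compute_capability, cuda_version):
    """Return True if cuda_version is new enough for compute_capability.

    Both arguments are (major, minor) tuples. Raises KeyError for a
    compute capability that is missing from the table above.
    """
    required = MIN_CUDA_FOR_COMPUTE_CAPABILITY[compute_capability]
    return cuda_version >= required


if __name__ == "__main__":
    # Values taken from the question and from the setup listed above.
    print("cudatoolkit 9.2 on 7.5:", toolkit_supports((7, 5), (9, 2)))
    print("CUDA 10.0 on 7.5:", toolkit_supports((7, 5), (10, 0)))
```

A failed check only says that the toolkit predates the GPU architecture. It does not prove that this is what crashes Conv2D.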

The best advice I found was in this issue: https://github.com/tensorflow/tensorflow/issues/24383

According to it, a fix should come with TensorFlow 1.13. In the meantime, using the TensorFlow 1.13 nightly build (Dec 26, 2018) together with tensorflow.keras instead of keras solved the issue.

answered Oct 18 '22 21:10

Francois Robert
