Here is a basic TensorFlow network example (based on the MNIST tutorial), complete code, which gives roughly 0.92 accuracy:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()  # tf.initialize_all_variables() is the older, deprecated name for this
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
Question: Why does adding an extra layer, as in the code below, make it so much worse that accuracy drops to about 0.11?
W = tf.Variable(tf.zeros([784, 100]))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)
W2 = tf.Variable(tf.zeros([100, 10]))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)
The example does not properly initialise the weights, but without a hidden layer it turns out that the effective linear softmax regression the demo performs is unaffected by that choice: setting them all to zero is safe, but only for a single-layer network.
When you make a deeper network, though, this is a disastrous choice. You must use non-equal initialisation of the neural network weights, and the usual quick way to do this is randomly.
Try this:
W = tf.Variable(tf.random_uniform([784, 100], -0.01, 0.01))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)
W2 = tf.Variable(tf.random_uniform([100, 10], -0.01, 0.01))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)
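Any small random initialisation works here. As an alternative sketch, the TensorFlow 1.x MNIST tutorials initialise weights with a truncated normal, roughly like this (the stddev=0.1 value is a common default, not something tuned for this network):
W = tf.Variable(tf.truncated_normal([784, 100], stddev=0.1))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)
W2 = tf.Variable(tf.truncated_normal([100, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)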
The reason you need these non-identical weights is to do with how back-propagation works: the values of the weights in a layer determine how that layer's gradients are calculated. If all the weights are the same, then all the gradients will be the same, which in turn means all the weight updates are the same. Everything changes in lockstep, and the behaviour is similar to having a single neuron in the hidden layer (because you have multiple neurons all with identical parameters), which can effectively only choose one class.
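To see the lockstep behaviour concretely, here is a minimal NumPy sketch (not the poster's TensorFlow code) of one forward/backward pass through a two-layer network whose weights are all set to the same non-zero constant; the shapes, toy data and the 0.5 constant are arbitrary choices for illustration:
import numpy as np

np.random.seed(0)
x = np.random.rand(8, 4)                    # toy batch: 8 samples, 4 features
t = np.eye(3)[np.random.randint(0, 3, 8)]   # one-hot targets, 3 classes

W = np.full((4, 5), 0.5)                    # every hidden weight identical
b = np.zeros(5)
W2 = np.full((5, 3), 0.5)                   # every output weight identical
b2 = np.zeros(3)

# forward pass
h = np.maximum(x.dot(W) + b, 0)             # ReLU hidden layer
logits = h.dot(W2) + b2
y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# backward pass for softmax + cross-entropy
d_logits = (y - t) / len(x)
grad_W2 = h.T.dot(d_logits)
d_h = d_logits.dot(W2.T) * (h > 0)
grad_W = x.T.dot(d_h)

# every hidden unit receives exactly the same gradient, so the units never differentiate
print(np.allclose(grad_W, grad_W[:, :1]))   # True: all columns of grad_W are equal
print(np.allclose(grad_W2, grad_W2[:1, :])) # True: all rows of grad_W2 are equal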
Neil has explained nicely how to fix your problem; I will add a little explanation of why this happens.
The problem is not so much that the gradients are all the same, but that all of them are 0. This happens because relu(Wx + b) = 0 when W = 0 and b = 0. There is even a name for this: a dead neuron.
The network does not progress at all, and it does not matter whether you train it for 1 step or for 1 million. The results will not be different from a random choice, and you see that in your accuracy of 0.11 (if you select classes at random, you get about 0.10).
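The zero gradients are easy to check by hand with a small NumPy sketch of the same two-layer architecture (the toy batch and shapes below are just illustrative assumptions):
import numpy as np

np.random.seed(0)
x = np.random.rand(8, 784)                    # toy batch of 8 "images"
t = np.eye(10)[np.random.randint(0, 10, 8)]   # one-hot targets

W = np.zeros((784, 100))
b = np.zeros(100)
W2 = np.zeros((100, 10))
b2 = np.zeros(10)

h = np.maximum(x.dot(W) + b, 0)               # 0 everywhere: dead neurons
y = np.exp(h.dot(W2) + b2)
y /= y.sum(axis=1, keepdims=True)             # uniform 0.1 for every class

d_logits = (y - t) / len(x)
grad_W2 = h.T.dot(d_logits)                   # 0, because h is 0
d_h = d_logits.dot(W2.T) * (h > 0)            # 0, because W2 is 0 and relu'(0) = 0
grad_W = x.T.dot(d_h)                         # 0
grad_b = d_h.sum(axis=0)                      # 0

print(grad_W.any(), grad_b.any(), grad_W2.any())  # False False False: these weights never update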