
Why does adding one more layer to the simple TensorFlow neural net example break it?

Here is a basic TensorFlow network example (based on MNIST), with complete code, that gives roughly 0.92 accuracy:

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

y_ = tf.placeholder(tf.float32, [None, 10])

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
sess = tf.InteractiveSession()
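tf.global_variables_initializer().run()
# On older TensorFlow releases that lack global_variables_initializer,
# use tf.initialize_all_variables().run() instead (run only one of the two).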

for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

Question: Why does adding an extra layer, as in the code below, make it so much worse that accuracy drops to about 0.11?

W = tf.Variable(tf.zeros([784, 100]))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)

W2 = tf.Variable(tf.zeros([100, 10]))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)

Massyanya asked Jan 03 '23 20:01



2 Answers

The example does not initialise its weights properly, but without a hidden layer that turns out not to matter: the demo is effectively linear softmax regression, which is unaffected by that choice. Setting all the weights to zero is safe, but only for a single-layer network.
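
As a rough illustration of why zeros are harmless for the single layer, here is a minimal NumPy sketch (not the TensorFlow code from the question). It assumes a toy batch of 8 random inputs with 5 features and 3 classes in place of MNIST, and all variable names are made up for the example. Starting from zeros the softmax output is uniform, but the gradient for W is built from the inputs and the per-class errors, so it is nonzero and differs from one class column to the next.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy stand-in for MNIST (assumption): 8 inputs, 5 features, 3 classes.
rng = np.random.default_rng(0)
x = rng.random((8, 5))
y_true = np.eye(3)[[0, 1, 2, 0, 1, 2, 0, 1]]  # one-hot targets

# Zero initialisation, as in the single-layer example.
W = np.zeros((5, 3))
b = np.zeros(3)

y = softmax(x @ W + b)          # uniform probabilities for every input
delta = (y - y_true) / len(x)   # gradient of the mean cross-entropy w.r.t. the logits
grad_W = x.T @ delta            # depends on the inputs, so it is not zero
grad_b = delta.sum(axis=0)

print("largest |grad_W| entry:", np.abs(grad_W).max())
print("grad_W, one column per class:")
print(grad_W.round(3))
```

Because each class column gets its own gradient, the columns of W move apart on the very first update, and no random symmetry breaking is needed.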

When you make a deeper network though, this is a disastrous choice. You must use non-equal initialisation of neural network weights, and the usual quick way to do this is randomly.

Try this:

W = tf.Variable(tf.random_uniform([784, 100], -0.01, 0.01))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)

W2 = tf.Variable(tf.random_uniform([100, 10], -0.01, 0.01))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)

The reason you need these non-identical weights has to do with how backpropagation works: the values of the weights in a layer determine the gradients that layer calculates. If all the weights are the same, then all the gradients will be the same, which in turn means that all the weight updates are the same. Everything changes in lockstep, and the network behaves as if it had a single neuron in the hidden layer (because the multiple neurons all have identical parameters), which can effectively only choose one class.
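
The lockstep effect can be sketched in NumPy on a toy problem. This is an illustration under stated assumptions, not the code from the question: the 5-4-3 layer sizes, the random toy data, the learning rate of 0.1 and the helper names `gradients` and `spread` are all invented for the example. The weights are nonzero but identical across hidden units, and the random ranges are chosen positive so that every relu unit is active.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gradients(x, y_true, W, b, W2, b2):
    """Backprop through a relu hidden layer and softmax output, mean cross-entropy loss."""
    pre = x @ W + b
    h = np.maximum(pre, 0.0)
    delta = (softmax(h @ W2 + b2) - y_true) / len(x)  # d loss / d logits
    grad_pre = (delta @ W2.T) * (pre > 0)             # relu passes gradient only where pre > 0
    return x.T @ grad_pre, grad_pre.sum(axis=0), h.T @ delta, delta.sum(axis=0)

def spread(a, axis):
    """Largest difference between hidden units along the given axis."""
    return (a.max(axis=axis) - a.min(axis=axis)).max()

rng = np.random.default_rng(0)
x = rng.random((8, 5))
y_true = np.eye(3)[[0, 1, 2, 0, 1, 2, 0, 1]]

# Symmetric start: all 4 hidden units share the same incoming and outgoing weights.
W = np.tile(rng.uniform(0.1, 0.5, size=(5, 1)), (1, 4))
b = np.zeros(4)
W2 = np.tile(rng.uniform(-0.5, 0.5, size=(1, 3)), (4, 1))
b2 = np.zeros(3)
W_start = W.copy()

for _ in range(20):  # plain full-batch gradient descent
    gW, gb, gW2, gb2 = gradients(x, y_true, W, b, W2, b2)
    W, b, W2, b2 = W - 0.1 * gW, b - 0.1 * gb, W2 - 0.1 * gW2, b2 - 0.1 * gb2

# The weights have moved, but the hidden units are still copies of each other.
symmetric_spread = max(spread(W, axis=1), spread(W2, axis=0))

# Random start: each hidden unit has its own weights, so its gradient differs too.
W_r = rng.uniform(0.1, 0.5, size=(5, 4))
W2_r = rng.uniform(-0.5, 0.5, size=(4, 3))
gW_r, _, _, _ = gradients(x, y_true, W_r, np.zeros(4), W2_r, np.zeros(3))
random_spread = spread(gW_r, axis=1)

print("weights changed during training:", not np.allclose(W, W_start))
print("spread across hidden units after symmetric start:", symmetric_spread)
print("spread of first gradient after random start:", random_spread)
```

In this sketch the symmetric network trains, yet its four hidden units remain interchangeable, so it has the capacity of one unit. A random start gives each unit a different gradient from the first step.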

Neil Slater answered Jan 14 '23 13:01



Neil explained nicely how to fix your problem; I will add a little explanation of why this happens.

The problem is not only that the gradients are all the same, but also that all of them are 0. This happens because relu(Wx + b) = 0 when W = 0 and b = 0. With the hidden activations at 0 the gradient for W2 is 0, and with W2 = 0 no gradient flows back to W and b either. There is even a name for this: a dead neuron.
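
To make the zero gradients concrete, here is a small NumPy sketch of one backward pass through an all-zero two-layer network. It is an illustration, not the TensorFlow graph from the question: the toy sizes (5 inputs, 4 hidden units, 3 classes), the random data and the variable names are assumptions, and the relu derivative at exactly 0 is taken to be 0.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.random((8, 5))
y_true = np.eye(3)[[0, 1, 2, 0, 1, 2, 0, 1]]

# All-zero initialisation, as in the question (toy sizes instead of 784/100/10).
W, b = np.zeros((5, 4)), np.zeros(4)
W2, b2 = np.zeros((4, 3)), np.zeros(3)

# Forward pass.
pre = x @ W + b
h = np.maximum(pre, 0.0)                 # relu(0) = 0 for every unit and every input
y = softmax(h @ W2 + b2)

# Backward pass for the mean cross-entropy loss.
delta = (y - y_true) / len(x)            # gradient w.r.t. the logits
grad_W2 = h.T @ delta                    # zero because h is zero
grad_b2 = delta.sum(axis=0)              # the only gradient that is not zero
grad_pre = (delta @ W2.T) * (pre > 0)    # zero because W2 is zero
grad_W = x.T @ grad_pre
grad_b = grad_pre.sum(axis=0)

for name, g in [("W", grad_W), ("b", grad_b), ("W2", grad_W2), ("b2", grad_b2)]:
    print(name, "max |gradient| =", np.abs(g).max())
```

Note that `delta @ W2.T` is already zero before the relu mask is applied, so the result does not depend on how the derivative at 0 is defined.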

The weights do not progress at all, and it does not matter whether you train for 1 step or for 1 million. (Only the output bias b2 still receives a gradient, and a bias alone gives the same prediction for every input.) The results will be no better than random guessing, as you can see from your accuracy of 0.11 (if you select classes at random you will get about 0.10).
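
The claim that the number of training steps does not matter can be checked with a short NumPy sketch. It assumes full-batch gradient descent on toy data (8 random inputs, 3 classes, class 1 most frequent) rather than MNIST mini-batches, and `train_from_zeros` and `predict` are hypothetical helpers written for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_from_zeros(x, y_true, steps, lr=0.5, hidden=4):
    """Gradient descent on a relu + softmax network whose parameters all start at zero."""
    W, b = np.zeros((x.shape[1], hidden)), np.zeros(hidden)
    W2, b2 = np.zeros((hidden, y_true.shape[1])), np.zeros(y_true.shape[1])
    for _ in range(steps):
        pre = x @ W + b
        h = np.maximum(pre, 0.0)
        delta = (softmax(h @ W2 + b2) - y_true) / len(x)
        grad_pre = (delta @ W2.T) * (pre > 0)
        W, b = W - lr * (x.T @ grad_pre), b - lr * grad_pre.sum(axis=0)
        W2, b2 = W2 - lr * (h.T @ delta), b2 - lr * delta.sum(axis=0)
    return W, b, W2, b2

def predict(x, W, b, W2, b2):
    return np.argmax(np.maximum(x @ W + b, 0.0) @ W2 + b2, axis=1)

rng = np.random.default_rng(0)
x = rng.random((8, 5))
labels = np.array([0, 1, 1, 1, 2, 1, 0, 1])  # class 1 is the most frequent label
y_true = np.eye(3)[labels]

for steps in (1, 1000):
    W, b, W2, b2 = train_from_zeros(x, y_true, steps)
    preds = predict(x, W, b, W2, b2)
    weights_still_zero = not (W.any() or b.any() or W2.any())
    print(steps, "steps | weights still zero:", weights_still_zero,
          "| predictions:", preds, "| accuracy:", (preds == labels).mean())
```

In this toy run the network ends up predicting the most frequent class for every input, however long it trains. On ten roughly balanced classes a constant prediction like that scores close to the 0.10 of random guessing.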

Salvador Dali answered Jan 14 '23 15:01
