Here is a basic TensorFlow network example (based on the MNIST tutorial), complete code, which gives roughly 0.92 accuracy:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()  # tf.initialize_all_variables() is the older, deprecated name for this
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
Question: Why does adding an extra layer, as in the code below, make it so much worse that accuracy drops to about 0.11?
W = tf.Variable(tf.zeros([784, 100]))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)
W2 = tf.Variable(tf.zeros([100, 10]))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)
The example does not properly initialise the weights, but without a hidden layer it turns out that the effective linear softmax regression the demo performs is unaffected by that choice: setting them all to zero is safe, but only for a single-layer network.
When you make a deeper network, though, this is a disastrous choice. You must use non-equal initialisation of the neural network weights, and the usual quick way to do this is randomly.
Try this:
W = tf.Variable(tf.random_uniform([784, 100], -0.01, 0.01))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)
W2 = tf.Variable(tf.random_uniform([100, 10], -0.01, 0.01))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)
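Any small random initialisation works here. As an alternative sketch, the TensorFlow 1.x MNIST tutorials initialise weights with a truncated normal, roughly like this (the stddev=0.1 value is a common default, not something tuned for this network):
W = tf.Variable(tf.truncated_normal([784, 100], stddev=0.1))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)
W2 = tf.Variable(tf.truncated_normal([100, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)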
The reason you need these non-identical weights is to do with how back-propagation works: the values of the weights in a layer determine how that layer's gradients are calculated. If all the weights are the same, then all the gradients will be the same, which in turn means all the weight updates are the same. Everything changes in lockstep, and the behaviour is similar to having a single neuron in the hidden layer (because you have multiple neurons all with identical parameters), which can effectively only choose one class.
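To see the lockstep behaviour concretely, here is a minimal NumPy sketch (not the poster's TensorFlow code) of one forward/backward pass through a two-layer network whose weights are all set to the same non-zero constant; the shapes, toy data and the 0.5 constant are arbitrary choices for illustration:
import numpy as np

np.random.seed(0)
x = np.random.rand(8, 4)                    # toy batch: 8 samples, 4 features
t = np.eye(3)[np.random.randint(0, 3, 8)]   # one-hot targets, 3 classes

W = np.full((4, 5), 0.5)                    # every hidden weight identical
b = np.zeros(5)
W2 = np.full((5, 3), 0.5)                   # every output weight identical
b2 = np.zeros(3)

# forward pass
h = np.maximum(x.dot(W) + b, 0)             # ReLU hidden layer
logits = h.dot(W2) + b2
y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# backward pass for softmax + cross-entropy
d_logits = (y - t) / len(x)
grad_W2 = h.T.dot(d_logits)
d_h = d_logits.dot(W2.T) * (h > 0)
grad_W = x.T.dot(d_h)

# every hidden unit receives exactly the same gradient, so the units never differentiate
print(np.allclose(grad_W, grad_W[:, :1]))   # True: all columns of grad_W are equal
print(np.allclose(grad_W2, grad_W2[:1, :])) # True: all rows of grad_W2 are equal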
Neil has explained nicely how to fix your problem; I will add a little explanation of why this happens.
The problem is not so much that the gradients are all the same, but that all of them are 0. This happens because relu(Wx + b) = 0 when W = 0 and b = 0. There is even a name for this: a dead neuron.
The network does not progress at all, and it does not matter whether you train it for 1 step or for 1 million. The results will not be different from a random choice, and you see that in your accuracy of 0.11 (if you select classes at random, you get about 0.10).
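The zero gradients are easy to check by hand with a small NumPy sketch of the same two-layer architecture (the toy batch and shapes below are just illustrative assumptions):
import numpy as np

np.random.seed(0)
x = np.random.rand(8, 784)                    # toy batch of 8 "images"
t = np.eye(10)[np.random.randint(0, 10, 8)]   # one-hot targets

W = np.zeros((784, 100))
b = np.zeros(100)
W2 = np.zeros((100, 10))
b2 = np.zeros(10)

h = np.maximum(x.dot(W) + b, 0)               # 0 everywhere: dead neurons
y = np.exp(h.dot(W2) + b2)
y /= y.sum(axis=1, keepdims=True)             # uniform 0.1 for every class

d_logits = (y - t) / len(x)
grad_W2 = h.T.dot(d_logits)                   # 0, because h is 0
d_h = d_logits.dot(W2.T) * (h > 0)            # 0, because W2 is 0 and relu'(0) = 0
grad_W = x.T.dot(d_h)                         # 0
grad_b = d_h.sum(axis=0)                      # 0

print(grad_W.any(), grad_b.any(), grad_W2.any())  # False False False: these weights never update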