I'm new to TensorFlow and have been looking at the examples here. I wanted to rewrite the multilayer perceptron classification model as a regression model. However, I encountered some strange behaviour when modifying the loss function. It works fine with tf.reduce_mean, but if I try using tf.reduce_sum it gives NaNs in the output. This seems very strange, since the functions are so similar: the only difference is that the mean divides the sum by the number of elements. So how could NaNs be introduced by this change?
import tensorflow as tf
import numpy as np

# Parameters
learning_rate = 0.001

# Network Parameters
n_hidden_1 = 32  # 1st layer number of features
n_hidden_2 = 32  # 2nd layer number of features
n_input = 2      # number of inputs
n_output = 1     # number of outputs

# Make artificial data
SAMPLES = 1000
X = np.random.rand(SAMPLES, n_input)
T = np.c_[X[:, 0]**2 + np.sin(X[:, 1])]

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_output])

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with tanh activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.tanh(layer_1)
    # Hidden layer with tanh activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.tanh(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_output]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_output]))
}

pred = multilayer_perceptron(x, weights, biases)

# Define loss and optimizer
#se = tf.reduce_sum(tf.square(pred - y))  # Why does this give nans?
mse = tf.reduce_mean(tf.square(pred - y)) # When this doesn't?
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(mse)

# Initializing the variables
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

training_epochs = 10
display_step = 1

# Training cycle
for epoch in range(training_epochs):
    avg_cost = 0.
    # Loop over all batches
    for i in range(100):
        # Run optimization op (backprop) and cost op (to get loss value)
        _, msev = sess.run([optimizer, mse], feed_dict={x: X, y: T})
    # Display logs per epoch step
    if epoch % display_step == 0:
        print("Epoch:", '%04d' % (epoch + 1), "mse=",
              "{:.9f}".format(msev))
The problematic variable se is commented out. It should be used in place of mse.
With mse the output looks like this:
Epoch: 0001 mse= 0.051669389
Epoch: 0002 mse= 0.031438075
Epoch: 0003 mse= 0.026629323
...
and with se it ends up like this:
Epoch: 0001 se= nan
Epoch: 0002 se= nan
Epoch: 0003 se= nan
...
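For reference, the premise in the question can be checked directly: reduce_mean really is just reduce_sum divided by the element count. A minimal sketch, reusing pred, y, sess, X and T from the code above:

# Sanity check: the two reductions differ only by a constant factor
# (the number of elements, i.e. the batch size here).
sq_err = tf.square(pred - y)
mean_from_sum = tf.reduce_sum(sq_err) / tf.cast(tf.size(sq_err), tf.float32)
print(sess.run([tf.reduce_mean(sq_err), mean_from_sum], feed_dict={x: X, y: T}))
# Both values are identical; only the scale of the loss (and its gradient) changes.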
If you use reduce_sum instead of reduce_mean, the gradient is much larger. Therefore, you should correspondingly reduce the learning rate to make sure the training process can proceed properly.
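For example, since the full 1000-sample set is fed at every step in the question's code, one way to follow this advice is to shrink the learning rate by the batch size when switching to the summed loss. A sketch, reusing the variables defined above:

batch_size = 1000  # SAMPLES fed per step in the question's code
se = tf.reduce_sum(tf.square(pred - y))
# Scale the step size down by the same factor the gradient grew by.
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate / batch_size).minimize(se)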
The loss from summing across the batch is 1000 times larger (from skimming the code, your training batch size is 1000), so your gradients and parameter updates are also 1000 times larger. The larger updates apparently lead to NaNs.
Generally, learning rates are expressed per example, so the loss used to compute the gradients for the updates should be per example as well. If the loss is per batch, then the learning rate needs to be reduced by the batch size to get comparable training results.
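An equivalent fix is to keep the learning rate as it is and normalise the summed loss by the number of examples instead, which recovers tf.reduce_mean exactly. A sketch, assuming the same full-dataset feed as above:

# Divide the per-batch (summed) loss by the number of examples so it is
# expressed per example again, matching how the learning rate is specified.
n_examples = tf.cast(tf.shape(x)[0], tf.float32)
per_example_loss = tf.reduce_sum(tf.square(pred - y)) / n_examples
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(per_example_loss)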