I'm new to TensorFlow and have been looking at the examples here. I wanted to rewrite the multilayer perceptron classification model as a regression model. However, I encountered some strange behaviour when modifying the loss function. It works fine with tf.reduce_mean, but if I try using tf.reduce_sum it gives NaNs in the output. This seems very strange, since the functions are so similar: the only difference is that the mean divides the sum by the number of elements. So how could NaNs be introduced by this change?
import tensorflow as tf
import numpy as np

# Parameters
learning_rate = 0.001

# Network Parameters
n_hidden_1 = 32  # 1st layer number of features
n_hidden_2 = 32  # 2nd layer number of features
n_input = 2      # number of inputs
n_output = 1     # number of outputs

# Make artificial data
SAMPLES = 1000
X = np.random.rand(SAMPLES, n_input)
T = np.c_[X[:, 0]**2 + np.sin(X[:, 1])]

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_output])

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with tanh activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.tanh(layer_1)
    # Hidden layer with tanh activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.tanh(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_output]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_output]))
}

pred = multilayer_perceptron(x, weights, biases)

# Define loss and optimizer
#se = tf.reduce_sum(tf.square(pred - y))  # Why does this give nans?
mse = tf.reduce_mean(tf.square(pred - y)) # When this doesn't?
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(mse)

# Initializing the variables
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

training_epochs = 10
display_step = 1

# Training cycle
for epoch in range(training_epochs):
    avg_cost = 0.
    # Loop over all batches
    for i in range(100):
        # Run optimization op (backprop) and cost op (to get loss value)
        _, msev = sess.run([optimizer, mse], feed_dict={x: X, y: T})
    # Display logs per epoch step
    if epoch % display_step == 0:
        print("Epoch:", '%04d' % (epoch + 1), "mse=",
              "{:.9f}".format(msev))
The problematic variable se is commented out. It should be used in place of mse.
With mse the output looks like this:
Epoch: 0001 mse= 0.051669389
Epoch: 0002 mse= 0.031438075
Epoch: 0003 mse= 0.026629323
...
and with se it ends up like this:
Epoch: 0001 se= nan
Epoch: 0002 se= nan
Epoch: 0003 se= nan
...
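For reference, the premise in the question can be checked directly: reduce_mean really is just reduce_sum divided by the element count. A minimal sketch, reusing pred, y, sess, X and T from the code above:

# Sanity check: the two reductions differ only by a constant factor
# (the number of elements, i.e. the batch size here).
sq_err = tf.square(pred - y)
mean_from_sum = tf.reduce_sum(sq_err) / tf.cast(tf.size(sq_err), tf.float32)
print(sess.run([tf.reduce_mean(sq_err), mean_from_sum], feed_dict={x: X, y: T}))
# Both values are identical; only the scale of the loss (and its gradient) changes.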
If you use reduce_sum instead of reduce_mean, the gradient is much larger. Therefore, you should correspondingly reduce the learning rate to make sure the training process can proceed properly.
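For example, since the full 1000-sample set is fed at every step in the question's code, one way to follow this advice is to shrink the learning rate by the batch size when switching to the summed loss. A sketch, reusing the variables defined above:

batch_size = 1000  # SAMPLES fed per step in the question's code
se = tf.reduce_sum(tf.square(pred - y))
# Scale the step size down by the same factor the gradient grew by.
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate / batch_size).minimize(se)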
The loss from summing across the batch is 1000 times larger (from skimming the code, your training batch size is 1000), so your gradients and parameter updates are also 1000 times larger. The larger updates apparently lead to NaNs.
Generally, learning rates are expressed per example, so the loss used to compute the gradients for the updates should be per example as well. If the loss is per batch, then the learning rate needs to be reduced by the batch size to get comparable training results.
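An equivalent fix is to keep the learning rate as it is and normalise the summed loss by the number of examples instead, which recovers tf.reduce_mean exactly. A sketch, assuming the same full-dataset feed as above:

# Divide the per-batch (summed) loss by the number of examples so it is
# expressed per example again, matching how the learning rate is specified.
n_examples = tf.cast(tf.shape(x)[0], tf.float32)
per_example_loss = tf.reduce_sum(tf.square(pred - y)) / n_examples
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(per_example_loss)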