Backpropagation with Rectified Linear Units

I have written some code to implement backpropagation in a deep neural network with the logistic activation function and softmax output.

def backprop_deep(node_values, targets, weight_matrices):
    # Output layer: with softmax output and cross-entropy loss, the error
    # signal is simply (output - target).
    delta_nodes = node_values[-1] - targets
    delta_weights = delta_nodes.T.dot(node_values[-2])
    weight_updates = [delta_weights]
    # Walk backwards through the hidden layers; [:, :-1] drops the bias column.
    for i in xrange(-2, -len(weight_matrices) - 1, -1):
        delta_nodes = dsigmoid(node_values[i][:,:-1]) * delta_nodes.dot(weight_matrices[i+1])[:,:-1]
        delta_weights = delta_nodes.T.dot(node_values[i-1])
        weight_updates.insert(0, delta_weights)
    return weight_updates
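
For reference, dsigmoid is defined elsewhere in my code; assuming node_values holds the post-activation outputs of each layer, a minimal version would be:

import numpy as np

def dsigmoid(activations):
    # Derivative of the logistic function written in terms of its output:
    # sigma'(x) = sigma(x) * (1 - sigma(x)).
    return activations * (1.0 - activations)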

The code works well, but when I switched to ReLU as the activation function it stopped working. In the backprop routine I only changed the derivative of the activation function:

def backprop_relu(node_values, targets, weight_matrices):
    # Output layer is unchanged: softmax output with cross-entropy error.
    delta_nodes = node_values[-1] - targets
    delta_weights = delta_nodes.T.dot(node_values[-2])
    weight_updates = [delta_weights]
    for i in xrange(-2, -len(weight_matrices) - 1, -1):
        # ReLU derivative: 1 where the node is active, 0 otherwise.
        delta_nodes = (node_values[i]>0)[:,:-1] * delta_nodes.dot(weight_matrices[i+1])[:,:-1]
        delta_weights = delta_nodes.T.dot(node_values[i-1])
        weight_updates.insert(0, delta_weights)
    return weight_updates
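
For context, node_values comes from a forward pass in which each hidden layer's activations carry an appended bias column of ones (hence the [:, :-1] slicing). A simplified sketch of that layout, not my exact network code, would be:

import numpy as np

def relu_forward(inputs, weight_matrices):
    # Hidden layers: ReLU activations with a column of ones appended for the
    # bias, which is why backprop strips the last column with [:, :-1].
    ones = np.ones((inputs.shape[0], 1))
    node_values = [np.hstack((inputs, ones))]
    for W in weight_matrices[:-1]:
        hidden = np.maximum(node_values[-1].dot(W.T), 0.0)
        node_values.append(np.hstack((hidden, ones)))
    # Output layer: softmax probabilities, no bias column, so the result can
    # be compared directly with the one-hot targets.
    logits = node_values[-1].dot(weight_matrices[-1].T)
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))
    node_values.append(exps / exps.sum(axis=1, keepdims=True))
    return node_values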

However, the network no longer learns, and the weights quickly go to zero and stay there. I am totally stumped.

1 Answer

Although I have determined the source of the problem, I'm going to leave this up in case it might be of benefit to someone else.

The problem was that I did not adjust the scale of the initial weights when I changed activation functions. While logistic networks learn very well when node inputs are near zero and the logistic function is approximately linear, ReLU networks learn well for moderately large inputs to nodes. The small weight initialization used in logistic networks is therefore not necessary, and in fact harmful. The behavior I was seeing was the ReLU network ignoring the features and attempting to learn the bias of the training set exclusively.

I am currently using initial weights distributed uniformly between -0.5 and 0.5 on the MNIST dataset, and the network is learning very quickly.
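
A minimal sketch of that initialization (the layer sizes below are just an example, not my exact architecture):

import numpy as np

def init_weights(layer_sizes, low=-0.5, high=0.5):
    # Uniform weights in [-0.5, 0.5]; each matrix maps one layer plus its
    # bias unit (hence the +1) to the next layer.
    return [np.random.uniform(low, high, size=(n_out, n_in + 1))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Example: a 784-300-10 network for MNIST.
weight_matrices = init_weights([784, 300, 10])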
