Neural network backpropagation with ReLU

I am trying to implement a neural network with ReLU.

input layer -> 1 hidden layer -> relu -> output layer -> softmax layer

Above is the architecture of my neural network. I am confused about the backpropagation of this ReLU. For the derivative of ReLU: if x <= 0, output is 0. if x > 0, output is 1. So when I calculate the gradient, does that mean I kill gradient descent if x <= 0?

Can someone explain the backpropagation of my neural network architecture 'step by step'?

asked Sep 13 '15 by Danny

2 Answers

if x <= 0, output is 0. if x > 0, output is 1

The ReLU function is defined as f(x) = max(0, x), i.e. for x > 0 the output is x, and otherwise it is 0.

So for the derivative f'(x) it's actually:

if x < 0, output is 0. if x > 0, output is 1.

The derivative f'(0) is not defined. So it's usually set to 0, or you modify the activation function to be f(x) = max(e, x) for a small e.
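
For concreteness, here is a minimal NumPy sketch of ReLU and its derivative under that convention (picking 0 at x == 0 is an assumption, matching the usual choice):

    import numpy as np

    def relu(x):
        # f(x) = max(0, x), applied element-wise
        return np.maximum(0, x)

    def relu_derivative(x):
        # 1 where x > 0, 0 where x < 0; f'(0) is undefined,
        # so this picks 0 by convention
        return np.where(x > 0, 1.0, 0.0)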

Generally: A ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer but except tanh(x), sigmoid(x) or whatever activation you use, you'll instead use f(x) = max(0,x).

If you have written code for a working multilayer network with sigmoid activation, it's literally a one-line change (see the sketch below). Nothing about forward- or back-propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
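
As a rough sketch of what such a network might look like (the function names, shapes, and the softmax/cross-entropy pairing are illustrative assumptions, not the asker's actual code), the ReLU only shows up in two places: the activation in the forward pass and its derivative in the backward pass:

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def forward(X, W1, b1, W2, b2):
        z1 = X @ W1 + b1      # hidden pre-activation
        a1 = relu(z1)         # the only line that differs from a sigmoid network
        z2 = a1 @ W2 + b2     # output pre-activation
        probs = softmax(z2)
        return z1, a1, probs

    def backward(X, y_onehot, z1, a1, probs, W2):
        n = X.shape[0]
        dz2 = (probs - y_onehot) / n   # softmax + cross-entropy gradient
        dW2 = a1.T @ dz2
        db2 = dz2.sum(axis=0)
        da1 = dz2 @ W2.T
        dz1 = da1 * (z1 > 0)           # ReLU derivative: pass gradient only where z1 > 0
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
        return dW1, db1, dW2, db2

Swapping relu for a sigmoid (and (z1 > 0) for the sigmoid's derivative a1 * (1 - a1)) gives back the sigmoid version, which is the one-line change described above.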

answered Sep 28 '22 by runDOSrun


If you have a layer made out of a single ReLU, as your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will then return either 0 or 0.5 if you're using logistic units, and the softmax will squash those. So under your current architecture, a value of 0 doesn't make much sense for the forward propagation part either.

See for example this. What you can do is use a "leaky ReLU", which has a small slope (such as 0.01) for x < 0 instead of a hard 0, as sketched below.
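
A minimal sketch of that idea, assuming the commonly used 0.01 slope:

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # x for x > 0, alpha * x otherwise, so some gradient always flows
        return np.where(x > 0, x, alpha * x)

    def leaky_relu_derivative(x, alpha=0.01):
        # 1 for x > 0, alpha otherwise
        return np.where(x > 0, 1.0, alpha)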

I would reconsider this architecture, however; it doesn't make much sense to me to feed a single ReLU into a bunch of other units and then apply a softmax.

answered Sep 28 '22 by IVlad