Neural network backpropagation with ReLU

I am trying to implement a neural network with ReLU.

input layer -> 1 hidden layer -> relu -> output layer -> softmax layer

Above is the architecture of my neural network. I am confused about the backpropagation of this ReLU. For the derivative of ReLU: if x <= 0, output is 0. if x > 0, output is 1. So when I calculate the gradient, does that mean I kill gradient descent if x <= 0?

Can someone explain the backpropagation of my neural network architecture 'step by step'?

asked Sep 13 '15 by Danny

2 Answers

if x <= 0, output is 0. if x > 0, output is 1

The ReLU function is defined as f(x) = max(0, x), i.e. for x > 0 the output is x, and otherwise it is 0.

So for the derivative f'(x) it's actually:

if x < 0, output is 0. if x > 0, output is 1.

The derivative f'(0) is not defined. So it's usually set to 0, or you modify the activation function to be f(x) = max(e, x) for a small e.
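
For concreteness, here is a minimal NumPy sketch of ReLU and its derivative under that convention (picking 0 at x == 0 is an assumption, matching the usual choice):

    import numpy as np

    def relu(x):
        # f(x) = max(0, x), applied element-wise
        return np.maximum(0, x)

    def relu_derivative(x):
        # 1 where x > 0, 0 where x < 0; f'(0) is undefined,
        # so this picks 0 by convention
        return np.where(x > 0, 1.0, 0.0)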

Generally: A ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer but except tanh(x), sigmoid(x) or whatever activation you use, you'll instead use f(x) = max(0,x).

If you have written code for a working multilayer network with sigmoid activation, it's literally a one-line change (see the sketch below). Nothing about forward- or back-propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
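
As a rough sketch of what such a network might look like (the function names, shapes, and the softmax/cross-entropy pairing are illustrative assumptions, not the asker's actual code), the ReLU only shows up in two places: the activation in the forward pass and its derivative in the backward pass:

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def forward(X, W1, b1, W2, b2):
        z1 = X @ W1 + b1      # hidden pre-activation
        a1 = relu(z1)         # the only line that differs from a sigmoid network
        z2 = a1 @ W2 + b2     # output pre-activation
        probs = softmax(z2)
        return z1, a1, probs

    def backward(X, y_onehot, z1, a1, probs, W2):
        n = X.shape[0]
        dz2 = (probs - y_onehot) / n   # softmax + cross-entropy gradient
        dW2 = a1.T @ dz2
        db2 = dz2.sum(axis=0)
        da1 = dz2 @ W2.T
        dz1 = da1 * (z1 > 0)           # ReLU derivative: pass gradient only where z1 > 0
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
        return dW1, db1, dW2, db2

Swapping relu for a sigmoid (and (z1 > 0) for the sigmoid's derivative a1 * (1 - a1)) gives back the sigmoid version, which is the one-line change described above.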

answered Sep 28 '22 by runDOSrun


If you have a layer made out of a single ReLU, as your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will then return either 0 or 0.5 if you're using logistic units, and the softmax will squash those. So under your current architecture, a value of 0 doesn't make much sense for the forward propagation part either.

See for example this. What you can do is use a "leaky ReLU", which has a small slope (such as 0.01) for x < 0 instead of a hard 0, as sketched below.
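
A minimal sketch of that idea, assuming the commonly used 0.01 slope:

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # x for x > 0, alpha * x otherwise, so some gradient always flows
        return np.where(x > 0, x, alpha * x)

    def leaky_relu_derivative(x, alpha=0.01):
        # 1 for x > 0, alpha otherwise
        return np.where(x > 0, 1.0, alpha)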

I would reconsider this architecture, however; it doesn't make much sense to me to feed a single ReLU into a bunch of other units and then apply a softmax.

answered Sep 28 '22 by IVlad