
How does the back-propagation algorithm deal with non-differentiable activation functions?


While digging through the topic of neural networks and how to train them efficiently, I came across the method of using very simple activation functions, such as the rectified linear unit (ReLU), instead of the classic smooth sigmoids. The ReLU function is not differentiable at the origin, so according to my understanding the backpropagation algorithm (BPA) is not suitable for training a neural network with ReLUs, since the chain rule of multivariable calculus refers to smooth functions only. However, none of the papers about using ReLUs that I have read address this issue. ReLUs seem to be very effective and are used virtually everywhere without causing any unexpected behavior. Can somebody explain to me why ReLUs can be trained at all via the backpropagation algorithm?

Yugw O'yu asked May 14 '15


People also ask

Why should the activation function be differentiable in a backpropagation algorithm?

An ideal activation function is both nonlinear and differentiable. The nonlinear behavior of an activation function allows our neural network to learn nonlinear relationships in the data. Differentiability is important because it allows us to backpropagate the model's error when training to optimize the weights.

What is the activation function for the backpropagation algorithm?

The backpropagation algorithm requires a differentiable activation function, and the most commonly used are tan-sigmoid, log-sigmoid, and, occasionally, linear. Feed-forward networks often have one or more hidden layers of sigmoid neurons followed by an output layer of linear neurons.
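For concreteness, here is a minimal NumPy sketch (my own illustration, not from the page) of the activations named above together with the derivatives that backpropagation actually consumes; the function names are purely illustrative.

    import numpy as np

    def tanh(x):            # "tan-sigmoid"
        return np.tanh(x)

    def tanh_grad(x):
        return 1.0 - np.tanh(x) ** 2

    def logistic(x):        # "log-sigmoid"
        return 1.0 / (1.0 + np.exp(-x))

    def logistic_grad(x):
        s = logistic(x)
        return s * (1.0 - s)

    def linear(x):          # identity, often used for output layers
        return x

    def linear_grad(x):
        return np.ones_like(x)

    x = np.linspace(-3.0, 3.0, 7)
    print(tanh_grad(x))     # smooth, well-defined gradient everywhere
    print(logistic_grad(x))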

Which of the following activation functions is not used in the backpropagation algorithm?

A requirement for the backpropagation algorithm is a differentiable activation function. However, the Heaviside step function is non-differentiable at x = 0 and its derivative is 0 everywhere else. This means that gradient descent won't be able to make progress in updating the weights.
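To make this concrete, here is a tiny illustrative sketch (my own, assuming a single weight and a squared-error loss, not from the page): because the Heaviside derivative is 0 wherever it is defined, the gradient-descent update never moves the weight.

    import numpy as np

    def heaviside(x):
        return np.where(x > 0, 1.0, 0.0)

    def heaviside_grad(x):
        # Undefined at x = 0; zero everywhere else.
        return np.zeros_like(np.asarray(x, dtype=float))

    w, x, target, lr = 0.5, 1.0, 1.0, 0.1
    for step in range(5):
        y = heaviside(w * x)
        # d/dw of (y - target)^2 via the chain rule: always 0 here.
        grad = 2.0 * (y - target) * heaviside_grad(w * x) * x
        w -= lr * float(grad)
        print(step, w)      # w stays at 0.5: no progress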

Which activation function is used in a backpropagation network?

The ReLU function. Although it gives the impression of a linear function, ReLU has a derivative and allows for backpropagation while remaining computationally efficient. The main catch is that the ReLU function does not activate all the neurons at the same time.


1 Answer

To understand how backpropagation is even possible with functions like ReLU, you need to understand the key property of the derivative that makes the backpropagation algorithm work so well. This property is that:

f(x) ~ f(x0) + f'(x0)(x - x0) 

If you treat x0 as the current value of your parameter, then knowing the value of the cost function and its derivative tells you how the cost function will behave when you change the parameter a little bit. This is the most crucial thing in backpropagation.
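As a quick sanity check, here is a small NumPy sketch (my own illustration, assuming the usual definitions of ReLU and its derivative) showing that the first-order approximation above is exact for ReLU away from 0 and only breaks down around 0.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        # Derivative of ReLU where it exists: 1 for x > 0, 0 for x < 0.
        return (np.asarray(x) > 0).astype(float)

    dx = 0.1
    for x0 in (2.0, -2.0, 0.0):
        exact = relu(x0 + dx)
        approx = relu(x0) + relu_grad(x0) * dx
        # Matches exactly for x0 = 2.0 and -2.0; mismatches near 0.
        print(f"x0={x0}: f(x0+dx)={float(exact):.3f}, approx={float(approx):.3f}")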

Because every weight update relies on this first-order approximation, you need the functions that make up your cost (the activations included) to satisfy the property stated above. It is easy to check that ReLU satisfies it everywhere except a small neighbourhood of 0. And this is the only problem with ReLU: we cannot use this property when we are close to 0.

To overcome that, you can simply fix the value of the ReLU derivative at 0 to either 1 or 0. On the other hand, most researchers do not treat this as a serious problem, simply because hitting a point close to 0 during ReLU computations is relatively rare.
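This is essentially what typical implementations do. Below is a minimal sketch (the names relu_forward/relu_backward are my own, illustrative only) of a ReLU layer whose backward pass fixes the derivative at 0 by convention, here to 0 via the strict x > 0 test.

    import numpy as np

    def relu_forward(x):
        return np.maximum(0.0, x)

    def relu_backward(x, upstream_grad):
        # Subgradient convention: derivative is 1 for x > 0, 0 for x <= 0.
        return upstream_grad * (x > 0).astype(float)

    x = np.array([-1.5, 0.0, 2.0])
    upstream = np.ones_like(x)
    print(relu_forward(x))             # -> [0. 0. 2.]
    print(relu_backward(x, upstream))  # -> [0. 0. 1.]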

So, from a purely mathematical point of view, using ReLU with the backpropagation algorithm is not strictly justified. In practice, however, it usually makes no difference that ReLU has this odd behaviour around 0.

Marcin Możejko answered Sep 28 '22