I have just started programming neural networks. I am currently trying to understand how a backpropagation (BP) neural net works. While the algorithm for training BP nets is quite straightforward, I was unable to find any text on why the algorithm works. More specifically, I am looking for some mathematical reasoning to justify the use of sigmoid functions in neural nets, and an explanation of what makes them mimic almost any data distribution thrown at them.
Thanks!
The sigmoid function acts as an activation function in machine learning: it is used to add non-linearity to a model. In simple terms, it decides how much of a neuron's weighted input is passed on as output. It is one of several activation functions commonly used in machine learning and deep learning.
The backpropagation learning rule relies on the fact that the sigmoid function is differentiable, which makes it possible to characterize the rate of change in the output layer error with respect to a change in a particular weight (even if the weight is multiple layers away from the output).
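For example, here is a minimal NumPy sketch (the variable names and values are my own, purely illustrative) showing how the derivative of the sigmoid, which can be written as sigmoid(x) * (1 - sigmoid(x)), lets you turn an output error into a gradient for a weight via the chain rule:

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)), output lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The derivative can be computed from the function's own output:
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# One neuron, one training example (illustrative values)
x = np.array([0.5, -1.2])        # inputs
w = np.array([0.8, 0.3])         # weights
target = 1.0

z = np.dot(w, x)                 # pre-activation (weighted sum)
y = sigmoid(z)                   # neuron output

# Squared-error loss E = 0.5 * (y - target)^2
# Chain rule: dE/dw_i = (y - target) * sigmoid'(z) * x_i
grad_w = (y - target) * sigmoid_derivative(z) * x

# One gradient-descent step with a small learning rate
w -= 0.1 * grad_w
```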
The sigmoid is a function whose output lies strictly between 0 and 1, approaching both values asymptotically. This makes it very handy for binary classification with 0 and 1 as the potential output values.
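As a small illustration (the 0.5 cut-off below is just the usual convention, not something the function itself dictates), the bounded output can be read directly as a class score and thresholded:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-6.0, -0.5, 0.0, 2.0, 6.0])
probs = sigmoid(scores)              # all values lie strictly between 0 and 1
labels = (probs >= 0.5).astype(int)  # conventional decision threshold

print(probs)   # approximately [0.0025 0.3775 0.5    0.8808 0.9975]
print(labels)  # [0 0 1 1 1]
```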
A basic building block of deep neural networks is the sigmoid neuron. Sigmoid neurons are similar to perceptrons, but they are modified so that the output of a sigmoid neuron is much smoother than the step-function output of a perceptron.
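To make that contrast concrete, here is a rough sketch (a toy comparison of my own, not from any particular source) of a perceptron's hard-threshold output against a sigmoid neuron's smooth output for the same pre-activation values:

```python
import numpy as np

def step(z):
    # Perceptron: hard threshold, output jumps abruptly from 0 to 1
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    # Sigmoid neuron: smooth, differentiable transition between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
print(step(z))     # [0. 0. 0. 0. 1. 1. 1. 1. 1.]
print(sigmoid(z))  # values rise gradually from about 0.018 to about 0.982
```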
The sigmoid function introduces non-linearity into the network. Without a non-linear activation function, the net can only learn functions which are linear combinations of its inputs. With a non-linear activation such as the sigmoid, a feedforward network with even a single hidden layer can approximate any continuous function to arbitrary accuracy. This result is called the universal approximation theorem, or Cybenko theorem, after the gentleman who proved it in 1989. Wikipedia is a good place to start, and it has a link to the original paper (the proof is somewhat involved, though). The reason why you would use a sigmoid as opposed to something else is that it is continuous and differentiable, its derivative is very fast to compute (as opposed to the derivative of tanh, which otherwise has similar properties), and it has a limited range (from 0 to 1, exclusive).
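As a rough illustration of both points (the non-linearity and the cheap derivative), here is a minimal NumPy sketch, with layer sizes, seed, and learning rate chosen arbitrarily, that trains a one-hidden-layer sigmoid network with backpropagation on XOR, a function no purely linear model can represent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: not linearly separable, so a model without non-linearity cannot fit it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # hidden -> output
lr = 0.5

for _ in range(10000):
    # Forward pass
    H = sigmoid(X @ W1 + b1)          # hidden activations
    Y = sigmoid(H @ W2 + b2)          # network output

    # Backward pass: sigmoid'(z) computed cheaply from the output as y * (1 - y)
    dY = (Y - T) * Y * (1 - Y)        # output-layer delta
    dH = (dY @ W2.T) * H * (1 - H)    # hidden-layer delta via the chain rule

    # Gradient-descent updates
    W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)

# Should end up close to [[0], [1], [1], [0]]; a different seed or
# learning rate may occasionally need adjusting to converge.
print(Y.round(2))
```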