The Wikipedia page for backpropagation has this claim:
The backpropagation algorithm for calculating a gradient has been rediscovered a number of times, and is a special case of a more general technique called automatic differentiation in the reverse accumulation mode.
Can someone expound on this and put it in layman's terms? What is the function being differentiated? What is the "special case"? Is it the adjoint values themselves that are used, or the final gradient?
Update: since writing this I have found that this is covered in the Deep Learning book, section 6.5.9. See https://www.deeplearningbook.org/ . I have also found this paper to be informative on the subject: "Stable architectures for deep neural networks" by Haber and Ruthotto.
Backpropagation is a special case of an extraordinarily powerful programming abstraction called automatic differentiation (AD).
I think the difference is that back-propagation usually refers to updating the weights using their gradients in order to minimize a loss function; "back-propagating the gradients" is a typical phrase. Reverse-mode differentiation, by contrast, refers only to calculating the gradient of a function.
Stochastic gradient descent is an optimization algorithm for minimizing the loss of a predictive model with respect to a training dataset. Back-propagation is an automatic differentiation algorithm for calculating the gradients of the loss with respect to the weights in a neural network's computational graph.
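Here is a minimal JAX sketch of that division of labor, using a hypothetical toy linear model and random data: reverse-mode AD (backprop) supplies the gradient, and SGD is the separate loop that consumes it to update the weights.

```python
import jax
import jax.numpy as jnp

# Toy linear model and squared-error loss (illustrative shapes and data).
def loss(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

key = jax.random.PRNGKey(0)
key_x, key_y = jax.random.split(key)
x = jax.random.normal(key_x, (32, 3))   # 32 examples, 3 features
y = jax.random.normal(key_y, (32,))
w = jnp.zeros(3)

grad_loss = jax.grad(loss)              # reverse-mode AD / backprop: gives dloss/dw

lr = 0.1
for _ in range(100):                    # gradient descent: the optimizer that *uses* the gradient
    g = grad_loss(w, x, y)
    w = w - lr * g
```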
The backpropagation algorithm is suited to feed-forward neural networks with fixed-size input-output pairs. Backpropagation Through Time (BPTT) is the application of the backpropagation training algorithm to sequence data, such as time series, and is used to train recurrent neural networks.
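As a small sketch of BPTT (a toy scalar RNN with a made-up sequence, not a full implementation): the recurrence is unrolled over the time steps, and reverse-mode AD propagates the loss gradient back through every step of the unrolled computation.

```python
import jax
import jax.numpy as jnp

# Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t), loss on the final state.
def rnn_loss(params, xs, target):
    w_h, w_x = params
    h = 0.0
    for x in xs:                          # unroll the recurrence over the sequence
        h = jnp.tanh(w_h * h + w_x * x)
    return (h - target) ** 2

params = (jnp.array(0.5), jnp.array(0.5))
xs = [0.1, 0.2, 0.3, 0.4]                 # a short, made-up time series
grads = jax.grad(rnn_loss)(params, xs, 1.0)   # dloss/dw_h, dloss/dw_x via BPTT
```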
"What is the function being differentiated? What is the "special case?""
The most important distinction between backpropagation and reverse-mode AD is that reverse-mode AD computes the vector-Jacobian product of a vector-valued function from R^n -> R^m, while backpropagation computes the gradient of a scalar-valued function from R^n -> R. Backpropagation is therefore a special case of reverse-mode AD for scalar functions.
When we train neural networks, we always have a scalar-valued loss function, so we are always using backpropagation; that loss is the function being differentiated. Since backprop is a special case of reverse-mode AD, we are also using reverse-mode AD whenever we train a neural network.
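A short JAX sketch of that distinction (the functions here are illustrative): jax.vjp handles a vector-valued function R^3 -> R^2 by computing vector-Jacobian products, while jax.grad only accepts a scalar-valued output, which is exactly the backpropagation special case.

```python
import jax
import jax.numpy as jnp

# Vector-valued function f: R^3 -> R^2 -- general reverse-mode AD territory.
def f(x):
    return jnp.array([x[0] * x[1], x[1] + x[2] ** 2])

x = jnp.array([1.0, 2.0, 3.0])
y, vjp_f = jax.vjp(f, x)                      # reverse-mode AD: a vector-Jacobian product
(row_grad,) = vjp_f(jnp.array([1.0, 0.0]))    # v^T J with v = e_1: gradient of the first output

# Scalar-valued loss L: R^3 -> R -- the backpropagation special case.
def L(x):
    return jnp.sum(f(x) ** 2)

g = jax.grad(L)(x)                            # equivalent to a VJP seeded with dL/dL = 1
```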
"Is it the adjoint values themselves that are used or the final gradient?"
The adjoint of a variable is the gradient of the loss function with respect to that variable. When we train a neural network, we use the gradients of the loss with respect to the parameters (weights, biases, etc.) to update those parameters. So we do use the adjoints, but only the adjoints of the parameters (which are exactly the gradients of the loss with respect to the parameters); the adjoints of intermediate activations are computed along the way but are not used directly in the update.
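To make the adjoint bookkeeping concrete, here is a hand-written backward pass for a tiny one-layer network (a sketch with made-up shapes and values): the reverse sweep produces an adjoint for every variable, but only the parameter adjoints dL/dW and dL/db feed the update.

```python
import jax.numpy as jnp

# Tiny network: z = x @ W + b, loss = mean((z - y)^2). Shapes are illustrative.
x = jnp.ones((4, 3))
y = jnp.zeros((4, 2))
W = jnp.full((3, 2), 0.1)
b = jnp.zeros(2)

# Forward pass, keeping intermediates.
z = x @ W + b
L = jnp.mean((z - y) ** 2)

# Reverse sweep: the adjoint of a variable is dL/d(variable).
z_adj = 2.0 * (z - y) / z.size        # adjoint of the intermediate z
W_adj = x.T @ z_adj                   # adjoint of the parameter W
b_adj = z_adj.sum(axis=0)             # adjoint of the parameter b
# x also gets an adjoint (x_adj = z_adj @ W.T), but it is not needed for training.

# Only the parameter adjoints are used in the update.
lr = 0.1
W = W - lr * W_adj
b = b - lr * b_adj
```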