I am just curious to know how PyTorch tracks operations on tensors (after .requires_grad is set to True) and how it later calculates the gradients automatically. Please help me understand the idea behind autograd. Thanks.
Autograd is a reverse-mode automatic differentiation system. Conceptually, autograd records a graph of all the operations that created the data as you execute them, giving you a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors.
PyTorch generates derivatives by building this backward graph behind the scenes, where tensors and backward functions are the graph's nodes. Within that graph, what happens to a tensor's gradient depends on whether the tensor is a leaf or not: by default only leaf tensors (those created by the user with requires_grad=True) have their gradients accumulated in .grad, while intermediate tensors just pass gradients further back.
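For example, here is a minimal sketch of that graph for a tiny computation (the tensor names x, y, z are just for illustration): the user-created tensor is a leaf, the final result is the root, and calling .backward() on the root walks the graph in reverse to fill in .grad on the leaf.

```python
import torch

# Leaf tensor: created by the user, tracked because requires_grad=True
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Every operation adds a node to the backward graph
y = x ** 2         # PowBackward0
z = y.sum()        # SumBackward0 -- the "root" of this graph

print(x.is_leaf, y.is_leaf)   # True False
print(z.grad_fn)              # <SumBackward0 object at 0x...>

# Walking the graph backwards from the root fills in d(z)/d(x) on the leaf
z.backward()
print(x.grad)                 # tensor([2., 4., 6.])
```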
One property of linear layers is that their gradient is constant: d(alpha*x)/dx = alpha (independent of x). Therefore the gradients will be identical along all dimensions. Add non-linear activation layers such as sigmoids and this behavior no longer holds.
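Here is a small sketch of that effect (the scale factor 3.0 is an arbitrary choice): the gradient of a purely linear op equals alpha for every element, while inserting a sigmoid makes the gradient depend on the value of x.

```python
import torch

x = torch.randn(4, requires_grad=True)

# Linear op: d(alpha * x)/dx = alpha, independent of x
alpha = 3.0
(alpha * x).sum().backward()
print(x.grad)                      # tensor([3., 3., 3., 3.])

# Non-linear op: the gradient now depends on the value of x
x.grad = None                      # clear the accumulated gradient
torch.sigmoid(alpha * x).sum().backward()
print(x.grad)                      # a different value per element
```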
Each tensor has a .grad_fn attribute that references the function that created it (except for tensors created by the user; these have None as their .grad_fn).
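You can see this attribute directly, for instance:

```python
import torch

a = torch.ones(2, requires_grad=True)   # created by the user
b = a + 1                               # created by an operation

print(a.grad_fn)   # None -- user-created leaf tensor
print(b.grad_fn)   # <AddBackward0 object at 0x...>
```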
That's a great question!
Generally, the idea of automatic differentiation (AutoDiff) is based on the multivariable chain rule, i.e.

dz/dx = dz/dy * dy/dx.

What this means is that you can express the derivative of z with respect to x via a "proxy" variable y; in fact, that allows you to break up almost any operation into a bunch of simpler (or atomic) operations that can then be "chained" together.
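As a toy, hand-written illustration of that chaining (the function sin(x^2) and the variable names are chosen here just for the example):

```python
import math

x = 1.3

# Break the computation into two atomic operations
y = x ** 2          # forward:  y = x^2
z = math.sin(y)     # forward:  z = sin(y)

# Chain the local derivatives: dz/dx = dz/dy * dy/dx
dz_dy = math.cos(y)     # derivative of sin(y)
dy_dx = 2 * x           # derivative of x^2
dz_dx = dz_dy * dy_dx

print(dz_dx)            # same as d/dx sin(x^2) = cos(x^2) * 2x
```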
Now, what AutoDiff packages like Autograd do is simply store the derivative of each such atomic operation block, e.g., a division, a multiplication, etc.
Then, at runtime, the forward-pass formula you provide (consisting of several of these blocks) can easily be turned into an exact derivative. Likewise, you can also provide derivatives for your own operations, should you think AutoDiff does not do exactly what you want it to.
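In PyTorch, the usual hook for supplying your own derivative is torch.autograd.Function; the sketch below (the Exp example is just a common illustration, not anything specific to this answer) defines a forward pass together with its exact backward pass:

```python
import torch

class Exp(torch.autograd.Function):
    """e^x with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        result = torch.exp(x)
        ctx.save_for_backward(result)   # stash what backward will need
        return result

    @staticmethod
    def backward(ctx, grad_output):
        (result,) = ctx.saved_tensors
        return grad_output * result     # d(e^x)/dx = e^x

x = torch.randn(3, requires_grad=True)
y = Exp.apply(x).sum()
y.backward()
print(torch.allclose(x.grad, torch.exp(x)))   # True
```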
The advantage of AutoDiff over derivative approximations like finite differences is simply that this is an exact solution.
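As a quick sanity check of that point (the test function and the step size h are arbitrary choices), you can compare autograd's result against a central finite difference:

```python
import torch

def f(x):
    return torch.sin(x) * x

x = torch.tensor(1.2345, requires_grad=True)

# Exact gradient via autograd: d/dx (x*sin(x)) = sin(x) + x*cos(x)
f(x).backward()
exact = x.grad.item()

# Central finite difference: (f(x + h) - f(x - h)) / (2h)
h = 1e-4
with torch.no_grad():
    approx = ((f(x + h) - f(x - h)) / (2 * h)).item()

print(exact)    # exact up to floating-point precision
print(approx)   # only an approximation; its error depends on h
```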
If you are further interested in how it works internally, I highly recommend the AutoDidact project, which aims to simplify the internals of an automatic differentiator, since there is usually also a lot of code optimization involved. Also, this set of slides from a lecture I took was really helpful in understanding.