From the docs:
requires_grad – Boolean indicating whether the Variable has been created by a subgraph containing any Variable that requires it. Can be changed only on leaf Variables.
requires_grad indicates whether a variable is trainable. By default, requires_grad is False when creating a Variable. If one of the inputs to an operation requires gradient, its output and its subgraph will also require gradient.
Setting the requires_grad parameter allows for fine-grained exclusion of subgraphs from gradient computation. It takes effect in both the forward and backward passes: during the forward pass, an operation is only recorded in the backward graph if at least one of its input tensors requires grad.
requires_grad_(requires_grad=True) → Tensor. Changes whether autograd should record operations on this tensor: it sets the tensor's requires_grad attribute in-place and returns the tensor. The main use case of requires_grad_() is to tell autograd to begin recording operations on an existing Tensor.
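A minimal sketch of this behavior (the tensor names are illustrative, not from the docs):
import torch
x = torch.randn(3)          # requires_grad is False by default
print(x.requires_grad)      # False
x.requires_grad_()          # in-place: autograd now records operations on x
y = torch.randn(3)          # still does not require grad
z = x * y                   # one input requires grad, so the output does too
print(z.requires_grad)      # True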
In PyTorch, leaf nodes are therefore the values from which the computation begins. Here is a simple program illustrating this:
import torch as T
# The following two values are the leaf nodes
x = T.ones(10, requires_grad=True)
y = T.ones(10, requires_grad=True)
# The remaining nodes are not leaves:
def H(z1, z2):
    return T.sigmoid(z1 + z2)  # the original body is truncated here; any torch op on z1 and z2 illustrates the point
Leaf nodes of a graph are those nodes (i.e. Variables) that were not computed directly from other nodes in the graph. For example:
import torch
from torch.autograd import Variable
A = Variable(torch.randn(10,10)) # this is a leaf node
B = 2 * A # this is not a leaf node
w = Variable(torch.randn(10,10)) # this is a leaf node
C = A.mm(w) # this is not a leaf node
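You can check this with the is_leaf attribute. A small sketch (note that tensors with requires_grad=False count as leaves by convention, so the check is most informative on tensors that require grad):
a = torch.randn(10, 10, requires_grad=True)  # created directly -> leaf
b = 2 * a                                    # computed from a  -> not a leaf
print(a.is_leaf)  # True
print(b.is_leaf)  # False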
If a leaf node requires_grad, all subsequent nodes computed from it will automatically also require grad. Otherwise, you could not apply the chain rule to calculate the gradient of the leaf node that requires_grad. This is the reason why requires_grad can only be set on leaf nodes: for all other nodes, it can be smartly inferred and is in fact determined by the settings of the leaf nodes used to compute those variables.
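For illustration, trying to flip the flag on a non-leaf node raises an error (a small sketch; the exact error message may vary between PyTorch versions):
x = torch.ones(3, requires_grad=True)  # leaf
y = x * 2                              # non-leaf, requires_grad inferred as True
try:
    y.requires_grad = False            # not allowed on non-leaf nodes
except RuntimeError as e:
    print(e)                           # only leaf variables' flags can be changed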
Note that in a typical neural network, all parameters are leaf nodes. They are not computed from any other Variables in the network. Hence, freezing layers using requires_grad is simple. Here is an example taken from the PyTorch docs:
import torch.nn as nn
import torch.optim as optim
import torchvision
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
param.requires_grad = False
# Replace the last fully-connected layer
# Parameters of newly constructed modules have requires_grad=True by default
model.fc = nn.Linear(512, 100)
# Optimize only the classifier
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
Note, though, that what you really do here is freeze the entire gradient computation (which is what you should be doing, as it avoids unnecessary computation). Technically, you could also leave the requires_grad flag on and only define your optimizer for the subset of parameters that you would like to learn.
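A minimal sketch of that alternative (it still computes gradients for the frozen layers, which wastes work):
# Leave all requires_grad flags untouched and simply give the optimizer
# only the classifier's parameters; the rest of the network never updates.
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(512, 100)
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)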