I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:
Compute the gradient with respect to each point in the batch of size L, clip each of the L gradients separately, average them together, and finally perform a (noisy) gradient descent step.
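To make the step concrete, here is a rough, unbatched toy sketch of what I mean (clip_bound and noise_scale are placeholder names of mine, and the noise_scale * clip_bound / L scaling is just the usual convention):

import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)       # a single parameter vector, for simplicity
x = torch.randn(5, 3)                        # batch of L = 5 inputs
y = torch.randn(5)                           # targets
lr, clip_bound, noise_scale = 0.1, 1.0, 1.0  # hypothetical hyperparameters

per_example_grads = []
for i in range(x.size(0)):
    loss_i = (x[i] @ w - y[i]) ** 2                            # per-example loss
    g, = torch.autograd.grad(loss_i, w)                        # gradient for this example only
    g = g * min(1.0, clip_bound / (g.norm().item() + 1e-12))   # clip each gradient separately
    per_example_grads.append(g)

avg = torch.stack(per_example_grads).mean(0)                   # average the clipped gradients
noisy = avg + noise_scale * clip_bound / x.size(0) * torch.randn_like(avg)
with torch.no_grad():
    w -= lr * noisy                                            # (noisy) gradient descent step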
What is the best way to do this in PyTorch?
Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:
x                              # inputs with batch size L
y                              # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # per-example losses, vector of length L (e.g. a loss with reduction='none')
loss.backward()                # stores L distinct gradients in each param.grad, magically
But failing that, I could compute each gradient separately and then clip its norm before accumulating it:
x                              # inputs with batch size L
y                              # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # per-example losses, vector of length L
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_size)
The problem is that this accumulates the ith gradient into param.grad and only then clips, rather than clipping each per-example gradient before it is accumulated. What's the best way to get around this issue?
I don't think you can do much better than the second method in terms of computational efficiency; you're losing the benefits of batching in your backward pass, and that's a fact. Regarding the order of clipping, autograd stores the gradients in the .grad attribute of the parameter tensors. A crude solution would be to add a dictionary like
clipped_grads = {name: torch.zeros_like(param) for name, param in net.named_parameters()}
Run your for loop like this:
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(net.parameters(), clip_size)   # clip the ith per-example gradient in place
    for name, param in net.named_parameters():
        clipped_grads[name] += param.grad / loss.size(0)          # accumulate the clipped, averaged gradient
    net.zero_grad()                                               # clear .grad before the next example

for name, param in net.named_parameters():
    param.grad = clipped_grads[name]                              # write the averaged clipped gradients back
optimizer.step()
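If you also want the noise part of DP-SGD, you could replace that final write-back loop with something along these lines (sigma is a hypothetical noise multiplier here, and the sigma * clip_size / L scaling is the common convention, so adjust it to whatever your privacy analysis calls for):

sigma = 1.0                                       # hypothetical noise multiplier
L = loss.size(0)
for name, param in net.named_parameters():
    noise = sigma * clip_size / L * torch.randn_like(param)
    param.grad = clipped_grads[name] + noise      # noisy average of the clipped per-example gradients
optimizer.step()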
where I omitted much of the detach(), requires_grad=False and similar business which may be necessary to make it behave as expected.
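For instance, the accumulation line might end up looking more like this (just an illustration of the kind of thing I mean, not necessarily required in every setup):

clipped_grads[name] += param.grad.detach().clone() / loss.size(0)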
The disadvantage of the above is that you end up storing 2x the memory for your parameter gradients. In principle you could take the "raw" gradient, clip it, add it to clipped_grads, and then discard it as soon as no downstream operation needs it, whereas here you retain the raw values in .grad until the end of the backward pass. It may be that register_backward_hook allows you to do that if you go against the guidelines and actually modify the grad_input, but you would have to verify with someone more intimately acquainted with autograd.
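If you want to experiment in that direction, per-tensor hooks (Tensor.register_hook rather than the module-level register_backward_hook) may be the easier thing to try. Note the caveats: this clips each parameter's gradient norm separately instead of the global norm, and it relies on the returned tensor being what gets accumulated into .grad, so treat it as a sketch to verify rather than a recipe:

# clip each per-example gradient before it lands in .grad
# caveat: clips per-parameter norms, not the global norm over all parameters
for param in net.parameters():
    param.register_hook(lambda g: g * torch.clamp(clip_size / (g.norm() + 1e-6), max=1.0))

net.zero_grad()
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)   # .grad accumulates already-clipped per-example gradients

for param in net.parameters():
    param.grad /= loss.size(0)            # average over the batch
optimizer.step()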