I want to accumulate gradients over several batches before doing an optimizer step, and I am wondering what the right way of doing it is. According to this article it is:
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
whereas I expected it to be:
model.zero_grad()                                   # Reset gradients tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss += loss_function(predictions, labels)      # Accumulate loss
    if (i + 1) % accumulation_steps == 0:           # Wait for several steps
        loss = loss / accumulation_steps            # Normalize our loss (if averaged)
        loss.backward()                             # Backward pass
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        loss = 0
where I accumulate the loss and then divide by the number of accumulation steps to average it.
Secondary question: if I am right, would you expect my method to be quicker, given that I only do the backward pass once every accumulation_steps iterations?
According to the answer here, the first method is the memory-efficient one; the amount of compute is roughly the same in both. The second method keeps accumulating the computation graph of every forward pass until the single backward() call, so it needs roughly accumulation_steps times more memory for activations. The first method computes the gradients straight away (each backward() call simply adds to the existing .grad buffers and frees that iteration's graph), so it requires less memory.
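To see the memory difference concretely, here is a minimal sketch (not from the original question or answer) that runs both variants on a toy model and reports peak CUDA memory. The model, batch size, and loss function are made up for illustration, and it assumes a CUDA device is available:

import torch
import torch.nn as nn

def run(accumulate_loss, accumulation_steps=8, device="cuda"):
    torch.cuda.reset_peak_memory_stats(device)
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_function = nn.CrossEntropyLoss()
    loss_sum = 0.0
    model.zero_grad()
    for _ in range(accumulation_steps):
        inputs = torch.randn(64, 4096, device=device)           # fake batch
        labels = torch.randint(0, 10, (64,), device=device)
        loss = loss_function(model(inputs), labels)
        if accumulate_loss:
            # Second method: every iteration's graph stays alive
            # until the single backward() call below.
            loss_sum = loss_sum + loss
        else:
            # First method: backward() adds to .grad and frees
            # this iteration's graph immediately.
            (loss / accumulation_steps).backward()
    if accumulate_loss:
        (loss_sum / accumulation_steps).backward()
    optimizer.step()
    return torch.cuda.max_memory_allocated(device)

print("accumulate gradients:", run(False))
print("accumulate loss:     ", run(True))

Under these assumptions you should see the peak for the loss-accumulating variant grow roughly with accumulation_steps, while the gradient-accumulating variant stays flat, which is exactly the difference the answer describes.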