I want to accumulate gradients over several batches before doing an optimizer step, and I am wondering what the right way of doing it is. According to this article it is:
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
whereas I expected it to be:
model.zero_grad()                                   # Reset gradients tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss += loss_function(predictions, labels)      # Accumulate loss
    if (i + 1) % accumulation_steps == 0:           # Wait for several steps
        loss = loss / accumulation_steps            # Normalize our loss (if averaged)
        loss.backward()                             # Backward pass
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        loss = 0
where I accumulate the loss and then divide by the number of accumulation steps to average it.
Secondary question: if I am right, would you expect my method to be quicker, given that I only do the backward pass once every accumulation_steps iterations?
According to the answer here, the first method is the memory-efficient one; the amount of compute is roughly the same in both. The second method keeps accumulating the computation graph of every forward pass until the single backward() call, so it needs roughly accumulation_steps times more memory for activations. The first method computes the gradients straight away (each backward() call simply adds to the existing .grad buffers and frees that iteration's graph), so it requires less memory.
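To see the memory difference concretely, here is a minimal sketch (not from the original question or answer) that runs both variants on a toy model and reports peak CUDA memory. The model, batch size, and loss function are made up for illustration, and it assumes a CUDA device is available:

import torch
import torch.nn as nn

def run(accumulate_loss, accumulation_steps=8, device="cuda"):
    torch.cuda.reset_peak_memory_stats(device)
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_function = nn.CrossEntropyLoss()
    loss_sum = 0.0
    model.zero_grad()
    for _ in range(accumulation_steps):
        inputs = torch.randn(64, 4096, device=device)           # fake batch
        labels = torch.randint(0, 10, (64,), device=device)
        loss = loss_function(model(inputs), labels)
        if accumulate_loss:
            # Second method: every iteration's graph stays alive
            # until the single backward() call below.
            loss_sum = loss_sum + loss
        else:
            # First method: backward() adds to .grad and frees
            # this iteration's graph immediately.
            (loss / accumulation_steps).backward()
    if accumulate_loss:
        (loss_sum / accumulation_steps).backward()
    optimizer.step()
    return torch.cuda.max_memory_allocated(device)

print("accumulate gradients:", run(False))
print("accumulate loss:     ", run(True))

Under these assumptions you should see the peak for the loss-accumulating variant grow roughly with accumulation_steps, while the gradient-accumulating variant stays flat, which is exactly the difference the answer describes.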