
Accumulating Gradients

I want to accumulate gradients before doing a backward pass, so I'm wondering what the right way of doing it is. According to this article it's:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()

whereas I expected it to be:

model.zero_grad()                                   # Reset gradients tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss += loss_function(predictions, labels)      # Accumulate loss
    if (i+1) % accumulation_steps == 0:             # Wait for several steps
        loss = loss / accumulation_steps            # Normalize our loss (if averaged)
        loss.backward()                             # Single backward pass
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()
        loss = 0

where I accumulate the loss and then divide by the accumulation steps to average it.

Secondary question: if I am right, would you expect my method to be quicker, considering I only do the backward pass once every accumulation_steps iterations?

asked Nov 16 '18 by sachinruk


1 Answer

So according to the answer here, the first method is the memory-efficient one; the amount of computation is more or less the same in both methods. Because differentiation is linear, the gradient of the averaged sum of losses is identical to the average of the per-step gradients, so both loops produce the same parameter updates. That also answers the secondary question: the second method would not be meaningfully faster, since its single backward() still has to traverse accumulation_steps times as much graph.
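
This equivalence is easy to check empirically. Here is a minimal self-contained sketch (the toy linear model, MSE loss, and random batches are assumptions for illustration, not from the original post) that runs both loops from identical starting weights and compares the resulting gradients:

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
accumulation_steps = 4
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(accumulation_steps)]

model_a = nn.Linear(10, 1)                # method 1: backward every step
model_b = copy.deepcopy(model_a)          # method 2: identical starting weights
loss_function = nn.MSELoss()

# Method 1: normalize and backprop each step; gradients sum into .grad
for inputs, labels in data:
    loss = loss_function(model_a(inputs), labels) / accumulation_steps
    loss.backward()

# Method 2: accumulate the loss tensor, one backward at the end
total = sum(loss_function(model_b(inputs), labels) for inputs, labels in data)
(total / accumulation_steps).backward()

for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
    print(torch.allclose(p_a.grad, p_b.grad))   # True for every parameter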

The second method keeps accumulating the computation graph: the summed loss tensor holds references to the graph (and therefore the stored activations) of every forward pass since the last backward, so it requires roughly accumulation_steps times more memory. The first method calls backward() straight away, which frees each step's graph and simply adds the new gradients into the parameters' .grad buffers, so it requires much less memory.
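
You can observe the growing graph directly by counting autograd nodes, a rough proxy for how much intermediate state must stay alive. Below is a sketch under the same toy-model assumptions as above (count_graph_nodes is a hypothetical helper written for this demo, not a PyTorch API):

import torch
import torch.nn as nn

def count_graph_nodes(fn, seen=None):
    # Walk the autograd graph behind a grad_fn and count unique nodes.
    if seen is None:
        seen = set()
    if fn is None or fn in seen:
        return 0
    seen.add(fn)
    return 1 + sum(count_graph_nodes(next_fn, seen) for next_fn, _ in fn.next_functions)

model = nn.Linear(10, 1)
inputs = [torch.randn(4, 10) for _ in range(8)]

# Accumulated loss: one graph spanning all 8 forward passes stays alive
total = sum(model(x).pow(2).mean() for x in inputs)
print(count_graph_nodes(total.grad_fn))    # roughly 8x the single-step count

# Immediate backward: only one step's graph exists at a time
single = model(inputs[0]).pow(2).mean()
print(count_graph_nodes(single.grad_fn))

On a real model the retained state is dominated by saved activations rather than the node objects themselves, but the scaling is the same: the accumulated-loss graph grows linearly with accumulation_steps.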

answered Oct 24 '22 by sachinruk