 

How can I handle multiple losses in PyTorch?

Tags:

python

pytorch


I want to use some auxiliary losses to improve my model's performance.
What is the correct way to implement this in PyTorch? I see three candidates:

# one
loss1.backward()
loss2.backward()
loss3.backward()
optimizer.step()

# two
loss1.backward()
optimizer.step()
loss2.backward()
optimizer.step()
loss3.backward()
optimizer.step()

# three
loss = loss1 + loss2 + loss3
loss.backward()
optimizer.step()

Thanks for your answer!

Asked Jan 01 '19 by heiheihei



2 Answers

The first and third approaches are exactly equivalent and correct, while the second approach is completely wrong.

The reason is that in PyTorch, gradients of lower layers are not "overwritten" by subsequent backward() calls; they are accumulated, i.e. summed. This makes the first and third approaches identical, though the first may be preferable if you have a low-memory GPU/RAM, since a batch size of 1024 with an immediate backward() + step() call is the same as having 8 batches of size 128 and 8 backward() calls, with one step() call at the end.
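The accumulation behavior described above can be checked directly; here is a minimal sketch (my own, not part of the original answer) showing that a second backward() call adds to .grad rather than replacing it:

```python
import torch

x = torch.tensor([1.0], requires_grad=True)

(2 * x).backward()   # d(2x)/dx = 2
print(x.grad)        # tensor([2.])

(3 * x).backward()   # adds d(3x)/dx = 3 on top of the existing 2
print(x.grad)        # tensor([5.])  -> accumulated, not overwritten
```

This is exactly why the optimizer's zero_grad() exists: without it, every subsequent backward() keeps summing into the same .grad buffers.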

To illustrate the idea, here is a simple example. We want to move our tensor x as close as possible to 40, 50 and 60 simultaneously (with a squared-error criterion, so the numbers below work out):

x = torch.tensor([1.0], requires_grad=True)
criterion = lambda target, x: ((x - target) ** 2).sum()
loss1 = criterion(40, x)
loss2 = criterion(50, x)
loss3 = criterion(60, x)

Now the first approach (we use tensor.grad to read the current gradient of our tensor x):

loss1.backward()
loss2.backward()
loss3.backward()
print(x.grad)

This outputs: tensor([-294.]) (EDIT: use retain_graph=True in the first two backward() calls for more complicated computational graphs)
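The retain_graph=True caveat matters whenever the losses share part of the computational graph. A small sketch of such a case (my own example, not from the original answer):

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
h = x * 2                  # intermediate node shared by both losses

loss1 = (h - 3) ** 2
loss2 = (h - 5) ** 2

# Without retain_graph=True, the shared graph through h is freed after
# the first backward(), and the second call raises a RuntimeError.
loss1.backward(retain_graph=True)
loss2.backward()
print(x.grad)  # 2*(2-3)*2 + 2*(2-5)*2 = -16
```

In the example above this is unnecessary, because each criterion(target, x) call builds its own graph directly from the leaf x.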

The third approach:

loss = loss1 + loss2 + loss3
loss.backward()
print(x.grad)

Again, the output is: tensor([-294.])

The second approach is different because we don't call opt.zero_grad() after each step() call, so every step() reuses the accumulated gradients of all earlier backward() calls. For example, if the three losses contribute gradients of 5, 1 and 4 to the same weight, then instead of a total of 10 (= 5 + 1 + 4), your weight receives an effective gradient of 5*3 + 1*2 + 4*1 = 21.
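The arithmetic above can be reproduced in running code; here is a sketch (mine, not from the original answer) simulating the second approach with plain SGD at lr=1.0 and three losses whose individual gradients are 5, 1 and 4:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=1.0)

# Each loss g*w contributes gradient g; no zero_grad() between steps.
for g in (5.0, 1.0, 4.0):
    (g * w).backward()   # adds g into w.grad
    opt.step()           # updates with the ACCUMULATED gradient

# The three updates used gradients 5, 5+1=6 and 5+1+4=10,
# an effective total of 5*3 + 1*2 + 4*1 = 21 (not 5+1+4 = 10).
print(w)  # started at 1.0, moved by -21 in total -> tensor([-20.], ...)
```

Inserting opt.zero_grad() after each step() would instead give the intended total of 10.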


Answered Oct 04 '22 by Shihab Shahriar Khan


-- Comment on first approach removed, see other answer --

Your second approach would require that you backpropagate with retain_graph=True, which incurs heavy computational costs. Moreover, it is wrong: you would update the network weights with the first optimizer step, and the next backward() call would then compute gradients with respect to the weights as they were before the update, which means that the second step() call inserts noise into your updates. If, on the other hand, you performed another forward() pass so as to backpropagate through the updated weights, you would end up with an asynchronous optimization scheme: the first layers would be updated once with the first step(), and then once more for every subsequent step() call (not wrong per se, but inefficient and probably not what you wanted in the first place).

Long story short: the way to go is the last approach. Reduce each loss to a scalar, sum the losses, and backpropagate the resulting loss. Side note: make sure your reduction scheme makes sense (e.g. if you are using reduction='sum' and the losses correspond to a multi-label classification, remember that the number of classes per objective differs, so the relative weight contributed by each loss will also differ).
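As an illustrative sketch of that advice (the shapes, class counts and the 0.3 auxiliary weight below are all made up for the example), summing two mean-reduced losses into one scalar before a single backward():

```python
import torch
import torch.nn.functional as F

# Hypothetical outputs for a main task (10 classes) and an auxiliary
# task (3 classes); in practice these come from your model's heads.
main_logits = torch.randn(8, 10, requires_grad=True)
aux_logits = torch.randn(8, 3, requires_grad=True)
main_target = torch.randint(0, 10, (8,))
aux_target = torch.randint(0, 3, (8,))

# reduction='mean' keeps both terms on a comparable per-sample scale;
# the 0.3 weight on the auxiliary loss is a tunable hyperparameter.
loss_main = F.cross_entropy(main_logits, main_target, reduction='mean')
loss_aux = F.cross_entropy(aux_logits, aux_target, reduction='mean')

loss = loss_main + 0.3 * loss_aux   # single scalar
loss.backward()                     # one backward pass for everything
```

Weighting the terms explicitly is usually safer than relying on reduction='sum' to balance objectives with different class counts.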

Answered Oct 04 '22 by KonstantinosKokos