How does one dynamically add new parameters to optimizers in Pytorch?

I was going through this post in the pytorch forum, and I also wanted to do this. The original post removes and adds layers, but I think my situation is not that different. I also want to add layers, or more filters, or word embeddings. My main motivation is that the AI agent does not know the whole vocabulary/dictionary in advance because it's large. I strongly prefer (for the moment) not to do character-by-character RNNs.

So what will happen for me is when the agent starts a forward pass it might find new words it has never seen and will need to add them to the embedding table (or perhaps add new filters before it starts the forward pass).

So what I want to make sure is:

  1. embeddings are added correctly (at the right time, when a new computation graph is made) so that they are updatable by the optimizer
  2. no issues with stored info of past parameters, e.g. if it's using some sort of momentum

How does one do this? Any sample code that works?

asked Apr 11 '19 by Charlie Parker




2 Answers

Just to add an answer to the title of your question: "How does one dynamically add new parameters to optimizers in Pytorch?"

You can add parameters to the optimizer at any time with add_param_group:

import torch
import torch.optim as optim

model = torch.nn.Linear(2, 2)

# Initialize optimizer (note: Adam has no momentum argument; it handles momentum internally via betas)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# a new parameter the optimizer does not know about yet
extra_params = torch.nn.Parameter(torch.randn(2, 2))
# register it with the optimizer as an additional param group
optimizer.add_param_group({'params': extra_params})

# then you can print your `extra_params`
print("extra params", extra_params)
print("optimizer params", optimizer.param_groups)

answered Oct 16 '22 by prosti



That is a tricky question, as I would argue that the answer is "it depends", in particular on how you want to deal with the optimizer.

Let's start with your specific problem - an embedding. In particular, you are asking how to add embeddings to allow for a larger vocabulary dynamically. My first advice is that if you have a good sense of an upper bound on your vocabulary size, make the embedding large enough to cope with it from the beginning, as this is more efficient and you will need the memory eventually anyway. But this is not what you asked. So - to dynamically change your embedding, you'll need to overwrite the old one with a new one and inform your optimizer of the change. You can do that whenever you run into an exception with your old embedding, in a try ... except block. This should roughly follow this idea:

# from within whichever module owns the embedding
# remember the already trained weights
old_embedding_weights = self.embedding.weight.data
old_vocab_size = old_embedding_weights.shape[0]
# create a new embedding of the new size
self.embedding = nn.Embedding(new_vocab_size, embedding_dim)
# initialize the values for the new embedding. this is random, but you might want to use something like GloVe
new_weights = torch.randn(new_vocab_size, embedding_dim)
# as your old values may have been updated, you want to retrieve these updated values
new_weights[:old_vocab_size] = old_embedding_weights
self.embedding.weight.data.copy_(new_weights)

However, you should not do this for every single new word you receive, as the copying takes time (and a whole lot of memory, as the embedding exists twice for a short time; if you're nearly out of memory, just make your embedding large enough from the start). So instead, increase the size by a couple of hundred slots at a time, as sketched below.
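
A minimal sketch of that chunked strategy (the helper name `grow_embedding` and the chunk size are illustrative, not part of the original answer):

import torch
import torch.nn as nn

GROWTH_CHUNK = 256  # grow by a few hundred slots at a time, not word by word

def grow_embedding(old_embedding: nn.Embedding, extra_slots: int = GROWTH_CHUNK) -> nn.Embedding:
    # build a larger embedding and copy the already trained rows into it
    old_vocab_size, embedding_dim = old_embedding.weight.shape
    new_embedding = nn.Embedding(old_vocab_size + extra_slots, embedding_dim)
    with torch.no_grad():
        new_embedding.weight[:old_vocab_size] = old_embedding.weight
    return new_embedding

# usage inside the owning module, e.g. when a token id falls outside the current table:
# if token_id >= self.embedding.num_embeddings:
#     self.embedding = grow_embedding(self.embedding)
#     # ... and re-create / update the optimizer, as discussed below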

Additionally, this first step already raises some questions:

  1. How does my respective nn.Module know about the new embedding parameter? The __setattr__ method of nn.Module takes care of that (see here)
  2. Second, why don't I simply change my parameter? That's already pointing towards some of the problems of changing the optimizer: pytorch internally keeps references by object ID. This means that if you change your object, all these references will point towards a potentially incompatible object, as its properties have changed. So we should simply create a new parameter instead.
  3. What about other nn.Parameters or nn.Modules that are not embeddings? These you treat the same. You basically just instantiate them and attach them to their parent module. The __setattr__ method will take care of the rest, so you can do this completely dynamically (see the sketch right after this list) ...
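
As a small illustration of that registration behaviour (module and attribute names here are just examples):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

net = Net()
# plain attribute assignment goes through nn.Module.__setattr__,
# so the new parameter and the new sub-module are registered automatically
net.extra_bias = nn.Parameter(torch.zeros(4))
net.head = nn.Linear(4, 2)
print([name for name, _ in net.named_parameters()])
# now also lists 'extra_bias', 'head.weight' and 'head.bias' alongside 'layer.weight' and 'layer.bias'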

The exception, of course, is the optimizer. The optimizer is the only other thing that "knows" about your parameters besides your main model module. So you need to let the optimizer know of any change.

And this is tricky if you want to be sophisticated about it, and very easy if you don't care about keeping the optimizer state. However, even if you want to be sophisticated, there is a very good reason why you probably should not do this anyway. More about that below.

Anyway, if you don't care, a simple

# simply overwrite your old optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)

will do. If you do care, however, and want to transfer your old state, you can do so the same way that you store and later load parameters and optimizer states from disk: using the .state_dict() and .load_state_dict() methods. This, however, only works with a twist:

# extract the state dict from your old optimizer
old_state_dict = optimizer.state_dict()
# create a new optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
new_state_dict = optimizer.state_dict()
# the old state dict will have references to the old parameters, in state_dict['param_groups'][xyz]['params'] and in state_dict['state']
# you now need to find the parameter mismatches between the old and new state dicts
# if your optimizer has multiple param groups, you need to loop over them, too (I use xyz as a placeholder here; mostly you'll only have 1 anyway, so just replace xyz with 0)
new_pars = [p for p in new_state_dict['param_groups'][xyz]['params'] if p not in old_state_dict['param_groups'][xyz]['params']]
old_pars = [p for p in old_state_dict['param_groups'][xyz]['params'] if p not in new_state_dict['param_groups'][xyz]['params']]
# then you remove all the outdated ones from the state dict
for pid in old_pars:
    old_state_dict['state'].pop(pid, None)  # may be absent if that parameter was never stepped
# and add a new state for each new parameter to the state:
for pid in new_pars:
    old_state_dict['param_groups'][xyz]['params'].append(pid)
    old_state_dict['state'][pid] = { ... }  # your new state def here, depending on your optimizer
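
Finally (not shown above), once the { ... } placeholder is filled in for your optimizer, the patched dict still has to be handed back to the freshly created optimizer, roughly:

# load the patched state dict into the new optimizer
optimizer.load_state_dict(old_state_dict)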

However, here's the reason why you should probably never update your optimizer like this, but should instead re-initialize from scratch and just accept the loss of state information: When you change your computation graph, you change the forward and backward computation for all parameters along your computation path (if you do not have a branching architecture, this path will be your entire graph). This more specifically means that the input to your functions (=layer/nn.Module) will be different if you change some function (=layer/nn.Module) applied earlier, and the gradients will change if you change some function (=layer/nn.Module) applied later. That in turn invalidates the entire state of your optimizer. So if you keep your optimizer's state around, it will be a state computed for a different computation graph, and it will probably end up in catastrophic behavior on the part of your optimizer if you try to apply it to a new computation graph. (I've been there ...)

So - to sum it up: I'd really recommend trying to keep it simple, changing parameters as conservatively as possible, and not touching the optimizer.


answered Oct 16 '22 by cleros