When we use the Adam optimizer and want to continue training a network from a pretrained model, we should load not only model.state_dict but also optimizer.state_dict. And if we have modified the network's structure, we should also modify the saved optimizer's state_dict for loading to succeed.
But I don't understand some of the entries in the saved optimizer.state_dict, like optim_dict["state"] (dict_keys(['step', 'exp_avg', 'exp_avg_sq', 'max_exp_avg_sq'])) and optim_dict['param_groups'][0]['params']. There are many numbers like these:
b['optimizer_state_dict']['state'].keys()
Out[71]: dict_keys([140623218628000, 140623218628072, 140623218628216, 140623218628360, 140623218628720, 140623218628792, 140623218628936, 140623218629080, 140623218629656, 140623218629728, 140623218629872, 140623218630016, 140623218630376, 140623218630448, 140623218716744, 140623218716816, 140623218717392, 140623218717464, 140623218717608, 140623218717752, 140623218718112, 140623218718184, 140623218718328, 140623218718472, 140623218719048, 140623218719120, 140623218719264, 140623218719408, 140623218719768, 140623218719840, 140623218719984, 140623218720128, 140623218720704, 140623209943112, 140623209943256, 140623209943400, 140623209943760, 140623209943832, 140623209943976, 140623209944120, 140623209944696, 140623209944768, 140623209944912, 140623209945056, 140623209945416, 140623209945488, 140623209945632, 140623209945776, 140623209946352, 140623209946424, 140623209946568, 140623209946712, 140623209947072, 140623210041416, 140623210041560, 140623210041704, 140623244033768, 140623244033840, 140623244033696, 140623244033912, 140623244033984, 140623244070984, 140623244071056, 140623244071128, 140623429501576, 140623244071200, 140623244071272, 140623244071344, 140623244071416, 140623244071488, 140623244071560, 140623244071632, 140623244071848, 140623244071920, 140623244072064, 140623244072208, 140623244072424, 140623244072496, 140623244072640, 140623244072784, 140623244073216, 140623244073288, 140623244073432, 140623244073576, 140623244073792, 140623244073864, 140623244074008, 140623244074152, 140623244074584, 140623244074656, 140623244074800, 140623244074944, 140623218540760, 140623218540832, 140623218540976, 140623218541120, 140623218541552, 140623218541624, 140623218541768, 140623218541912, 140623218542128, 140623218542200, 140623218542344, 140623218542488, 140623218542920, 140623218542992, 140623218543136, 140623218543280, 140623218543496, 140623218543568, 140623218543712, 140623218543856, 140623218544288, 140623218544360, 140623218544504, 140623218626632, 140623218626992, 140623218627064, 140623218627208, 140623218627352, 140623218627784, 140623218629440, 140623218717176, 140623218718832, 140623218720488, 140623209944480, 140623209946136, 140623210043000])
In [44]: b['optimizer_state_dict']['state'][140623218628072].keys()
Out[44]: dict_keys(['step', 'exp_avg', 'exp_avg_sq', 'max_exp_avg_sq'])
In [45]: b['optimizer_state_dict']['state'][140623218628072]['exp_avg'].shape
Out[45]: torch.Size([480])
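For reference, the save/resume pattern in question looks roughly like this (a minimal sketch; the tiny stand-in model, the amsgrad flag, and the file name 'checkpoint.pth' are illustrative, while the checkpoint key names match the ones above):
import torch
import torch.nn as nn

# A tiny stand-in model and optimizer (the real network is defined elsewhere).
model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

# Save: keep both the model's and the optimizer's state_dict in one checkpoint.
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pth')

# Resume: restore both, so Adam's running statistics continue from where training stopped.
b = torch.load('checkpoint.pth')
model.load_state_dict(b['model_state_dict'])
optimizer.load_state_dict(b['optimizer_state_dict'])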
In contrast to the model's state_dict, which saves the learnable parameters, the optimizer's state_dict contains information about the optimizer's state (for the parameters being optimized), as well as the hyperparameters used.
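A quick way to see the contrast (a minimal sketch; the small linear model is just a placeholder):
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# The model's state_dict maps parameter names to their tensors ...
print(model.state_dict().keys())       # odict_keys(['weight', 'bias'])
# ... while the optimizer's state_dict holds its own state plus the hyperparameters.
print(optimizer.state_dict().keys())   # dict_keys(['state', 'param_groups'])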
All optimizers in PyTorch need to inherit from the base class torch.optim.Optimizer, whose constructor requires two arguments:
params (iterable) – an iterable of torch.Tensors or dicts, specifying which Tensors should be optimized.
defaults (dict) – a dict containing default values of optimization options (used when a parameter group doesn't specify them).
In addition to that, optimizers also support specifying per-parameter options. To do this, instead of passing an iterable of Tensors, pass in an iterable of dicts. Each of them will define a separate parameter group, and should contain a 'params' key holding the list of parameters belonging to that group.
Consider an example:
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
Here, we have provided a) the params, b) the default hyperparameters lr and momentum, and c) per-parameter groups. In this case, model.base's parameters will use the default learning rate of 1e-2, model.classifier's parameters will use a learning rate of 1e-3, and a momentum of 0.9 will be used for all parameters.
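A self-contained version of the same idea (a sketch; the two-part model below is made up just to mirror the base/classifier split):
import torch.nn as nn
import torch.optim as optim

class TwoPartNet(nn.Module):
    # A made-up model with a 'base' and a 'classifier' part, mirroring the example above.
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 2)

model = TwoPartNet()
optimizer = optim.SGD([
    {'params': model.base.parameters()},                     # falls back to the defaults
    {'params': model.classifier.parameters(), 'lr': 1e-3},   # overrides lr for this group
], lr=1e-2, momentum=0.9)

# Each dict became its own parameter group; unspecified options took the defaults.
for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], group['momentum'])   # 0 0.01 0.9 and 1 0.001 0.9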
Calling optimizer.step() performs a single optimization step (a parameter update), which changes the state of the optimizer.
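For example (a minimal sketch; the linear layer and random input are placeholders), the optimizer's state is empty until the first step and gets populated afterwards:
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

print(len(optimizer.state_dict()['state']))   # 0 -- no per-parameter state yet

# One optimization step: compute a loss, backpropagate, update the parameters.
loss = model(torch.randn(3, 4)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(len(optimizer.state_dict()['state']))   # 2 -- one entry per parameter (weight and bias)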
Now, coming to the optimizer's state_dict: it returns the state of the optimizer as a dict containing two entries:
state – a dict holding the current optimization state.
param_groups – a dict containing all parameter groups (as discussed above).
Some of the state entries are specific to the optimizer used, e.g. for Adam:
exp_avg – the exponential moving average of gradient values.
exp_avg_sq – the exponential moving average of squared gradient values.
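These entries can be inspected directly; here is a small sketch (assuming Adam with amsgrad=True, which is what adds the max_exp_avg_sq entry seen in the question):
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

# Take one step so that Adam has populated its per-parameter state.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

sd = optimizer.state_dict()
print(sd.keys())                                  # dict_keys(['state', 'param_groups'])

# The keys of sd['state'] are the same identifiers listed in sd['param_groups'][0]['params'].
first_key = next(iter(sd['state']))
print(sd['state'][first_key].keys())              # dict_keys(['step', 'exp_avg', 'exp_avg_sq', 'max_exp_avg_sq'])
print(sd['state'][first_key]['exp_avg'].shape)    # same shape as the corresponding parameter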