In most PyTorch code, the Adam optimizer is defined as follows:
optim = torch.optim.Adam(model.parameters(), lr=cfg['lr'], weight_decay=cfg['weight_decay'])
However, after repeated trials, I found that the following definition of Adam gives about 1.5 dB higher PSNR, which is a huge improvement.
optim = torch.optim.Adam(
    [
        {'params': get_parameters(model, bias=False)},
        {'params': get_parameters(model, bias=True), 'lr': cfg['lr'] * 2, 'weight_decay': 0},
    ],
    lr=cfg['lr'],
    weight_decay=cfg['weight_decay'])
The model is a standard U-Net, with its layers defined in __init__ and a forward method, as in any other PyTorch model.
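For concreteness, here is a minimal sketch of what such a model could look like; the layer names and sizes are illustrative assumptions, not the actual network:

import torch.nn as nn

class TinyUNet(nn.Module):
    # hypothetical stand-in for the actual U-Net
    def __init__(self):
        super().__init__()
        # top-level Conv2d / ConvTranspose2d children, visible via model._modules
        self.enc = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.dec = nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2)

    def forward(self, x):
        return self.dec(self.enc(x))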
get_parameters is defined as follows:
import torch.nn as nn

def get_parameters(model, bias=False):
    # iterate over the model's direct (top-level) submodules only
    for k, m in model._modules.items():
        print("get_parameters", k, type(m), type(m).__name__, bias)
        if bias:
            # yield only the biases of Conv2d layers
            if isinstance(m, nn.Conv2d):
                yield m.bias
        else:
            # yield the weights of Conv2d and ConvTranspose2d layers
            if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
                yield m.weight
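As a quick sanity check (using the hypothetical TinyUNet sketched above), you can list which tensors each call yields; note that model._modules only contains direct children, so convolutions nested inside nn.Sequential or other sub-blocks would not be picked up:

model = TinyUNet()

weights = list(get_parameters(model, bias=False))  # enc.weight and dec.weight
biases = list(get_parameters(model, bias=True))    # enc.bias only (Conv2d biases)

print([tuple(w.shape) for w in weights])  # [(16, 3, 3, 3), (16, 3, 2, 2)]
print([tuple(b.shape) for b in biases])   # [(16,)]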
Could someone explain why the second definition performs so much better than the first one?
In the second method, different settings are used for updating the weights and the biases. This is done through the optimizer's per-parameter options.
optim = torch.optim.Adam(
    [
        {'params': get_parameters(model, bias=False)},
        {'params': get_parameters(model, bias=True), 'lr': cfg['lr'] * 2, 'weight_decay': 0},
    ],
    lr=cfg['lr'],
    weight_decay=cfg['weight_decay'])
With this configuration, the learning rate for the biases is twice that for the weights, and weight decay is disabled for the biases.
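A minimal sketch of how to verify this, assuming the optimizer above has already been constructed: each entry in optim.param_groups carries its own lr and weight_decay, and any option not given in a group's dict is filled in from the defaults passed to the constructor.

for i, group in enumerate(optim.param_groups):
    print(f"group {i}: lr={group['lr']}, weight_decay={group['weight_decay']}, "
          f"num_params={len(group['params'])}")
# with, say, cfg['lr'] = 1e-3 and cfg['weight_decay'] = 1e-4, this prints:
# group 0: lr=0.001, weight_decay=0.0001, num_params=...
# group 1: lr=0.002, weight_decay=0, num_params=...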
The reason for doing this could be that the network was not learning properly with the uniform settings. For more background, see: Why is the learning rate for the bias usually twice as large as the LR for the weights?