I am trying to implement L1 regularization on the first layer of a simple neural network (1 hidden layer). I looked at some other posts on Stack Overflow that apply L1 regularization in PyTorch to figure out how it should be done (references: Adding L1/L2 regularization in PyTorch?, In Pytorch, how to add L1 regularizer to activations?). No matter how high I set lambda (the L1 regularization strength parameter), I never get true zeros in the first weight matrix. Why would this be? (Code is below.)
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
class Network(nn.Module):
    def __init__(self,nf,nh,nc):
        super(Network,self).__init__()
        self.lin1=nn.Linear(nf,nh)
        self.lin2=nn.Linear(nh,nc)
    def forward(self,x):
        l1out=F.relu(self.lin1(x))
        out=F.softmax(self.lin2(l1out))
        return out, l1out

def l1loss(layer):
    return torch.norm(layer.weight.data, p=1)
nf=10
nc=2
nh=6
learningrate=0.02
lmbda=10.
batchsize=50
net=Network(nf,nh,nc)
crit=nn.MSELoss()
optimizer=torch.optim.Adagrad(net.parameters(),lr=learningrate)
xtr=torch.Tensor(xtr)
ytr=torch.Tensor(ytr)
#ytr=torch.LongTensor(ytr)
xte=torch.Tensor(xte)
yte=torch.LongTensor(yte)
#cyte=torch.Tensor(yte)
it=200
for epoch in range(it):
    per=torch.randperm(len(xtr))
    for i in range(0,len(xtr),batchsize):
        ind=per[i:i+batchsize]
        bx,by=xtr[ind],ytr[ind]
        optimizer.zero_grad()
        output, l1out=net(bx)
        # l1reg=l1loss(net.lin1)
        loss=crit(output,by)+lmbda*l1loss(net.lin1)
        loss.backward()
        optimizer.step()
    print('Epoch [%i/%i], Loss: %.4f' %(epoch+1,it, np.float32(loss.data.numpy())))
corr=0
tot=0
for x,y in list(zip(xte,yte)):
    output,_=net(x)
    _,pred=torch.max(output,-1)
    tot+=1 #y.size(0)
    corr+=(pred==y).sum()
print(corr)
Note: The data has 10 features (2 classes and 800 training samples) and only the first 2 are relevant (by design) so one would assume true zeros should be easy enough to learn.
The black circle in each contour plot marks the point where the loss contours intersect the L1 norm (Lasso) constraint region. That intersection tends to lie close to the axes, which forces some coefficients to exactly 0 and hence performs feature selection. This is why the L1 norm makes the model sparse.
Reason for sparsity: L1 regularization causes coefficients to converge to 0 rather quickly, since the constraint bounds all weight vectors to lie within an L1 ball. The convergence to 0 is faster for L1 because the derivative of the penalty term is simply λ (times the sign of the weight), whereas for L2 it is 2λw, which shrinks as the weight itself shrinks and so never pushes it all the way to zero.
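To make that gradient difference concrete, here is a small sketch (not from the original post; the weight values and lmbda are arbitrary illustration choices) comparing the two penalty gradients in PyTorch:

import torch

lmbda = 10.0
w = torch.tensor([0.5, 0.01, -0.001], requires_grad=True)

# L1 penalty: gradient is lmbda * sign(w), constant magnitude even for tiny weights
l1 = lmbda * w.abs().sum()
g1 = torch.autograd.grad(l1, w)[0]
print(g1)   # magnitudes are all lmbda (10.0), even for the smallest weight

# L2 penalty: gradient is 2 * lmbda * w, which vanishes as the weight approaches zero
l2 = lmbda * (w ** 2).sum()
g2 = torch.autograd.grad(l2, w)[0]
print(g2)   # magnitudes scale with |w|: 10.0, 0.2, 0.02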
The reason the L1 norm yields a sparse solution is its special shape: it has spikes (corners) that happen to sit at sparse points, i.e. on the axes. Using it to touch the solution surface will very likely find a touch point on a spike tip, and thus a sparse solution.
L1 regularization is the preferred choice when there is a high number of features, as it provides sparse solutions. We also gain a computational advantage, because features with zero coefficients can simply be skipped. The regression model that uses the L1 regularization technique is called Lasso regression.
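As an illustrative follow-up (this snippet is an assumption built on the question's net and its lin1 layer, not part of the original code), once the first-layer weights really are sparse you can read off which input features survived:

# Hypothetical post-training check: a feature is effectively dropped if its entire
# column of first-layer weights is (near) zero.
with torch.no_grad():
    W = net.lin1.weight             # shape (nh, nf): one column per input feature
    col_norms = W.abs().sum(dim=0)  # L1 norm of each feature's column
    kept = (col_norms > 1e-6).nonzero(as_tuple=True)[0]
    print('selected features:', kept.tolist())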
Your usage of layer.weight.data removes the parameter (which is a PyTorch variable) from its automatic differentiation context, making it a constant when the optimiser takes the gradients. This results in zero gradients from the penalty term, so the L1 loss is effectively never applied.
If you remove the .data, the norm is computed on the PyTorch variable itself and the gradients will be correct.
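For clarity, here is a minimal sketch of the corrected penalty, reusing the question's l1loss name; the only change is dropping .data:

def l1loss(layer):
    # Compute the norm on the parameter itself (not .data) so autograd tracks it
    # and the penalty contributes a gradient during backward().
    return torch.norm(layer.weight, p=1)

# used exactly as before inside the training loop:
# loss = crit(output, by) + lmbda * l1loss(net.lin1)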
For more information on PyTorch's automatic differentiation mechanics, see this docs article or this tutorial.