I'm trying to prune my model in PyTorch with torch.nn.utils.prune, which provides two tensors per pruned parameter: the original weights and a mask of 0s and 1s that zeroes out certain connections.
I have tried both of the approaches below, but neither improves the inference speed:
Is there a way to improve the speed using the weight tensor and the mask? Shouldn't multiplying a float by zero be faster than multiplying two non-zero floats?
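For context, here is a minimal sketch of the two tensors that torch.nn.utils.prune attaches to a layer (weight_orig and weight_mask are the actual attribute names it uses):
import torch
from torch import nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4, 4)
prune.l1_unstructured(layer, name='weight', amount=0.5)
print([n for n, _ in layer.named_parameters()])  # ['bias', 'weight_orig']
print([n for n, _ in layer.named_buffers()])     # ['weight_mask']
# layer.weight is recomputed as weight_orig * weight_mask on every forward pass
assert torch.equal(layer.weight, layer.weight_orig * layer.weight_mask)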
Here are my prune function and the procedure for measuring inference speed:
import copy
import torch
from torch import nn
import torch.nn.utils.prune as prune

def prune_net(net):
    """Prune the 20% of the net's weights with the smallest absolute value.

    Intended to be called once per pruning iteration.

    Args:
        net (nn.Module): the network to prune
    Return:
        newnet (nn.Module): a copy of net whose Conv2d/Linear layers carry
            pruning masks
    """
    if not isinstance(net, nn.Module):
        print('Invalid input. Must be nn.Module')
        return
    # deep copy, so that pruning newnet does not also modify the original net
    newnet = copy.deepcopy(net)
    modules_list = []
    for name, module in newnet.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            modules_list.append((module, 'weight'))
            if module.bias is not None:  # some layers have no bias
                modules_list.append((module, 'bias'))
    prune.global_unstructured(
        modules_list,
        pruning_method=prune.L1Unstructured,
        amount=0.2,
    )
    return newnet
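To sanity-check the pruning, you can measure the resulting global sparsity (a sketch, reusing the init_your_net placeholder from the test scripts below):
from torch import nn

pruned = prune_net(init_your_net())
zeros = total = 0
for module in pruned.modules():
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        total += module.weight.numel()
        zeros += (module.weight == 0).sum().item()
print('global weight sparsity: {:.1%}'.format(zeros / total))  # ~20% after one pass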
Test inference speed, 1st case (masks still attached to the modules):
import torch
from torch import nn
import torch.nn.utils.prune as prune
import torch.nn.functional as F
import time

torch.set_default_tensor_type('torch.cuda.FloatTensor')

old_net = init_your_net()
new_net = prune_net(old_net)
new_net = prune_net(new_net)

old_net.eval()
new_net.eval()
old_net = old_net.cuda()
new_net = new_net.cuda()
dataset = load_your_dataset()

time_old = 0.0  # accumulators were missing in the original
time_new = 0.0
for i in range(100):
    x = dataset[i]
    x = x.cuda()
    y = x.cuda()  # same input for the baseline net

    # new infer
    torch.cuda.synchronize()  # CUDA calls are async; sync before reading the clock
    start_time = time.perf_counter()
    detections = new_net(x).data
    torch.cuda.synchronize()
    time_new += time.perf_counter() - start_time

    # old infer
    torch.cuda.synchronize()
    start_time = time.perf_counter()
    detections = old_net(y).data
    torch.cuda.synchronize()
    time_old += time.perf_counter() - start_time

print('old ', time_old)
print('new ', time_new)
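Note that with the masks still attached, the pruned net actually does more work than the original: each pruned tensor is recomputed as weight_orig * weight_mask by a forward pre-hook on every call. A small sketch to see the hook:
from torch import nn
import torch.nn.utils.prune as prune

layer = nn.Linear(8, 8)
prune.l1_unstructured(layer, name='weight', amount=0.2)
print(layer._forward_pre_hooks)
# OrderedDict([(0, <torch.nn.utils.prune.L1Unstructured object at 0x...>)])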
Test inference speed, 2nd case (masks baked in with prune.remove):
import torch
from torch import nn
import torch.nn.utils.prune as prune
import torch.nn.functional as F
import time

torch.set_default_tensor_type('torch.cuda.FloatTensor')

old_net = init_your_net()
new_net = prune_net(old_net)
new_net = prune_net(new_net)

# Apply the masks to the underlying tensors and drop them from the state_dict
for name, module in new_net.named_modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        prune.remove(module, 'weight')
        if module.bias is not None:  # only biases that were actually pruned
            prune.remove(module, 'bias')

old_net.eval()
new_net.eval()
old_net = old_net.cuda()
new_net = new_net.cuda()
dataset = load_your_dataset()

time_old = 0.0
time_new = 0.0
for i in range(100):
    x = dataset[i]
    x = x.cuda()
    y = x.cuda()  # same input for the baseline net

    # new infer
    torch.cuda.synchronize()  # CUDA calls are async; sync before reading the clock
    start_time = time.perf_counter()
    detections = new_net(x).data
    torch.cuda.synchronize()
    time_new += time.perf_counter() - start_time

    # old infer
    torch.cuda.synchronize()
    start_time = time.perf_counter()
    detections = old_net(y).data
    torch.cuda.synchronize()
    time_old += time.perf_counter() - start_time

print('old ', time_old)
print('new ', time_new)
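After prune.remove the reparametrization is gone, but the weight keeps its original shape and merely contains zeros, which is why this case is no faster either. A quick check (sketch):
from torch import nn
import torch.nn.utils.prune as prune

layer = nn.Linear(8, 8)
prune.l1_unstructured(layer, name='weight', amount=0.5)
prune.remove(layer, 'weight')
print(hasattr(layer, 'weight_mask'))  # False: mask and weight_orig are gone
print(layer.weight.shape)             # torch.Size([8, 8]): shape is unchanged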
UPDATE
I found that torch has a sparse module that can reduce memory usage if we prune enough parameters, but it doesn't support nn.Module yet, only Tensor objects. Here are some useful links:
https://github.com/pytorch/pytorch/issues/36214#issuecomment-619586452
https://pytorch.org/docs/stable/sparse.html
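A minimal sketch of that Tensor-level support (the ~95% sparsity here is an arbitrary choice; the savings only materialize at high sparsity):
import torch

w = torch.randn(1000, 1000)
w[torch.rand_like(w) < 0.95] = 0.0               # zero out ~95% of the entries
sw = w.to_sparse().coalesce()                    # COO format: indices + values
print(sw.values().numel(), 'stored values instead of', w.numel())
y = torch.sparse.mm(sw, torch.randn(1000, 16))   # sparse @ dense matmul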
It is important to understand the difference between unstructured pruning and structured pruning.
Structured pruning: the dimensions of the weight tensors are reduced by removing entire rows/columns of the tensors. This translates into removing neurons with all their incoming and outgoing connections (in dense layers) or entire convolutional filters (in convolutional layers).
Unstructured pruning: individual weights can be "removed" (zeroed-out) without constraints of the shape of the final tensor. This translates into removing individual connections between neurons (in dense layers) or removing individual weights of the convolutional filters (in convolutional layers). Notice that the resulting weight tensors can be sparse but maintain their original shape.
Currently, torch.nn.utils.prune implements pruning with masks: even its structured variants only zero out weights, so the tensors keep their full size. This hardly helps to reduce the inference cost, because GPUs are not optimized for sparse matrix multiplications. While you might want to reduce the dimensions of your weight tensors to reduce the number of floating-point operations, masked pruning produces weight tensors with many zeros but does not automatically reduce their size.
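You can verify that shape preservation directly (a minimal sketch):
import torch
from torch import nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(3, 8, kernel_size=3)
prune.l1_unstructured(conv, name='weight', amount=0.5)
print(conv.weight.shape)                         # torch.Size([8, 3, 3, 3]), unchanged
print((conv.weight == 0).float().mean().item())  # ~0.5: half the entries are zero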
Unstructured pruning can improve performance only when a lot of weights are removed. In that case, you can either rely on PyTorch sparse operations or try to find rows/columns that are entirely zero and can therefore be removed, as in the sketch below.
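For example, here is a rough sketch of slicing away all-zero output rows of a Linear layer (the matching slice of the next layer's input dimension is omitted):
import torch
from torch import nn

fc = nn.Linear(16, 8)
with torch.no_grad():
    fc.weight[2].zero_()  # pretend unstructured pruning zeroed out row 2
    fc.bias[2] = 0.0

keep = fc.weight.abs().sum(dim=1) != 0  # rows with at least one non-zero weight
smaller = nn.Linear(16, int(keep.sum()))
with torch.no_grad():
    smaller.weight.copy_(fc.weight[keep])
    smaller.bias.copy_(fc.bias[keep])
print(smaller.weight.shape)  # torch.Size([7, 16])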
Instead, if you want structured pruning that actually shrinks the tensors, you can take a look at TorchPruner, a library that I have developed for research purposes and that provides utilities to find the least important neurons and slice the weight tensors accordingly.