Pruning a model doesn't improve inference speed or reduce model size

I'm trying to prune my model in PyTorch with torch.nn.utils.prune, which provides two tensors:

  1. the original weight, and
  2. a mask containing 0s and 1s that lets us close certain connections in the network.

I have tried both of the following approaches, but neither improves inference speed:

  1. Use the pruned network to infer: the mask first closes some connections, and then inference is run.
  2. Zero out the original weights with the mask, remove the mask from the state_dict, and then infer.

Is there a way to improve the speed with the model tensor and the mask? Shouldn't multiplying a non-zero float by 0 be faster than multiplying two non-zero floats?
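For reference, here is a minimal sketch (using a standalone Conv2d layer as a stand-in for a real model) of what torch.nn.utils.prune actually stores: the layer keeps a dense weight_orig and a dense weight_mask, and the effective weight is their element-wise product, so the forward pass still performs a full dense computation.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(3, 16, kernel_size=3)
prune.l1_unstructured(conv, name='weight', amount=0.2)

# Both the original tensor and the binary mask are kept as dense tensors.
print(conv.weight_orig.shape)   # torch.Size([16, 3, 3, 3])
print(conv.weight_mask.shape)   # torch.Size([16, 3, 3, 3])
# The effective weight is weight_orig * weight_mask: same shape, ~20% zeros.
print(conv.weight.shape, (conv.weight == 0).float().mean().item())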
Here are my prune function and the procedure I use to measure inference speed:

import copy

import torch
from torch import nn
import torch.nn.utils.prune as prune


def prune_net(net):
    """Prune the 20% of the network's weights with the smallest absolute value.

    Intended to be called once per pruning iteration.

    Args:
        net (nn.Module): the network to prune.

    Return:
        newnet (nn.Module): a copy of net whose Conv2d/Linear parameters carry pruning masks.
    """
    if not isinstance(net, nn.Module):
        print('Invalid input. Must be nn.Module')
        return
    # Deep copy so that pruning does not modify the original network in place.
    newnet = copy.deepcopy(net)
    modules_list = []

    for name, module in newnet.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            modules_list += [(module, 'weight'), (module, 'bias')]

    prune.global_unstructured(
        modules_list,
        pruning_method=prune.L1Unstructured,
        amount=0.2,
    )
    return newnet
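For example, a quick sanity check (MyModel is a hypothetical placeholder for your own network): after prune_net, each pruned module carries *_mask buffers, and the overall fraction of zeroed weights should be roughly 20%.

net = prune_net(MyModel())
total, zeros = 0, 0
for name, buf in net.named_buffers():
    if name.endswith('_mask'):
        total += buf.numel()
        zeros += (buf == 0).sum().item()
print('global sparsity: %.1f%%' % (100.0 * zeros / total))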

Test inference speed 1st case:

import torch
from torch import nn
import torch.nn.utils.prune as prune
import torch.nn.functional as F
import time


torch.set_default_tensor_type('torch.cuda.FloatTensor')
old_net = init_your_net()

new_net = prune_net(old_net)
new_net = prune_net(new_net)

old_net.eval()
new_net.eval()

old_net = old_net.cuda()
new_net = new_net.cuda()
dataset = load_your_dataset()

time_old = 0.0
time_new = 0.0
for i in range(100):
    x = dataset[i]
    x = x.cuda()
    y = x.cuda()

    # new infer
    start_time = time.perf_counter()
    detections = new_net(x).data
    torch.cuda.synchronize()  # wait for the GPU so the measured time includes the forward pass
    time_new += time.perf_counter() - start_time

    # old infer
    start_time = time.perf_counter()
    detections = old_net(y).data
    torch.cuda.synchronize()
    time_old += time.perf_counter() - start_time

print('old ', time_old)
print('new ', time_new)

Test inference speed 2nd case:

import torch
from torch import nn
import torch.nn.utils.prune as prune
import torch.nn.functional as F
import time


torch.set_default_tensor_type('torch.cuda.FloatTensor')
old_net = init_your_net()

new_net = prune_net(old_net)
new_net = prune_net(new_net)
# Apply the masks to the model tensors and remove the masks from the state_dict
for name, module in new_net.named_modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        prune.remove(module, 'weight')
        prune.remove(module, 'bias')

old_net.eval()
new_net.eval()

old_net = old_net.cuda()
new_net = new_net.cuda()
dataset = load_your_dataset()

time_old = 0.0
time_new = 0.0
for i in range(100):
    x = dataset[i]
    x = x.cuda()
    y = x.cuda()

    # new infer
    start_time = time.perf_counter()
    detections = new_net(x).data
    torch.cuda.synchronize()  # wait for the GPU so the measured time includes the forward pass
    time_new += time.perf_counter() - start_time

    # old infer
    start_time = time.perf_counter()
    detections = old_net(y).data
    torch.cuda.synchronize()
    time_old += time.perf_counter() - start_time

print('old ', time_old)
print('new ', time_new)

UPDATE
I found that PyTorch has a torch.sparse module that can reduce memory usage if we prune enough parameters, but it doesn't support nn.Module yet, only Tensor objects. Here are some useful links:
https://github.com/pytorch/pytorch/issues/36214#issuecomment-619586452
https://pytorch.org/docs/stable/sparse.html
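For example (a sketch only, assuming the tensor is already very sparse), a heavily pruned weight tensor can be converted to a sparse layout by hand; this reduces storage for that tensor, but it does not make a whole nn.Module sparse and is not necessarily faster on GPU:

import torch

w = torch.randn(1024, 1024)
w[torch.rand_like(w) < 0.95] = 0.0       # pretend 95% of the weights were pruned

w_sparse = w.to_sparse()                 # COO layout: stores only the non-zero entries
x = torch.randn(1024, 32)
y_dense = w @ x
y_sparse = torch.sparse.mm(w_sparse, x)  # same result, computed from the sparse layout
print(torch.allclose(y_dense, y_sparse, atol=1e-5))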

asked Jun 11 '20 by manaclan
1 Answer

It is important to understand the difference between unstructured pruning and structured pruning.

  • Structured pruning: the dimensions of the weight tensors are reduced by removing entire rows/columns of the tensors. This translates into removing neurons with all their incoming and outgoing connections (in dense layers) or entire convolutional filters (in convolutional layers).

  • Unstructured pruning: individual weights can be "removed" (zeroed out) without constraints on the shape of the final tensor. This translates into removing individual connections between neurons (in dense layers) or individual weights of the convolutional filters (in convolutional layers). Notice that the resulting weight tensors can be sparse but keep their original shape.

Currently, torch.nn.utils.prune only prunes by masking (weights are zeroed out, never physically removed), which hardly helps to reduce the inference cost because GPUs are not optimized for sparse matrix multiplications. While you might want to reduce the dimensions of your weight tensors to cut the number of floating-point operations, unstructured pruning produces weight tensors with many zeros but does not automatically reduce their size.
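A small sketch to illustrate the point (a standalone Linear layer, not your model): after unstructured pruning, the weight tensor has exactly the same shape, and therefore costs the same number of multiply-adds; it simply contains zeros.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

fc = nn.Linear(512, 256)
prune.l1_unstructured(fc, name='weight', amount=0.5)

print(fc.weight.shape)                         # still torch.Size([256, 512])
print((fc.weight == 0).float().mean().item())  # ~0.5, yet every element is still multiplied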

Unstructured pruning can improve performance only when a large fraction of the weights is removed. In that case, you can either rely on PyTorch's sparse operations or try to find rows/columns that are entirely zero and can therefore be removed, as in the sketch below.
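As an illustration of the second option (a hypothetical helper, illustrative only; in a real network the following layer's input dimension must be adjusted to match), you could look for output rows of a pruned Linear layer that are entirely zero and build a physically smaller layer from the remaining rows:

import torch
import torch.nn as nn

def shrink_linear(fc: nn.Linear) -> nn.Linear:
    # Keep only the output neurons whose whole weight row is non-zero.
    keep = fc.weight.abs().sum(dim=1) != 0
    new_fc = nn.Linear(fc.in_features, int(keep.sum()), bias=fc.bias is not None)
    with torch.no_grad():
        new_fc.weight.copy_(fc.weight[keep])
        if fc.bias is not None:
            new_fc.bias.copy_(fc.bias[keep])
    return new_fc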

Instead, if you want to look into structured pruning, you can take a look at TorchPruner, a library that I have developed myself for research purposes and that provides utilities to find the least important neurons and slice the weight tensors accordingly.

answered by Marco Ancona