
Running PyTorch Quantized Model on CUDA GPU

I am confused about whether it is possible to run an int8 quantized model on CUDA. Or can you only train a quantized model on CUDA with FakeQuantize, for deployment on another backend such as a CPU?

I want to run the model on CUDA with actual int8 instructions instead of FakeQuantized float32 instructions, and enjoy the efficiency gains. The PyTorch docs are strangely nonspecific about this. If it is possible to run a quantized model on CUDA with a different framework such as TensorFlow, I would love to know.

This is the code to prep my quantized model (using post-training quantization). The model is a normal CNN with nn.Conv2d, nn.LeakyReLU and nn.MaxPool modules:

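import copy

import torch
from torch.quantization import quantize_fx
from torch.utils.data import DataLoader

# models_dir, net_file, img_dir, train_data_csv and ImageDataset are not shown here.
model_fp = torch.load(models_dir+net_file)

# Work on a copy so the float model stays untouched.
model_to_quant = copy.deepcopy(model_fp)
model_to_quant.eval()
model_to_quant = quantize_fx.fuse_fx(model_to_quant)

qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}

# Insert observers, then move the prepared model to the GPU for calibration.
model_prepped = quantize_fx.prepare_fx(model_to_quant, qconfig_dict)
model_prepped.eval()
model_prepped.to(device='cuda:0')

train_data   = ImageDataset(img_dir, train_data_csv, 'cuda:0')
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, pin_memory=True)

# Calibrate on the first two batches only.
for i, (input, _) in enumerate(train_loader):
    if i > 1: break
    print('batch', i+1, end='\r')
    input = input.to('cuda:0')
    model_prepped(input)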

This actually quantizes the model:

model_quantised = quantize_fx.convert_fx(model_prepped)
model_quantised.eval()

This is an attempt to run the quantized model on CUDA. It raises a NotImplementedError; when I run it on CPU it works fine:

model_quantised = model_quantised.to('cuda:0')
for input, _ in train_loader:
    input = input.to('cuda:0')
    out = model_quantised(input)
    print(out, out.shape)
    break

This is the error:

Traceback (most recent call last):
  File "/home/adam/Desktop/thesis/Ship Detector/quantisation.py", line 54, in <module>
    out = model_quantised(input)
  File "/home/adam/.local/lib/python3.9/site-packages/torch/fx/graph_module.py", line 513, in wrapped_call
    raise e.with_traceback(None)
NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'QuantizedCUDA' backend. 
This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). 
If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 
'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].
Adam asked May 26 '26 00:05



1 Answer

According to [this][1] blog post, it looks like you cannot run quantized models on GPU with PyTorch's native API:

> Quantization in PyTorch is currently CPU-only. Quantization is not a CPU-specific technique (e.g. NVIDIA's TensorRT can be used to implement quantization on GPU). However, inference time on GPU is already usually "fast enough", and CPUs are more attractive for large-scale model server deployment (due to complex cost factors that are out of the scope of this article). Consequently, as of PyTorch 1.6, only CPU backends are available in the native API.

[1]: https://spell.ml/blog/pytorch-quantization-X8e7wBAAACIAHPhT#:~:text=Quantization%20in%20PyTorch%20is%20currently,to%20implement%20quantization%20on%20GPU).
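
To make the question's distinction between FakeQuantize and real int8 concrete, here is a minimal sketch. It is not taken from the blog post, and the tensor values, scale and zero point are made up for illustration. It uses only CPU tensors: `torch.fake_quantize_per_tensor_affine` snaps values onto the 8-bit grid but still returns float32, while `torch.quantize_per_tensor` produces a tensor that actually stores 8-bit integers.

```python
import torch

# Illustrative values only; scale and zero_point are arbitrary assumptions.
x = torch.tensor([-1.0, -0.3, 0.0, 0.42, 1.0])
scale, zero_point = 0.01, 128

# Fake quantization: values are rounded to the 8-bit grid,
# but the result is still an ordinary float32 tensor.
fq = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)

# Real quantization: the tensor stores unsigned 8-bit integers
# together with the scale and zero point needed to decode them.
q = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)

print(fq.dtype)      # float dtype of the fake-quantized tensor
print(q.dtype)       # quantized dtype of the real quantized tensor
print(q.int_repr())  # the raw stored integers
print(torch.allclose(q.dequantize(), fq))  # same values once decoded
```

For the model in the question, this suggests keeping the converted model and its inputs on the CPU (for example `model_quantised.to('cpu')`), which is the configuration the question already reports as working.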

user3303020 answered May 28 '26 14:05
