I am confused about whether it is possible to run an int8 quantized model on CUDA, or can you only train a quantized model on CUDA with fakequantise for deployment on another backend such as a CPU.
I want to run the model on CUDA with actual int8 instructions instead of FakeQuantised float32 instructions, and enjoy the efficiency gains. Pytorch docs are strangely nonspecific about this. If it is possible to run a quantized model on CUDA with a different framework such as TensorFlow I would love to know.
This is the code to prep my quantized model (using post-training quantization). The model is normal CNN with nn.Conv2d and nn.LeakyRelu and nn.MaxPool modules:
model_fp = torch.load(models_dir+net_file)
model_to_quant = copy.deepcopy(model_fp)
model_to_quant.eval()
model_to_quant = quantize_fx.fuse_fx(model_to_quant)
qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}
model_prepped = quantize_fx.prepare_fx(model_to_quant, qconfig_dict)
model_prepped.eval()
model_prepped.to(device='cuda:0')
train_data = ImageDataset(img_dir, train_data_csv, 'cuda:0')
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, pin_memory=True)
for i, (input, _) in enumerate(train_loader):
if i > 1: break
print('batch', i+1, end='\r')
input = input.to('cuda:0')
model_prepped(input)
This actually quantizes the model:
model_quantised = quantize_fx.convert_fx(model_prepped)
model_quantised.eval()
This is an attempt to run the quantized model on CUDA, and raises a NotImplementedError, when I run it on CPU it works fine:
model_quantised = model_quantised.to('cuda:0')
for i, _ in train_loader:
input = input.to('cuda:0')
out = model_quantised(input)
print(out, out.shape)
break
This is the error:
Traceback (most recent call last):
File "/home/adam/Desktop/thesis/Ship Detector/quantisation.py", line 54, in <module>
out = model_quantised(input)
File "/home/adam/.local/lib/python3.9/site-packages/torch/fx/graph_module.py", line 513, in wrapped_call
raise e.with_traceback(None)
NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'QuantizedCUDA' backend.
This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build).
If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions.
'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].
From [this][1] blog, it looks like you cannot run quantized models on GPU.
Quantization in PyTorch is currently CPU-only. Quantization is not a CPU-specific technique (e.g. NVIDIA's TensorRT can be used to implement quantization on GPU). However, inference time on GPU is already usually "fast enough", and CPUs are more attractive for large-scale model server deployment (due to complex cost factors that are out of the scope of this article). Consequently, as of PyTorch 1.6, only CPU backends are available in the native API.
[1]: https://spell.ml/blog/pytorch-quantization-X8e7wBAAACIAHPhT#:~:text=Quantization%20in%20PyTorch%20is%20currently,to%20implement%20quantization%20on%20GPU).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With