 

PyTorch's nn.Conv2d with half-precision (fp16) is slower than fp32

Tags: python, pytorch

I have found that a single 2D convolution operation with float16 is slower than with float32.

I am working with a GTX 1660 Ti, torch 1.8.0+cu111 and CUDA 11.1 (I also tried torch 1.9.0).

Dtype | in=1, out=64 | in=1, out=128 | in=64, out=128
fp16  | 3532 it/s    | 632 it/s      | 599 it/s
fp32  | 2160 it/s    | 1311 it/s     | 925 it/s

I am measuring the convolution speed with the following code.

import torch
from torch import nn
from tqdm import tqdm

ch_in, ch_out = 1, 64  # varied as per the table above

inputfp16 = torch.arange(0, ch_in*64*64).reshape(1, ch_in, 64, 64).type(torch.float16).to('cuda:0')
inputfp32 = torch.arange(0, ch_in*64*64).reshape(1, ch_in, 64, 64).type(torch.float32).to('cuda:0')

conv2d_16 = nn.Conv2d(ch_in, ch_out, 3, 1, 1).eval().to('cuda:0').half()
conv2d_32 = nn.Conv2d(ch_in, ch_out, 3, 1, 1).eval().to('cuda:0')

for i in tqdm(range(0, 50)):
    out = conv2d_16(inputfp16)
    out.cpu()  # copy to host to force GPU synchronization

for i in tqdm(range(0, 50)):
    out = conv2d_32(inputfp32)
    out.cpu()  # copy to host to force GPU synchronization
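
For what it's worth, here is a sketch of an alternative way the kernels could be timed, using CUDA events and explicit synchronization (just a sketch reusing conv2d_16/inputfp16 from above, not what produced the numbers in the table):

# Sketch: time 50 fp16 convolutions with CUDA events instead of tqdm.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    start.record()
    for i in range(50):
        out = conv2d_16(inputfp16)
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish

print(f"fp16: {start.elapsed_time(end) / 50:.3f} ms per convolution")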

It would be great if you could let me know whether you have had the same problem, and even better if you can provide a solution.

asked Sep 13 '25 by user15450029
1 Answer

Well, the problem lies in the fact that mixed/half-precision tensor calculations are accelerated via Tensor Cores.

Theoretically (and practically), Tensor Cores are designed to handle lower-precision matrix calculations: for example, they multiply two fp16 matrices and add the product to an fp32 accumulator.
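
In plain PyTorch the arithmetic looks roughly like this (an illustration of the accumulate pattern only, not actual Tensor Core code):

import torch

# D = A*B + C: multiply two fp16 matrices, accumulate the product in fp32.
A = torch.randn(16, 16, dtype=torch.float16)
B = torch.randn(16, 16, dtype=torch.float16)
C = torch.zeros(16, 16, dtype=torch.float32)  # fp32 accumulator
D = C + A.float() @ B.float()                 # product added to the accumulator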

Since the GTX 1660 Ti doesn't come with Tensor Cores, we can conclude that CUDA won't be able to accelerate mixed/half-precision operations on that GPU.
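
If you want a quick sanity check, you can look at the device name and compute capability (a rough heuristic sketch, not an official PyTorch API for Tensor Core detection; the GTX 16-series reports capability 7.5 yet ships without Tensor Cores):

import torch

# Heuristic sketch: compute capability >= 7.0 usually implies Tensor Cores,
# but the GTX 16-series (TU116/TU117) reports 7.5 without having any,
# so the device name has to be checked as well.
def likely_has_tensor_cores(device=0):
    name = torch.cuda.get_device_name(device)
    major, _ = torch.cuda.get_device_capability(device)
    if "GTX 16" in name:
        return False
    return major >= 7

print(torch.cuda.get_device_name(0), likely_has_tensor_cores(0))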

answered Sep 15 '25 by deepconsc