I have found that a single 2D convolution operation with float16 is slower than with float32.
I am working with a GTX 1660 Ti, torch 1.8.0+cu111, and CUDA 11.1 (I also tried torch 1.9.0).
| Dtype | in=1, out=64 | in=1, out=128 | in=64, out=128 |
|---|---|---|---|
| FP16 | 3532 it/s | 632 it/s | 599 it/s |
| FP32 | 2160 it/s | 1311 it/s | 925 it/s |
I am measuring the convolution speed with the following code.
import torch
import torch.nn as nn
from tqdm import tqdm

ch_in, ch_out = 1, 64  # one of the channel configurations from the table above

inputfp16 = torch.arange(0, ch_in * 64 * 64).reshape(1, ch_in, 64, 64).type(torch.float16).to('cuda:0')
inputfp32 = torch.arange(0, ch_in * 64 * 64).reshape(1, ch_in, 64, 64).type(torch.float32).to('cuda:0')
conv2d_16 = nn.Conv2d(ch_in, ch_out, 3, 1, 1).eval().to('cuda:0').half()
conv2d_32 = nn.Conv2d(ch_in, ch_out, 3, 1, 1).eval().to('cuda:0')

# fp16 convolutions
for i in tqdm(range(0, 50)):
    out = conv2d_16(inputfp16)
    out.cpu()

# fp32 convolutions
for i in tqdm(range(0, 50)):
    out = conv2d_32(inputfp32)
    out.cpu()
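As a side note, GPU timings like this are usually more reliable with a warm-up phase and explicit synchronization, since CUDA kernels launch asynchronously. Below is a minimal timing sketch along those lines; the helper name time_conv, the warm-up count, and the iteration count are my own choices, not part of the original measurement.

import time
import torch

def time_conv(conv, inp, warmup=10, iters=50):
    # warm-up so cuDNN can select its algorithms before timing starts
    for _ in range(warmup):
        conv(inp)
    torch.cuda.synchronize()  # wait for warm-up kernels to finish
    start = time.perf_counter()
    for _ in range(iters):
        conv(inp)
    torch.cuda.synchronize()  # wait for the timed kernels to finish
    return iters / (time.perf_counter() - start)  # iterations per second

# e.g. compare the two layers defined above:
# print(time_conv(conv2d_16, inputfp16), time_conv(conv2d_32, inputfp32))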
It would be great if you could let me know whether you have run into the same problem, and even better if you can suggest a solution.
Well, the problem lies in the fact that mixed/half-precision tensor calculations are accelerated via Tensor Cores. Both in theory and in practice, Tensor Cores are designed for lower-precision matrix math: for instance, the product of two fp16 matrices is added to an fp32 accumulator. Since the GTX 1660 Ti does not come with Tensor Cores, CUDA cannot use them to accelerate mixed/half-precision computation on that GPU, so fp16 offers no speed advantage there.
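On GPUs that do have Tensor Cores (e.g. the RTX 20/30 series), the usual way to benefit from them is PyTorch's automatic mixed precision rather than converting the whole model with .half(). Here is a minimal sketch; the layer shape is just the in=64, out=128 case from the table, and any speed-up depends on the hardware.

import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, 3, 1, 1).eval().to('cuda:0')  # weights stay in fp32
x = torch.randn(1, 64, 64, 64, device='cuda:0')

# autocast runs eligible ops (convolutions, matmuls) in fp16 while keeping the
# rest in fp32; on a GPU without Tensor Cores it gives little or no speed-up
with torch.no_grad(), torch.cuda.amp.autocast():
    out = conv(x)

print(out.dtype)  # torch.float16: the conv output is computed in half precision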