I have found that a single 2D convolution operation with float16 is slower than with float32.
I am working with a GTX 1660 Ti, torch 1.8.0+cu111, and CUDA 11.1 (I also tried torch 1.9.0).
| Dtype | in=1, out=64 | in=1, out=128 | in=64, out=128 |
|---|---|---|---|
| FP16 | 3532 it/s | 632 it/s | 599 it/s |
| FP32 | 2160 it/s | 1311 it/s | 925 it/s |
I am measuring the convolution speed with the following code.
import torch
import torch.nn as nn
from tqdm import tqdm

ch_in, ch_out = 1, 64  # one of the channel configurations from the table above

inputfp16 = torch.arange(0, ch_in * 64 * 64).reshape(1, ch_in, 64, 64).type(torch.float16).to('cuda:0')
inputfp32 = torch.arange(0, ch_in * 64 * 64).reshape(1, ch_in, 64, 64).type(torch.float32).to('cuda:0')
conv2d_16 = nn.Conv2d(ch_in, ch_out, 3, 1, 1).eval().to('cuda:0').half()
conv2d_32 = nn.Conv2d(ch_in, ch_out, 3, 1, 1).eval().to('cuda:0')

# fp16 convolutions
for i in tqdm(range(0, 50)):
    out = conv2d_16(inputfp16)
    out.cpu()

# fp32 convolutions
for i in tqdm(range(0, 50)):
    out = conv2d_32(inputfp32)
    out.cpu()
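As a side note, GPU timings like this are usually more reliable with a warm-up phase and explicit synchronization, since CUDA kernels launch asynchronously. Below is a minimal timing sketch along those lines; the helper name time_conv, the warm-up count, and the iteration count are my own choices, not part of the original measurement.

import time
import torch

def time_conv(conv, inp, warmup=10, iters=50):
    # warm-up so cuDNN can select its algorithms before timing starts
    for _ in range(warmup):
        conv(inp)
    torch.cuda.synchronize()  # wait for warm-up kernels to finish
    start = time.perf_counter()
    for _ in range(iters):
        conv(inp)
    torch.cuda.synchronize()  # wait for the timed kernels to finish
    return iters / (time.perf_counter() - start)  # iterations per second

# e.g. compare the two layers defined above:
# print(time_conv(conv2d_16, inputfp16), time_conv(conv2d_32, inputfp32))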
It would be great if you could let me know whether you have run into the same problem, and even better if you can suggest a solution.
Well, the problem lies in the fact that mixed/half-precision tensor calculations are accelerated via Tensor Cores. Both in theory and in practice, Tensor Cores are designed for lower-precision matrix math: for instance, the product of two fp16 matrices is added to an fp32 accumulator. Since the GTX 1660 Ti does not come with Tensor Cores, CUDA cannot use them to accelerate mixed/half-precision computation on that GPU, so fp16 offers no speed advantage there.
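On GPUs that do have Tensor Cores (e.g. the RTX 20/30 series), the usual way to benefit from them is PyTorch's automatic mixed precision rather than converting the whole model with .half(). Here is a minimal sketch; the layer shape is just the in=64, out=128 case from the table, and any speed-up depends on the hardware.

import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, 3, 1, 1).eval().to('cuda:0')  # weights stay in fp32
x = torch.randn(1, 64, 64, 64, device='cuda:0')

# autocast runs eligible ops (convolutions, matmuls) in fp16 while keeping the
# rest in fp32; on a GPU without Tensor Cores it gives little or no speed-up
with torch.no_grad(), torch.cuda.amp.autocast():
    out = conv(x)

print(out.dtype)  # torch.float16: the conv output is computed in half precision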