How much faster is NCHW compared to NHWC in TensorFlow/cuDNN?

Tags: tensorflow, gpu

The official TensorFlow performance guide states:

Most TensorFlow operations used by a CNN support both NHWC and NCHW data format. On GPU, NCHW is faster. But on CPU, NHWC is sometimes faster.

How much faster is NCHW compared to NHWC in TensorFlow/cuDNN, for convolution? Are there any references or benchmarks for this?

Also, why is it faster? As I understand it (see here), TensorFlow on GPU always handles NHWC by internally transposing to NCHW, calling the cuDNN conv kernel for NCHW, and then transposing the result back. But why does it do that? The cuDNN conv kernel also works for NHWC. Maybe at some point they did the comparison and the cuDNN conv kernel for NHWC was very slow. But is that still up to date? And how big was the difference? What are the technical reasons that NHWC is so much slower? Or is the cuDNN kernel for this case just not well optimized?

Albert asked May 31 '17 09:05


1 Answer

The reason is that most implementations of simple convolutions (leaving aside Winograd and FFT variants) end up doing some kind of matrix multiplication, which means that in their inner loop they multiply values from both tensors and sum the results.
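To make that inner loop concrete, here is a minimal pure-Python sketch of a direct "valid" convolution (strictly a cross-correlation, which is what deep learning frameworks usually compute) for a single image. The function name `conv2d_valid` and the nested-list layout are illustrative assumptions, not how TensorFlow or cuDNN implement it; the point is only that the innermost work is a multiply-accumulate over the kernel window and over C.

```python
def conv2d_valid(x, w):
    """Direct convolution sketch (hypothetical helper, not a library API).

    x: input as nested lists indexed [H][W][C]
    w: filter as nested lists indexed [KH][KW][C][K]
    Returns the output as nested lists indexed [OH][OW][K].
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    KH, KW, K = len(w), len(w[0]), len(w[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0.0] * K for _ in range(OW)] for _ in range(OH)]
    for oy in range(OH):
        for ox in range(OW):
            for k in range(K):
                acc = 0.0
                # Inner loops: multiply values from both tensors and sum them.
                for dy in range(KH):
                    for dx in range(KW):
                        for c in range(C):  # the reduction over C
                            acc += x[oy + dy][ox + dx][c] * w[dy][dx][c][k]
                out[oy][ox][k] = acc
    return out


# 2x2 image with one channel, 2x2 all-ones filter with one output channel.
image = [[[1], [2]], [[3], [4]]]
ones = [[[[1]], [[1]]], [[[1]], [[1]]]]
print(conv2d_valid(image, ones))  # [[[10.0]]]
```

Which of these loops gets vectorized or spread across threads is exactly where the memory layout starts to matter.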

On a CPU implementation using SSE or AVX, it's faster to do this along the C dimension, because you just multiply-add the values 4 or 8 at a time, and then do the reduction (summing your 4 or 8 partial accumulators) once at the end, after you have gone through the whole C dimension. This works well with NHWC, where the C values of a pixel are contiguous in memory.

On a GPU, however, doing a reduction across threads is a more costly operation (at least it was until Kepler introduced warp-level shuffle instructions), so historically kernels have been optimized so that each thread in a warp reads consecutive (in memory) HW values and does the accumulation over parts of C in a loop.
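The difference between the two cases comes down to which index is contiguous in memory. As a rough illustration, the flat offset of an element in each layout can be computed as follows (the helper names `offset_nhwc` and `offset_nchw` are made up for this sketch):

```python
def offset_nhwc(n, h, w, c, H, W, C):
    """Flat index of element (n, h, w, c) in a dense NHWC tensor."""
    return ((n * H + h) * W + w) * C + c


def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in a dense NCHW tensor."""
    return ((n * C + c) * H + h) * W + w


H, W, C = 4, 5, 3

# NHWC: moving to the next channel moves one element in memory, which
# suits a SIMD load of 4 or 8 consecutive C values on a CPU.
step_c_nhwc = offset_nhwc(0, 0, 0, 1, H, W, C) - offset_nhwc(0, 0, 0, 0, H, W, C)

# NCHW: moving to the next pixel in a row moves one element, so neighbouring
# threads reading neighbouring pixels touch consecutive addresses...
step_w_nchw = offset_nchw(0, 0, 0, 1, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)

# ...while moving to the next channel jumps a whole H*W plane.
step_c_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)

print(step_c_nhwc, step_w_nchw, step_c_nchw)  # 1 1 20
```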

Note, though, that the latest NVIDIA cards (RTX) now have Tensor Cores, which can process small blocks in one operation, including the reduction over a small portion of C, so on these cards it's actually faster to use NHWC (or hybrid NCHWC formats).
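As a sketch of what a hybrid format can look like, assume a blocked layout in which C is split into groups of `cb` channels and each group is stored innermost, giving a shape of [N][C/cb][H][W][cb]. The block size and the helper name `offset_nchwc` below are assumptions for illustration; actual block sizes depend on the library and data type.

```python
def offset_nchwc(n, c, h, w, C, H, W, cb):
    """Flat index in a blocked layout [N][C // cb][H][W][cb] (illustrative).

    Assumes C is a multiple of the block size cb.
    """
    assert C % cb == 0, "this sketch assumes C is a multiple of cb"
    outer, inner = divmod(c, cb)  # which channel block, and position inside it
    return (((n * (C // cb) + outer) * H + h) * W + w) * cb + inner


C, H, W, cb = 8, 2, 2, 4

# Channels 0..3 of one pixel sit next to each other, so a small block of C
# can be loaded and reduced in one step.
block = [offset_nchwc(0, c, 0, 0, C, H, W, cb) for c in range(cb)]
print(block)  # [0, 1, 2, 3]

# The next channel block starts a whole H * W * cb plane later.
print(offset_nchwc(0, cb, 0, 0, C, H, W, cb))  # 16
```

The layout keeps a small run of C contiguous for the block-wise reduction while still iterating over HW within each channel block.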

stephane.c answered Sep 23 '22 08:09