What is the difference between tf.nn.conv2d and tf.nn.depthwise_conv2d in TensorFlow?
I am no expert on this, but as far as I understand, the difference is this:
Let's say you have an input colour image of height 100 and width 100, so its dimensions are 100x100x3. For both examples we use the same filter with width and height 5. Let's say we want the next layer to have a depth of 8.
In tf.nn.conv2d you define the kernel shape as [filter_height, filter_width, in_channels, out_channels]. In our case this means the kernel has shape [5,5,3,8]. Each weight kernel that is strided over the image has a shape of 5x5x3, and there are 8 such kernels, one per output channel, so striding them over the whole image produces 8 different feature maps.
In tf.nn.depthwise_conv2d you define the kernel shape as [filter_height, filter_width, in_channels, channel_multiplier]. Now the output is produced differently. Separate filters of 5x5x1 are strided over each channel of the input image, one filter per channel, each producing one feature map. So here, a kernel of shape [5,5,3,1] would produce an output with depth 3. The channel_multiplier tells you how many different filters you want to apply per channel. So the originally desired output depth of 8 is not possible with 3 input channels; only multiples of 3 are possible.
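The "only multiples of 3" rule can be checked with a few lines of plain Python. This is a minimal sketch of the depth arithmetic only, not TensorFlow code; the helper names depthwise_out_depth and depthwise_multiplier_for are made up for illustration.

```python
def depthwise_out_depth(kernel_shape):
    """Output depth for a depthwise kernel [h, w, in_channels, channel_multiplier]."""
    _, _, in_channels, channel_multiplier = kernel_shape
    return in_channels * channel_multiplier


def depthwise_multiplier_for(in_channels, wanted_depth):
    """Return the channel_multiplier that yields wanted_depth, or None if impossible."""
    if wanted_depth % in_channels != 0:
        return None
    return wanted_depth // in_channels


print(depthwise_out_depth([5, 5, 3, 1]))   # 3
print(depthwise_multiplier_for(3, 8))      # None, since 8 is not a multiple of 3
print(depthwise_multiplier_for(3, 9))      # 3
```

With conv2d there is no such restriction, because out_channels is chosen freely in the last kernel dimension.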
Let's look at the formulas in the TensorFlow API docs (r1.7).
For depthwise_conv2d:
    output[b, i, j, k * channel_multiplier + q] =
        sum_{di, dj} input[b, strides[1] * i + rate[0] * di,
                              strides[2] * j + rate[1] * dj, k] *
                     filter[di, dj, k, q]
where filter is [filter_height, filter_width, in_channels, channel_multiplier].
For conv2d:
    output[b, i, j, k] =
        sum_{di, dj, q} input[b, strides[1] * i + di,
                                 strides[2] * j + dj, q] *
                        filter[di, dj, q, k]
where filter is [filter_height, filter_width, in_channels, out_channels].
Focusing on k and q, we can see the difference described above.
The default format is NHWC, where b is the batch index and (i, j) is a coordinate in the feature map.
(Note that k and q refer to different things in these two functions.)
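To make the roles of k and q concrete, here is a naive NumPy transcription of the two formulas. It is only a sketch under simplifying assumptions (stride 1, rate 1, VALID padding, NHWC layout), not how TensorFlow implements these ops; the function names conv2d_ref and depthwise_conv2d_ref are hypothetical.

```python
import numpy as np


def conv2d_ref(x, f):
    """Naive conv2d. x: [b, h, w, in_channels], f: [fh, fw, in_channels, out_channels]."""
    b, h, w, c_in = x.shape
    fh, fw, f_in, c_out = f.shape
    assert c_in == f_in
    out = np.zeros((b, h - fh + 1, w - fw + 1, c_out))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + fh, j:j + fw, :]            # [b, fh, fw, c_in]
            # Sum over di, dj AND the input channel q.
            out[:, i, j, :] = np.einsum("bdeq,deqk->bk", patch, f)
    return out


def depthwise_conv2d_ref(x, f):
    """Naive depthwise_conv2d. f: [fh, fw, in_channels, channel_multiplier]."""
    b, h, w, c_in = x.shape
    fh, fw, f_in, mult = f.shape
    assert c_in == f_in
    out = np.zeros((b, h - fh + 1, w - fw + 1, c_in * mult))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + fh, j:j + fw, :]            # [b, fh, fw, c_in]
            # Sum over di, dj only; each input channel k stays separate.
            per_channel = np.einsum("bdek,dekq->bkq", patch, f)   # [b, c_in, mult]
            # Flattening puts (k, q) at output index k * channel_multiplier + q.
            out[:, i, j, :] = per_channel.reshape(b, c_in * mult)
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 6, 3))
print(conv2d_ref(x, rng.normal(size=(5, 5, 3, 8))).shape)            # (2, 2, 2, 8)
print(depthwise_conv2d_ref(x, rng.normal(size=(5, 5, 3, 2))).shape)  # (2, 2, 2, 6)
```

The only structural difference between the two functions is which indices the einsum sums over: conv2d contracts the input channel, depthwise_conv2d keeps it.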
In depthwise_conv2d, k refers to an input channel and q, with 0 <= q < channel_multiplier, indexes the outputs produced from that input channel. Each input channel k is expanded to channel_multiplier output channels using its own [filter_height, filter_width, channel_multiplier] slice of the filter, so the output has in_channels * channel_multiplier channels. It does not perform any cross-channel operation; in some literature it is referred to as channel-wise spatial convolution. The process can be summarized as applying each filter separately to its own channel and concatenating the outputs.
In conv2d, k refers to an output channel and q refers to an input channel. It sums over all input channels, meaning that each output channel k is connected to all q input channels by a [filter_height, filter_width, in_channels] filter.
For example,
input_size: (_, 14, 14, 32)
filter of conv2d: (3, 3, 32, 64)
params of conv2d filter: 3x3x32x64
filter of depthwise_conv2d: (3, 3, 32, 64)
params of depthwise_conv2d filter: 3x3x32x64
suppose stride = 1 with padding, then
output of conv2d: (_, 14, 14, 64)
output of depthwise_conv2d: (_, 14, 14, 32*64)
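The numbers in this example can be reproduced with plain Python. This is a sketch of the shape and parameter arithmetic only, assuming stride 1 with padding that preserves the spatial size; the helper names are made up for illustration and None stands in for the batch size written as "_" above.

```python
def filter_params(filter_shape):
    """Number of weights in a filter of the given shape."""
    count = 1
    for dim in filter_shape:
        count *= dim
    return count


def conv2d_output_shape(input_shape, filter_shape):
    """conv2d: the last filter dimension is the output depth."""
    batch, height, width, _ = input_shape
    return (batch, height, width, filter_shape[3])


def depthwise_output_shape(input_shape, filter_shape):
    """depthwise_conv2d: the last filter dimension multiplies the input depth."""
    batch, height, width, in_channels = input_shape
    return (batch, height, width, in_channels * filter_shape[3])


input_shape = (None, 14, 14, 32)
filt = (3, 3, 32, 64)

print(filter_params(filt))                        # 18432 for both ops
print(conv2d_output_shape(input_shape, filt))     # (None, 14, 14, 64)
print(depthwise_output_shape(input_shape, filt))  # (None, 14, 14, 2048)
```

So the same filter shape carries the same number of weights in both ops, but depthwise_conv2d interprets the last dimension as a per-channel multiplier and yields 32 times as many output channels here.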
Some more insights:
depthwise_conv2d is often followed by a pointwise convolution (a 1x1 convolution that mixes the channels, typically for reduction purposes); together they make up a separable_conv2d. Check Xception and MobileNet for more details.
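As a rough illustration of why the depthwise plus pointwise combination is attractive, the weight counts can be compared in plain Python. This is a back-of-the-envelope sketch, assuming no biases and a channel_multiplier of 1 by default, and reusing the 3x3, 32 to 64 channel sizes from the example above; the helper names are hypothetical.

```python
def conv2d_params(fh, fw, in_channels, out_channels):
    """Weights in a regular conv2d filter."""
    return fh * fw * in_channels * out_channels


def separable_params(fh, fw, in_channels, out_channels, channel_multiplier=1):
    """Weights in a depthwise filter followed by a 1x1 pointwise filter."""
    depthwise = fh * fw * in_channels * channel_multiplier
    pointwise = 1 * 1 * (in_channels * channel_multiplier) * out_channels
    return depthwise + pointwise


print(conv2d_params(3, 3, 32, 64))      # 18432
print(separable_params(3, 3, 32, 64))   # 2336 = 288 depthwise + 2048 pointwise
```

Both variants map 32 input channels to 64 output channels; the separable version does the spatial filtering per channel first and leaves the cross-channel mixing to the 1x1 step.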