
Difference between tf.nn.conv2d and tf.nn.depthwise_conv2d

What is the difference between tf.nn.conv2d and tf.nn.depthwise_conv2d in TensorFlow?

Chaine asked May 28 '17 11:05



2 Answers

I am no expert on this, but as far as I understand the difference is this:

Let's say you have an input colour image of height 100 and width 100, so its dimensions are 100x100x3. For both examples we use filters of height and width 5, and we want the next layer to have a depth of 8.

In tf.nn.conv2d you define the kernel shape as [filter_height, filter_width, in_channels, out_channels]. In our case this means the kernel has shape [5,5,3,8]. Each individual filter that is strided over the image has a shape of 5x5x3 and spans all 3 input channels. There are 8 such filters, each strided over the whole image once, which produces 8 different feature maps.

In tf.nn.depthwise_conv2d you define the kernel shape as [filter_height, filter_width, in_channels, channel_multiplier]. Now the output is produced differently. Separate filters of 5x5x1 are strided over each channel of the input image, one filter per channel, each producing one feature map. So here, a kernel of shape [5,5,3,1] would produce an output with depth 3. The channel_multiplier tells you how many different filters you want to apply per channel, so the output depth is in_channels * channel_multiplier. The originally desired output depth of 8 is therefore not possible with 3 input channels. Only multiples of 3 are possible.
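The depth arithmetic described above can be sketched in plain Python. The helper names `conv2d_out_depth` and `depthwise_out_depth` are made up for this illustration and are not part of TensorFlow; they only read the last two kernel dimensions.

```python
def conv2d_out_depth(kernel_shape):
    """Output depth of a standard convolution: the last kernel dimension."""
    _, _, in_channels, out_channels = kernel_shape
    return out_channels


def depthwise_out_depth(kernel_shape):
    """Output depth of a depthwise convolution: in_channels * channel_multiplier."""
    _, _, in_channels, channel_multiplier = kernel_shape
    return in_channels * channel_multiplier


# 100x100x3 input and 5x5 filters, as in the example above.
conv_depth = conv2d_out_depth([5, 5, 3, 8])
depthwise_depths = [depthwise_out_depth([5, 5, 3, m]) for m in range(1, 5)]

print(conv_depth)        # 8
print(depthwise_depths)  # [3, 6, 9, 12]
```

With 3 input channels, no channel_multiplier yields a depth of 8, while conv2d reaches it directly.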

dutchJSCOOP answered Oct 20 '22 03:10



Let's look at the formulas given in the TensorFlow API documentation (r1.7).

For depthwise_conv2d,

output[b, i, j, k * channel_multiplier + q] =
    sum_{di, dj} input[b, strides[1] * i + rate[0] * di,
                          strides[2] * j + rate[1] * dj, k] *
                 filter[di, dj, k, q]

filter is [filter_height, filter_width, in_channels, channel_multiplier]
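As a rough illustration, the depthwise formula can be transcribed almost literally into plain Python loops over nested lists. This is only a sketch: the function name `depthwise_conv2d_naive` is hypothetical, it assumes VALID padding and a single stride and rate shared by both spatial directions, and it is far slower than the real op.

```python
def depthwise_conv2d_naive(inp, filt, stride=1, rate=1):
    """Direct transcription of the depthwise formula for nested lists.

    inp:  [batch][height][width][in_channels]
    filt: [filter_height][filter_width][in_channels][channel_multiplier]
    Assumes VALID padding and equal stride/rate in both spatial directions.
    """
    fh, fw = len(filt), len(filt[0])
    in_ch, mult = len(filt[0][0]), len(filt[0][0][0])
    in_h, in_w = len(inp[0]), len(inp[0][0])
    out_h = (in_h - rate * (fh - 1) - 1) // stride + 1
    out_w = (in_w - rate * (fw - 1) - 1) // stride + 1

    out = [[[[0.0] * (in_ch * mult) for _ in range(out_w)]
            for _ in range(out_h)] for _ in range(len(inp))]
    for b in range(len(inp)):
        for i in range(out_h):
            for j in range(out_w):
                for k in range(in_ch):        # input channel
                    for q in range(mult):     # filter index within channel k
                        acc = 0.0
                        for di in range(fh):
                            for dj in range(fw):
                                acc += (inp[b][stride * i + rate * di]
                                           [stride * j + rate * dj][k]
                                        * filt[di][dj][k][q])
                        out[b][i][j][k * mult + q] = acc
    return out


# One 2x2 image with 2 channels: channel 0 holds 1..4, channel 1 holds 10..40.
inp = [[[[1, 10], [2, 20]], [[3, 30], [4, 40]]]]
# 2x2 filters, channel_multiplier = 2: weight 1 for q = 0, weight 2 for q = 1.
filt = [[[[1, 2], [1, 2]] for _ in range(2)] for _ in range(2)]

depthwise_out = depthwise_conv2d_naive(inp, filt)
print(depthwise_out)  # [[[[10.0, 20.0, 100.0, 200.0]]]]
```

Each output channel only ever sees one input channel: the sums 10 and 100 are never mixed.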

For conv2d,

output[b, i, j, k] =
    sum_{di, dj, q} input[b, strides[1] * i + di,
                             strides[2] * j + dj, q] *
                    filter[di, dj, q, k]

filter is [filter_height, filter_width, in_channels, out_channels]
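The conv2d formula can be transcribed into plain Python loops in the same spirit. Again this is only a sketch: `conv2d_naive` is a hypothetical name, and the code assumes VALID padding and one stride shared by both spatial directions.

```python
def conv2d_naive(inp, filt, stride=1):
    """Direct transcription of the conv2d formula for nested lists.

    inp:  [batch][height][width][in_channels]
    filt: [filter_height][filter_width][in_channels][out_channels]
    Assumes VALID padding and equal stride in both spatial directions.
    """
    fh, fw = len(filt), len(filt[0])
    in_ch, out_ch = len(filt[0][0]), len(filt[0][0][0])
    in_h, in_w = len(inp[0]), len(inp[0][0])
    out_h = (in_h - fh) // stride + 1
    out_w = (in_w - fw) // stride + 1

    out = [[[[0.0] * out_ch for _ in range(out_w)]
            for _ in range(out_h)] for _ in range(len(inp))]
    for b in range(len(inp)):
        for i in range(out_h):
            for j in range(out_w):
                for k in range(out_ch):            # output channel
                    acc = 0.0
                    for di in range(fh):
                        for dj in range(fw):
                            for q in range(in_ch):  # summed over input channels
                                acc += (inp[b][stride * i + di][stride * j + dj][q]
                                        * filt[di][dj][q][k])
                    out[b][i][j][k] = acc
    return out


# One 2x2 image with 2 channels: channel 0 holds 1..4, channel 1 holds 10..40.
inp = [[[[1, 10], [2, 20]], [[3, 30], [4, 40]]]]
# 2x2 filters, 2 output channels: weight 1 for k = 0, weight 2 for k = 1.
filt = [[[[1, 2], [1, 2]] for _ in range(2)] for _ in range(2)]

conv_out = conv2d_naive(inp, filt)
print(conv_out)  # [[[[110.0, 220.0]]]]
```

Here both input channels contribute to every output channel (10 + 100 = 110), which is the cross-channel sum over q in the formula.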

The difference lies in the roles that k and q play in the two formulas.

The default data format is NHWC, where b indexes the batch and (i, j) is a coordinate in the feature map.

(Note that k and q refer to different things in these two functions.)

  1. For depthwise_conv2d, k refers to an input channel and q, with 0 <= q < channel_multiplier, indexes the filters applied to that channel. Each input channel k is expanded into channel_multiplier output channels, using a different [filter_height, filter_width] filter for each. No cross-channel operation takes place; in some literature this is referred to as channel-wise spatial convolution. The process can be summarized as applying a separate set of filters to each input channel and concatenating the outputs.
  2. For conv2d, k refers to an output channel and q refers to an input channel. The sum runs over all input channels, meaning that each output channel k is connected to every input channel q through a [filter_height, filter_width, in_channels] filter.
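To make the depthwise output layout concrete, the index expression k * channel_multiplier + q can be enumerated for a small case. The sizes below (3 input channels, channel_multiplier of 2) are arbitrary values chosen for illustration.

```python
in_channels = 3
channel_multiplier = 2

# Each entry is (output channel, input channel k, filter index q).
layout = [(k * channel_multiplier + q, k, q)
          for k in range(in_channels)
          for q in range(channel_multiplier)]

for out_ch, k, q in layout:
    print(f"output channel {out_ch} <- input channel {k}, filter {q}")
```

The output channels derived from the same input channel end up next to each other, so the result has in_channels * channel_multiplier channels in total.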

For example,

input_size: (_, 14, 14, 32)
filter of conv2d: (3, 3, 32, 64)
params of conv2d filter: 3x3x32x64
filter of depthwise_conv2d: (3, 3, 32, 64)
params of depthwise_conv2d filter: 3x3x32x64

Suppose stride = 1 with SAME padding, then

output of conv2d: (_, 14, 14, 64)
output of depthwise_conv2d: (_, 14, 14, 32*64)
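These numbers can be checked with a few lines of plain Python. The helper `same_padding_output_shape` is a hypothetical name, and the sketch assumes SAME padding, for which the spatial output size is ceil(input_size / stride).

```python
import math


def same_padding_output_shape(input_shape, filter_shape, stride=1, depthwise=False):
    """Output shape for an NHWC input with SAME padding (illustrative helper)."""
    batch, in_h, in_w, in_ch = input_shape
    _, _, filter_in_ch, last_dim = filter_shape
    if filter_in_ch != in_ch:
        raise ValueError("filter in_channels must match input channels")
    out_h = math.ceil(in_h / stride)
    out_w = math.ceil(in_w / stride)
    # conv2d: last_dim is out_channels; depthwise: it is channel_multiplier.
    out_ch = in_ch * last_dim if depthwise else last_dim
    return (batch, out_h, out_w, out_ch)


input_shape = (None, 14, 14, 32)  # None stands for the unknown batch size "_"
filter_shape = (3, 3, 32, 64)

num_params = math.prod(filter_shape)
conv_shape = same_padding_output_shape(input_shape, filter_shape)
depthwise_shape = same_padding_output_shape(input_shape, filter_shape, depthwise=True)

print(num_params)       # 18432
print(conv_shape)       # (None, 14, 14, 64)
print(depthwise_shape)  # (None, 14, 14, 2048)
```

Both filters hold the same 18432 weights, but the depthwise version keeps the per-channel results separate instead of summing them, hence the 32*64 = 2048 output channels.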

Some more insights:

  • A standard convolution can be split into 2 steps: a depthwise convolution followed by a reduction (sum) over the input channels.
  • Depthwise convolution is equivalent to group convolution with the number of groups set to the number of input channels.
  • Usually, depthwise_conv2d is followed by a pointwise convolution (a 1x1 convolution for reduction purposes), making a separable_conv2d. See Xception and MobileNet for more details.
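One reason the depthwise plus pointwise combination is popular is its parameter count. The sketch below compares it with a standard convolution for the 32-to-64 channel example, assuming a channel_multiplier of 1 for the depthwise step (an assumption made here, not something stated above).

```python
in_channels, out_channels = 32, 64
filter_height = filter_width = 3
channel_multiplier = 1  # assumption: one depthwise filter per input channel

# Standard convolution: one [3, 3, 32, 64] filter.
standard_params = filter_height * filter_width * in_channels * out_channels

# Separable convolution: a [3, 3, 32, 1] depthwise filter
# followed by a [1, 1, 32, 64] pointwise filter.
depthwise_params = filter_height * filter_width * in_channels * channel_multiplier
pointwise_params = 1 * 1 * (in_channels * channel_multiplier) * out_channels
separable_params = depthwise_params + pointwise_params

print(standard_params)   # 18432
print(separable_params)  # 2336
```

Under these assumptions the separable version needs roughly an eighth of the weights, at the cost of a more restricted family of filters.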
drowsyleilei answered Oct 20 '22 05:10
