I wish to implement an operation similar to 2D convolution in TensorFlow. As per my understanding, the most common approach to implementing convolution is by first applying an im2col operation to the image (see here, subsection "Implementation as Matrix Multiplication") - an operation that transforms the image into a 2D matrix whose individual columns are the flattened "chunks" of the image to which the kernel is applied.
In other words, this excerpt from the above linked resource explains nicely what im2col does:

[...] For example, if the input is [227x227x3] (in the format height x width x n_channels) and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11*11*3 = 363. Iterating this process in the input at stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
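To make the numbers above concrete, here is a minimal NumPy sketch of im2col; the function name and loop structure are my own illustration, not taken from the linked resource:

import numpy as np

def im2col(x, k, stride):
    # x: single image of shape [height, width, channels]
    h, w, c = x.shape
    out = (h - k) // stride + 1  # locations along each spatial axis
    cols = np.empty((k * k * c, out * out), dtype=x.dtype)
    for i in range(out):
        for j in range(out):
            patch = x[i*stride:i*stride+k, j*stride:j*stride+k, :]
            cols[:, i * out + j] = patch.ravel()  # stretch block into a column
    return cols

x = np.random.rand(227, 227, 3).astype(np.float32)
print(im2col(x, 11, 4).shape)  # (363, 3025), matching the excerpt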
As I understand from the TensorFlow docs, that is what's done internally with tf.nn.conv2d as well.
Now, I would like to implement said im2col operation in TensorFlow separately, as I wish to have access to this intermediate result. Since this involves copying values in a non-trivial way, how would I build a relatively efficient computational graph for this operation myself? Similarly, how would one implement the reverse operation?
You can easily do this using extract_image_patches.
This function puts each filter_size x filter_size patch of the image into the depth dimension, yielding a [batch_size, height, width, 9] tensor in the example below (9 = 3 * 3 flattened patch values per spatial location for a single-channel image).
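For instance, a quick shape check (a minimal sketch with a 10x10 single-channel image and 3x3 patches):

import tensorflow as tf
import numpy as np

images = tf.convert_to_tensor(np.zeros((1, 10, 10, 1), np.float32))
patches = tf.extract_image_patches(images, [1, 3, 3, 1],
                                   [1, 1, 1, 1], [1, 1, 1, 1],
                                   padding='SAME')
print(patches.shape)  # (1, 10, 10, 9): each 3x3 patch flattened into depth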
To compare against tf.nn.conv2d, you can implement the Sobel operator for images:
import tensorflow as tf
import numpy as np

# A 10x10 single-channel test image with values 0..99
image = np.arange(10 * 10 * 1).reshape(1, 10, 10, 1)
images = tf.convert_to_tensor(image.astype(np.float32))

filter_size = 3
# Sobel kernel for horizontal gradients, reshaped to conv2d's
# [filter_height, filter_width, in_channels, out_channels] layout
sobel_x = tf.constant([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], tf.float32)
sobel_x_filter = tf.reshape(sobel_x, [3, 3, 1, 1])

# Put every 3x3 patch into the depth dimension: [1, 10, 10, 9]
image_patches = tf.extract_image_patches(images,
                                         [1, filter_size, filter_size, 1],
                                         [1, 1, 1, 1], [1, 1, 1, 1],
                                         padding='SAME')

# Convolution as an elementwise product of each flattened patch with
# the flattened kernel, summed over the depth axis
actual = tf.reduce_sum(tf.multiply(image_patches,
                                   tf.reshape(sobel_x_filter, [9])),
                       3, keep_dims=True)
expected = tf.nn.conv2d(images, sobel_x_filter,
                        strides=[1, 1, 1, 1], padding='SAME')

with tf.Session() as sess:
    print(sess.run(tf.reduce_sum(expected - actual)))
This gives you 0.0 as they are equivalent, and this approach does not need a reverse function.
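That said, if you do want the reverse mapping from the question (often called col2im), here is a hedged NumPy sketch for stride 1 and VALID padding; note it is the adjoint of im2col rather than a true inverse, since overlapping patch values are summed back, as in a convolution's gradient:

import numpy as np

def col2im(cols, h, w, c, k):
    # cols: [k*k*c, out_h*out_w] matrix as produced by im2col above.
    # Scatter each column back to its patch location, summing overlaps.
    out_h, out_w = h - k + 1, w - k + 1
    img = np.zeros((h, w, c), dtype=cols.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = cols[:, i * out_w + j].reshape(k, k, c)
            img[i:i+k, j:j+k, :] += patch
    return img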
Edit:

As I understand from the TensorFlow docs, that is what's done internally with tf.nn.conv2d as well.

Nope, not really. TF on the GPU, for example, relies on cuDNN, which is a more complex beast (Winograd, PTX, ...). Only in some circumstances does it use the im2col approach, like here on the CPU and in the quantized version here.