I was trying to understand some TensorFlow basics and got stuck while reading the documentation for the max pooling 2D layer: https://www.tensorflow.org/tutorials/layers#pooling_layer_1
This is how max_pooling2d is specified:
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
where conv1 is a tensor with shape [batch_size, image_width, image_height, channels], concretely in this case [batch_size, 28, 28, 32].
So our input is a tensor with shape [batch_size, 28, 28, 32].
My understanding of a max pooling 2D layer is that it applies a filter of size pool_size (2x2 in this case) and moves the sliding window by stride (also 2x2). This means that both the width and the height of the image will be halved, i.e. we end up with 14x14 pixels per channel (32 channels in total), so the output is a tensor with shape [batch_size, 14, 14, 32].
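The shape arithmetic in this reasoning can be checked with a small sketch. It assumes no padding ('valid', which is the default for tf.layers.max_pooling2d); the helper names pooled_size and pooled_shape are hypothetical and not part of TensorFlow.

```python
def pooled_size(size, pool, stride):
    """Output length along one spatial axis, assuming no padding ('valid')."""
    return (size - pool) // stride + 1


def pooled_shape(shape, pool_size=(2, 2), strides=(2, 2)):
    """Shape after 2D pooling of a [batch, dim1, dim2, channels] tensor."""
    batch, dim1, dim2, channels = shape
    return (
        batch,
        pooled_size(dim1, pool_size[0], strides[0]),
        pooled_size(dim2, pool_size[1], strides[1]),
        channels,  # pooling works per channel, so this axis is unchanged
    )


print(pooled_shape((None, 28, 28, 32)))  # (None, 14, 14, 32)
print(pooled_shape((None, 14, 14, 64)))  # (None, 7, 7, 64)
```

Under these assumptions the channel count passes through untouched, which matches the expectation of [batch_size, 14, 14, 32].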
However, according to the link above, the shape of the output tensor is [batch_size, 14, 14, 1]:
"Our output tensor produced by max_pooling2d() (pool1) has a shape of [batch_size, 14, 14, 1]: the 2x2 filter reduces width and height by 50%."
What am I missing here?
How was 32 converted to 1?
They apply the same logic later here: https://www.tensorflow.org/tutorials/layers#convolutional_layer_2_and_pooling_layer_2 but this time it's correct, i.e. [batch_size, 14, 14, 64] becomes [batch_size, 7, 7, 64] (the number of channels stays the same).
Max pooling operation for 2D spatial data. Downsamples the input along its spatial dimensions (height and width) by taking the maximum value over an input window (of size defined by pool_size) for each channel of the input. The window is shifted by strides along each dimension.
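To make the "for each channel" part concrete, here is a minimal pure-Python sketch of the operation, assuming no padding and an image stored as nested lists indexed [row][col][channel]. The function max_pool_2d is a hypothetical illustration, not how TensorFlow implements it.

```python
def max_pool_2d(image, pool=2, stride=2):
    """Max-pool an image stored as nested lists indexed [row][col][channel]."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    out_h = (height - pool) // stride + 1
    out_w = (width - pool) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Take the maximum of each channel separately over the window.
            pixel = [
                max(
                    image[i * stride + di][j * stride + dj][c]
                    for di in range(pool)
                    for dj in range(pool)
                )
                for c in range(channels)
            ]
            row.append(pixel)
        output.append(row)
    return output


# A 4x4 image with 3 channels; each value is row * 4 + col + 100 * channel.
image = [
    [[r * 4 + col + 100 * c for c in range(3)] for col in range(4)]
    for r in range(4)
]
pooled = max_pool_2d(image)
print(len(pooled), len(pooled[0]), len(pooled[0][0]))  # 2 2 3
print(pooled[0][0])  # [5, 105, 205]
```

The spatial size drops from 4x4 to 2x2 while the three channels are preserved, because the maximum is never taken across channels.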
Max Pooling is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer.
Global max pooling = an ordinary max pooling layer with pool size equal to the size of the input (minus filter size + 1, to be precise).
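As a rough illustration of the global case, the sketch below takes one maximum per channel over all spatial positions, which is what a pooling window covering the whole feature map would produce. The function global_max_pool is hypothetical, and the image is assumed to be nested lists indexed [row][col][channel].

```python
def global_max_pool(image):
    """Reduce [row][col][channel] nested lists to one maximum per channel."""
    channels = len(image[0][0])
    return [
        max(pixel[c] for row in image for pixel in row)
        for c in range(channels)
    ]


# A 2x2 feature map with 2 channels.
image = [
    [[1, 9], [4, 2]],
    [[7, 3], [0, 8]],
]
print(global_max_pool(image))  # [7, 9]
```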
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output of a max-pooling layer is a feature map containing the most prominent features of the previous feature map.
Yes, a 2x2 max pool with strides of 2x2 halves the width and height, and the output depth is not changed. Here is my test code based on yours; the output shape is (?, 14, 14, 32), so maybe something is wrong in the tutorial?
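#!/usr/bin/env python
# Written against the TensorFlow 1.x API (tf.placeholder, tf.layers).
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# Loading MNIST is not needed for the shape check; it only prints the "Extracting" lines.
mnist = input_data.read_data_sets('./MNIST_data/', one_hot=True)
# Stand-in for the conv1 output: [batch_size, 28, 28, 32].
conv1 = tf.placeholder(tf.float32, [None, 28, 28, 32])
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
# Print the static shape of the pooled tensor.
print(pool1.get_shape())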
The output is:
Extracting ./MNIST_data/train-images-idx3-ubyte.gz
Extracting ./MNIST_data/train-labels-idx1-ubyte.gz
Extracting ./MNIST_data/t10k-images-idx3-ubyte.gz
Extracting ./MNIST_data/t10k-labels-idx1-ubyte.gz
(?, 14, 14, 32)
Nikola, the documentation has been corrected, as you suspected.
While learning about convolution and pooling, I came across this thread. Thank you for your question, which led me to the informative documentation.