Recently TensorFlow added support for 3D convolution. I'm attempting to train on some video data.
I have a few questions:
My inputs are 16-frame, 3-channels-per-frame .npy files, so their shape is (128, 171, 48).
1) The docs for tf.nn.max_pool3d() state that the shape of the input should be [batch, depth, rows, cols, channels]. Is my channels dimension still 3, even though my .npy images are 48 channels deep, so to speak?
2) The next question dovetails with the last one: is my depth 48 or 16?
3) (Since I'm here) the batch dimension is the same with 3D arrays, correct? The images are just like any other image, processed one at a time.
Just to be clear: in my case, for a batch size of one, with the image dims above, my dimensions are:
[1 (batch), 16 (depth), 171 (rows), 128 (cols), 3 (channels)]
EDIT: I've confused raw input size with pooling and kernel sizes here. Perhaps some general guidance on this 3D stuff would be helpful. I'm basically stuck on the dimensions for both convolution and pooling, as is clear in the original question.
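For reference, here is a rough sketch of how I rearrange each file into that 5-D layout. It is only a sketch: it assumes the 48-wide axis is stored frame-major (16 frames x 3 channels) and that the first two axes are cols and rows; the filename is a placeholder.

```python
import numpy as np

# Hypothetical file; each clip is stored as (128, 171, 48).
clip = np.load("clip.npy")

# Assumption: the 48 axis is 16 frames x 3 channels, frame-major.
clip = clip.reshape(128, 171, 16, 3)  # (cols, rows, depth, channels)
clip = clip.transpose(2, 1, 0, 3)     # -> (depth, rows, cols, channels) = (16, 171, 128, 3)

batch = clip[np.newaxis]              # add the batch dimension
print(batch.shape)                    # (1, 16, 171, 128, 3)
```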
To answer your question, the dimensions should be (as you stated) [batch_size, depth, H, W, 3], where depth is the number of time frames you have. For instance, a 5 s video at 20 frames/s will have depth=100.
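As a minimal sketch of what that input tensor looks like (assuming the TF 1.x-style API, and reusing the 171x128 spatial size from your example):

```python
import tensorflow as tf

# 5 s at 20 frames/s -> depth = 100 time frames.
depth = 5 * 20
video = tf.placeholder(tf.float32, shape=[None, depth, 171, 128, 3])
print(video.get_shape())  # (?, 100, 171, 128, 3)
```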
My best advice would be to first read the slides from CS231n about deep learning for videos here (if you can watch the lecture video, it's even better).
Basically, a 3D convolution is the same as a 2D convolution but with one more dimension. Let's do a recap:
- In a 1D convolution, the input has shape [batch_size, 10, in_channels] and the kernel has shape [3, in_channels, out_channels]: the kernel slides along one dimension and combines all in_channels features at each position.
- In a 2D convolution, the input has shape [batch_size, 10, 10, in_channels] and the kernel has shape [3, 3, in_channels, out_channels]: for an RGB image, in_channels=3.
- In a 3D convolution, the input has shape [batch_size, T, 10, 10, in_channels] and the kernel has shape [T_kernel, 3, 3, in_channels, out_channels]: for instance a video with T=100 frames, and images of size 10x10, with in_channels=3. The kernel also has a temporal extent T_kernel (ex: T_kernel=10).

The goal of a convolution is to reduce the number of parameters because of redundancies in the data. For images, you can extract the same basic features in the top-left 3x3 box and the bottom-right 3x3 box.
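A small shape check mirroring this recap (just a sketch, using the TF 1.x tf.nn API and the toy sizes from the list above):

```python
import tensorflow as tf

batch_size, in_channels, out_channels = 4, 3, 8

# 1D: input [batch, width, in_channels], kernel [3, in_channels, out_channels]
x1 = tf.random_normal([batch_size, 10, in_channels])
w1 = tf.random_normal([3, in_channels, out_channels])
y1 = tf.nn.conv1d(x1, w1, stride=1, padding="SAME")                 # (4, 10, 8)

# 2D: input [batch, H, W, in_channels], kernel [3, 3, in_channels, out_channels]
x2 = tf.random_normal([batch_size, 10, 10, in_channels])
w2 = tf.random_normal([3, 3, in_channels, out_channels])
y2 = tf.nn.conv2d(x2, w2, strides=[1, 1, 1, 1], padding="SAME")     # (4, 10, 10, 8)

# 3D: input [batch, T, H, W, in_channels], kernel [T_kernel, 3, 3, in_channels, out_channels]
T, T_kernel = 100, 10
x3 = tf.random_normal([batch_size, T, 10, 10, in_channels])
w3 = tf.random_normal([T_kernel, 3, 3, in_channels, out_channels])
y3 = tf.nn.conv3d(x3, w3, strides=[1, 1, 1, 1, 1], padding="SAME")  # (4, 100, 10, 10, 8)

print(y1.get_shape(), y2.get_shape(), y3.get_shape())
```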
For videos, this is the same. You can extract information from a 3x3 box of the image, but within a time window (ex: 10 frames). The result will have a receptive field of 3x3 in the image dimensions, and 10 frames in the time dimension.
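Putting this together for the shapes in the original question (again a sketch with the TF 1.x API; the 32 output channels and the 2x2x2 pooling are arbitrary example choices):

```python
import tensorflow as tf

# The question's case: 1 clip, 16 frames, 171x128 images, 3 channels.
clip = tf.placeholder(tf.float32, [1, 16, 171, 128, 3])

# Kernel with a 10-frame temporal extent and a 3x3 spatial extent.
kernel = tf.random_normal([10, 3, 3, 3, 32])
conv = tf.nn.conv3d(clip, kernel, strides=[1, 1, 1, 1, 1], padding="SAME")

# max_pool3d also takes 5-D ksize/strides: [1, depth, rows, cols, 1].
pool = tf.nn.max_pool3d(conv, ksize=[1, 2, 2, 2, 1],
                        strides=[1, 2, 2, 2, 1], padding="SAME")

print(conv.get_shape())  # (1, 16, 171, 128, 32)
print(pool.get_shape())  # (1, 8, 86, 64, 32)
```

Each output value of conv sees a 10x3x3 block of the input, which is the receptive field described above.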