 

How do you pass video features from a CNN to an LSTM?

After you pass a video frame through a convnet and get an output feature map, how do you pass that data into an LSTM? Also, how do you pass multiple frames through the CNN to the LSTM?
In other words, I want to process video frames with a CNN to get their spatial features, then pass those features to an LSTM for temporal processing. How do I connect the LSTM to the video features? For example, if the input frame is 56x56 and, after passing through all of the CNN layers, it comes out as 20 feature maps of 5x5, how are these connected to the LSTM on a frame-by-frame basis? And should they go through a fully connected layer first? Thanks, Jon

asked May 02 '16 by Jon


People also ask

How do I connect CNN to LSTM?

A CNN LSTM can be defined by adding CNN layers on the front end followed by LSTM layers with a Dense layer on the output. It is helpful to think of this architecture as defining two sub-models: the CNN Model for feature extraction and the LSTM Model for interpreting the features across time steps.
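
As a rough illustration, below is a minimal sketch of that two-sub-model arrangement in Keras (an assumed framework, not named above); layer sizes and shapes are placeholders.

    from tensorflow.keras import layers, models

    num_frames, height, width, channels = 10, 56, 56, 1

    model = models.Sequential([
        # CNN sub-model: the same convolution is applied to every frame
        layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation="relu"),
                               input_shape=(num_frames, height, width, channels)),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Flatten()),
        # LSTM sub-model: interprets the per-frame feature vectors across time
        layers.LSTM(64),
        # Dense layer on the output
        layers.Dense(10, activation="softmax"),
    ])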

What is the difference between ConvLSTM and CNN LSTM?

The ConvLSTM differs from a simple CNN + LSTM in that, in a CNN + LSTM, the convolution (CNN) is applied as a first stage and an LSTM layer is then applied sequentially as a second stage, whereas in a ConvLSTM the convolution is performed inside the recurrent cell itself.
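
For concreteness, here is a hedged sketch of the two arrangements in Keras (an assumed framework; neither is prescribed by this page), with arbitrary placeholder sizes.

    from tensorflow.keras import layers, models

    frames, h, w, c = 10, 56, 56, 1

    # CNN + LSTM: convolution is applied per frame first, then an LSTM
    # runs over the flattened per-frame features.
    cnn_lstm = models.Sequential([
        layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation="relu"),
                               input_shape=(frames, h, w, c)),
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(32),
    ])

    # ConvLSTM: the convolution happens inside the recurrent cell itself.
    conv_lstm = models.Sequential([
        layers.ConvLSTM2D(16, (3, 3), input_shape=(frames, h, w, c)),
        layers.Flatten(),
    ])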

Why is LSTM better than CNN?

An LSTM is designed to work differently from a CNN: an LSTM is usually used to process and make predictions given sequences of data, whereas a CNN is designed to exploit "spatial correlation" in data and works well on images and speech.

How does LSTM work in image processing?

This is called the CNN LSTM model, specifically designed for sequence prediction problems with spatial inputs, like images or videos. This architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on input data combined with LSTMs to perform sequence prediction on the feature vectors.
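
One common way to realize this in practice, sketched below under the assumption of Keras and an arbitrarily chosen MobileNetV2 backbone (neither is specified on this page), is to run a CNN over each frame to get a feature vector and then train an LSTM on the resulting sequence of vectors.

    import numpy as np
    from tensorflow.keras import applications, layers, models

    # Stage 1: a CNN used purely as a per-frame feature extractor.
    # (MobileNetV2 is an arbitrary choice; weights=None keeps the sketch
    # self-contained -- in practice you would load pretrained weights.)
    feature_extractor = applications.MobileNetV2(
        input_shape=(96, 96, 3), include_top=False, pooling="avg", weights=None)
    feature_dim = feature_extractor.output_shape[-1]     # 1280 for MobileNetV2

    # Dummy clip: 8 frames of 96x96 RGB, features computed frame by frame.
    clip = np.random.rand(8, 96, 96, 3).astype("float32")
    frame_features = feature_extractor.predict(clip)     # shape (8, 1280)

    # Stage 2: an LSTM that does sequence prediction on the feature vectors.
    sequence_model = models.Sequential([
        layers.LSTM(64, input_shape=(None, feature_dim)),
        layers.Dense(10, activation="softmax"),
    ])
    prediction = sequence_model(frame_features[np.newaxis, ...])  # batch of 1 clip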


1 Answer

Basically, you can flatten each frame's features and feed them into the LSTM, one time step per frame. With a CNN it's the same: you can feed each frame's CNN output into the LSTM as one time step.

Whether to add a fully connected (FC) layer first is up to you.

See the network structure in http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-180.pdf.

[Figure: network architecture from the linked paper, with per-frame CNN features fed into an LSTM]
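
A minimal sketch of this idea, assuming Keras (the answer doesn't name a framework) and using the shapes from the question: each 56x56 frame is reduced by the CNN to 20 feature maps of 5x5, flattened to 20 * 5 * 5 = 500 features, optionally passed through a fully connected layer, and fed to the LSTM as one time step per frame. The particular conv/pool layers are placeholders chosen only so the shapes work out.

    from tensorflow.keras import layers, models

    num_frames = 16  # arbitrary clip length

    # Per-frame CNN: 56x56x1 -> 5x5x20, then flattened to 500 features.
    per_frame_cnn = models.Sequential([
        layers.Conv2D(20, (5, 5), strides=2, activation="relu",
                      input_shape=(56, 56, 1)),        # -> 26x26x20
        layers.MaxPooling2D((2, 2)),                   # -> 13x13x20
        layers.Conv2D(20, (4, 4), activation="relu"),  # -> 10x10x20
        layers.MaxPooling2D((2, 2)),                   # -> 5x5x20
        layers.Flatten(),                              # -> 500 = 20 * 5 * 5
        # Optional FC layer before the LSTM ("it's up to you")
        layers.Dense(256, activation="relu"),
    ])

    # Apply the CNN to every frame, then run the LSTM over the sequence.
    model = models.Sequential([
        layers.TimeDistributed(per_frame_cnn,
                               input_shape=(num_frames, 56, 56, 1)),
        layers.LSTM(128),                              # temporal processing
        layers.Dense(10, activation="softmax"),        # e.g. 10 action classes
    ])

TimeDistributed is what applies the same per-frame CNN to every frame and hands the LSTM one feature vector per time step; dropping the Dense(256) line gives the no-FC variant.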

answered Sep 28 '22 by Sung Kim