 

How can LSTM attention have variable length input

As I understand it, the attention mechanism for an LSTM encoder-decoder is a plain softmax feed-forward network that takes in the encoder's hidden state at each time step together with the decoder's current state.

These two points seem to contradict each other, and I can't wrap my head around it: 1) the number of inputs to a feed-forward network needs to be predefined, and 2) the number of encoder hidden states is variable (it depends on the number of time steps during encoding).

Am I misunderstanding something? Also, would training be the same as for a regular encoder/decoder network, or would I have to train the attention mechanism separately?

Thanks in advance.

Asked Jun 08 '17 by Andrew Tu


People also ask

How does attention work in LSTM?

A basic LSTM can get confused between words and sometimes predicts the wrong word. Whenever this happens, the encoder step needs to search for the most relevant information; this idea is called 'attention'. It is commonly combined with a bidirectional LSTM encoder.

How can you deal with variable length input sequences?

The first and simplest way of handling variable-length input is to choose a special mask value, pad every input out to a standard length, and fill all the added entries with that mask value. Then create a Masking layer in the model, placed ahead of all downstream layers.
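As a concrete illustration, here is a minimal sketch of that pad-and-mask approach, assuming TensorFlow/Keras (no framework is named in the question); the mask value 0.0, the toy sequences, and the layer sizes are arbitrary choices.

```python
import numpy as np
import tensorflow as tf

# Three toy sequences of different lengths, each step a 1-d feature.
sequences = [[[1.0], [2.0], [3.0]],
             [[4.0], [5.0]],
             [[6.0]]]

# Pad to the longest length, filling the added steps with the mask value 0.0.
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, padding='post', dtype='float32', value=0.0)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),                # variable time dimension
    tf.keras.layers.Masking(mask_value=0.0),        # mark 0.0 steps as "skip"
    tf.keras.layers.LSTM(8),                        # LSTM ignores masked steps
    tf.keras.layers.Dense(1),
])

print(model(padded).shape)  # (3, 1): one output per (padded) sequence
```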

Is LSTM an attention model?

This paper proposes an attention-based LSTM (AT-LSTM) model for financial time series prediction. We divide the prediction process into two stages. For the first stage, we apply an attention model to assign different weights to the input features of the financial time series at each time step.

How does attention work in RNN?

Attention is a mechanism combined with an RNN that allows it to focus on certain parts of the input sequence when predicting a certain part of the output sequence, which makes learning easier and of higher quality.


1 Answer

I asked myself the same thing today and found this question. I have never implemented an attention mechanism myself, but from this paper it seems to be a bit more than just a straight softmax. For each output y_i of the decoder network, a context vector c_i is computed as a weighted sum of the encoder hidden states h_1, ..., h_T:

c_i = α_i1 h_1 + ... + α_iT h_T

The number of time steps T may be different for each sample; this is fine because the coefficients α_ij are not a vector of fixed size. In fact, they are computed as softmax(e_i1, ..., e_iT), where each e_ij is the output of a neural network whose inputs are the encoder hidden state h_j and the previous decoder hidden state s_{i-1}:

e_ij = f(s_{i-1}, h_j)

Thus, before y_i is computed, this neural network must be evaluated T times, producing the T weights α_i1, ..., α_iT. Also, this TensorFlow implementation might be useful.
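To make the "evaluated T times" point concrete, here is a small NumPy sketch of this kind of additive (Bahdanau-style) attention; the weight matrices, the tanh scorer, and the hidden size are illustrative assumptions, not the exact network from the paper. The scoring network f has a fixed input size (one decoder state plus one encoder state), so it can simply be applied to each of the T encoder states, no matter what T is.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = 4                                 # encoder/decoder hidden size (assumed)
W_s = rng.normal(size=(hidden, hidden))    # acts on the decoder state s_{i-1}
W_h = rng.normal(size=(hidden, hidden))    # acts on an encoder state h_j
v   = rng.normal(size=(hidden,))           # projects the score to a scalar

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(s_prev, H):
    """H has shape (T, hidden); T can differ from sample to sample."""
    # e_ij = f(s_{i-1}, h_j): one scalar score per encoder time step
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in H])
    alphas = softmax(scores)               # T weights α_i1, ..., α_iT summing to 1
    return alphas @ H                      # c_i = sum_j α_ij h_j

s_prev = rng.normal(size=hidden)
for T in (3, 7, 12):                       # variable-length encodings
    H = rng.normal(size=(T, hidden))
    print(T, context_vector(s_prev, H).shape)   # context is always (hidden,)
```

Note that the fixed-size pieces are the scoring network and the context vector; only the number of times the scorer is applied changes with T, which is why a feed-forward network with a predefined input size can still handle variable-length encoder outputs. Training is end-to-end: the attention weights are part of the same computation graph as the encoder and decoder, so there is no separate training step for the attention mechanism.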

Answered Sep 20 '22 by Artur Lacerda