
Does attention make sense for Autoencoders?

I am struggling with the concept of attention in the context of autoencoders. I believe I understand how attention is used in seq2seq translation: after training the combined encoder and decoder, we can use both to build (for example) a language translator. Because the decoder is still used in production, we can take advantage of the attention mechanism.

However, what if the main goal of the autoencoder is to produce a latent, compressed representation of the input vector? I am talking about cases where we can essentially dispose of the decoder part of the model after training.

For example, if I use an LSTM without attention, the "classic" approach is to use the last hidden state as the context vector; it should represent the main features of my input sequence. If I were to use an LSTM with attention, my latent representation would have to be all the hidden states, one per time step. That doesn't seem to fit the notion of input compression and of keeping only the main features; the dimensionality would likely even be significantly higher.
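To make the size difference concrete, here is a minimal PyTorch sketch of the two options (the dimensions and variable names are just illustrative assumptions on my part):

```python
import torch
import torch.nn as nn

seq_len, batch, input_dim, hidden_dim = 50, 1, 16, 64

encoder = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim)  # batch_first=False by default
x = torch.randn(seq_len, batch, input_dim)                       # one example input sequence

outputs, (h_n, c_n) = encoder(x)

# "Classic" bottleneck: keep only the final hidden state as the latent code.
latent_classic = h_n[-1]        # shape (batch, hidden_dim): 64 values per sequence

# With attention, the decoder needs every per-step hidden state.
latent_attention = outputs      # shape (seq_len, batch, hidden_dim): 50 * 64 values per sequence
```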

Additionally, if I need to use all the hidden states as my latent representation (as in the attention case), why use attention at all? I could just use all the hidden states to initialize the decoder.

asked Sep 28 '19 by user3641187


People also ask

What is the advantage and disadvantage of attentional models compared to RNNS?

The advantage of attention is its ability to identify the information in an input that is most pertinent to accomplishing a task, which increases performance, especially in natural language processing; Google Translate, for example, is a bidirectional encoder-decoder RNN with attention mechanisms. The disadvantage is the increased computation.

What is attention in encoder-decoder?

Attention is a mechanism developed to enhance the performance of the encoder-decoder architecture on neural machine translation tasks.

What is false about autoencoders?

Both statements are false: autoencoders are an unsupervised learning technique, and the output of an autoencoder is indeed very similar to its input, but not exactly the same.


1 Answer

The answer depends very much on what you aim to use the autoencoder's representation for. Every autoencoder needs something that makes the autoencoding task hard, so that it is forced to learn a rich intermediate representation to solve it. That something can be either a bottleneck in the architecture (as in the vanilla encoder-decoder model) or noise added on the source side (you can view BERT as a special case of a denoising autoencoder in which some input tokens are masked).
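As a deliberately simplified illustration of the "noise on the source side" idea, here is a sketch of BERT-style token masking; the MASK_ID value, the masking probability, and the helper name are assumptions of this sketch, not the actual BERT preprocessing code:

```python
import random

MASK_ID = 0  # assumed id reserved for a [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Corrupt a token sequence by replacing a random subset with MASK_ID,
    so the autoencoder cannot solve the task by merely copying its input."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    for i in range(len(corrupted)):
        if rng.random() < mask_prob:
            corrupted[i] = MASK_ID
    return corrupted

# The model is shown the corrupted sequence and trained to reconstruct the original.
original  = [5, 17, 42, 8, 99, 23]
corrupted = mask_tokens(original, mask_prob=0.3, seed=0)
```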

If you do not introduce any noise on the source side, an autoencoder with attention would learn to simply copy the input without learning anything beyond the identity of the input/output symbols; the attention would break the bottleneck property of the vanilla model. The same holds if you use all the encoder states as the representation.
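To spell out why attention breaks the bottleneck: with attention, the decoder can query every encoder state at every step, so nothing forces the encoder to squeeze the sequence into a single vector. Here is a toy dot-product attention step, with shapes and names of my own choosing:

```python
import torch
import torch.nn.functional as F

def attend(query, encoder_states):
    """Toy dot-product attention.
    query:          (batch, hidden)          - current decoder state
    encoder_states: (seq_len, batch, hidden) - ALL encoder hidden states
    """
    scores = torch.einsum('bh,sbh->sb', query, encoder_states)    # similarity per time step
    weights = F.softmax(scores, dim=0)                            # distribution over time steps
    context = torch.einsum('sb,sbh->bh', weights, encoder_states) # weighted sum of encoder states
    return context, weights

encoder_states = torch.randn(50, 1, 64)
query = torch.randn(1, 64)
context, weights = attend(query, encoder_states)  # the decoder can "look up" any input position
```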

There are sequence-to-sequence autoencoders (BART, MASS) that do use encoder-decoder attention. The noise they add includes masking tokens and randomly permuting them. The representations they learn are then more suitable for sequence-to-sequence tasks (such as text summarization or low-resource machine translation) than the representations from encoder-only models such as BERT.
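For completeness, here is a rough sketch of that kind of corruption; this is a simplification of my own, not the exact BART or MASS recipe:

```python
import random

def permute_and_mask(token_ids, mask_id=0, mask_prob=0.3, seed=None):
    """Shuffle the token order and mask a random contiguous span
    (a simplified stand-in for BART/MASS-style source-side noise)."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    rng.shuffle(corrupted)                                   # random permutation of tokens
    span_len = max(1, int(len(corrupted) * mask_prob))
    start = rng.randrange(len(corrupted) - span_len + 1)
    for i in range(start, start + span_len):
        corrupted[i] = mask_id                               # mask a contiguous span
    return corrupted

original  = [5, 17, 42, 8, 99, 23]
corrupted = permute_and_mask(original, seed=0)               # train to reconstruct `original`
```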

answered Sep 20 '22 by Jindřich