I am struggling with the concept of attention in the context of autoencoders. I believe I understand the use of attention in seq2seq translation: after training the combined encoder and decoder, we can use both to build (for example) a language translator. Because the decoder is still used in production, we can take advantage of the attention mechanism.
However, what if the main goal of the autoencoder is to produce a latent, compressed representation of the input vector? I am talking about cases where we can essentially discard the decoder after training.
For example, if I use an LSTM without attention, the "classic" approach is to use the last hidden state as the context vector: it should represent the main features of my input sequence. If I were to use an LSTM with attention, my latent representation would have to be all the hidden states, one per time step. This doesn't seem to fit the notion of compressing the input and keeping the main features; it's likely that the dimensionality would be significantly higher.
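To make the dimensionality point concrete, here is a minimal PyTorch sketch (the sizes are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

batch, seq_len, input_dim, hidden_dim = 2, 50, 32, 64

encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
x = torch.randn(batch, seq_len, input_dim)

all_states, (h_n, _) = encoder(x)

# Without attention: the latent code is the final hidden state,
# a single vector per sequence.
latent_no_attn = h_n[-1]       # shape (batch, hidden_dim) = (2, 64)

# With attention: the decoder attends over every time step, so the
# "latent" is the full sequence of hidden states and grows with seq_len.
latent_attn = all_states       # shape (batch, seq_len, hidden_dim) = (2, 50, 64)
```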
Additionally, if I need to use all the hidden states as my latent representation (as in the attention case), why use attention at all? I could just use all the hidden states to initialize the decoder.
The advantage of attention is its ability to identify the information in an input most pertinent to accomplishing a task, which increases performance, especially in natural language processing; Google Translate is a bidirectional encoder-decoder RNN with attention mechanisms. The disadvantage is the increased computation.
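For reference, here is a minimal sketch of dot-product (Luong-style) attention in PyTorch; the names and sizes are illustrative, not taken from any particular implementation. The scoring and weighted sum run over all encoder states at every decoding step, which is where the extra computation comes from:

```python
import torch
import torch.nn.functional as F

batch, seq_len, hidden_dim = 2, 50, 64
encoder_states = torch.randn(batch, seq_len, hidden_dim)  # all encoder hidden states
decoder_state = torch.randn(batch, hidden_dim)            # current decoder hidden state

# One score per encoder position, recomputed at every decoding step.
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, seq_len)
weights = F.softmax(scores, dim=1)

# Context vector: attention-weighted sum of all encoder states.
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (batch, hidden_dim)
```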
Attention is a powerful mechanism developed to enhance the performance of the encoder-decoder architecture on neural machine translation tasks.
The answer depends very much on what you want to use the representation from the autoencoder for. Every autoencoder needs something that makes the autoencoding task hard, so that it is forced to learn a rich intermediate representation. That something can be either a bottleneck in the architecture (as in the vanilla encoder-decoder model) or noise added on the source side (you can view BERT as a special case of a denoising autoencoder in which some input tokens are masked).
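As an illustration of the source-side-noise option, here is a hedged sketch of BERT-style token masking; the 15% rate and the use of -100 as an ignore label follow common PyTorch conventions, and `mask_tokens` is a hypothetical helper, not a library function:

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Replace a random fraction of token ids with [MASK]; the model
    must reconstruct the originals, so it cannot just copy the input."""
    mask = torch.rand(input_ids.shape) < mask_prob
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id
    labels = input_ids.clone()
    labels[~mask] = -100  # CrossEntropyLoss ignores these positions by default
    return corrupted, labels

ids = torch.randint(5, 1000, (2, 10))                 # fake token ids
corrupted, labels = mask_tokens(ids, mask_token_id=103)
```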
If you do not introduce any noise on the source side, an autoencoder with attention would learn to simply copy the input without learning anything beyond the identity of the input/output symbols, because the attention breaks the bottleneck property of the vanilla model. The same also holds for the case of labeling the encoder states.
There are sequence-to-sequence autoencoders (BART, MASS) that use encoder-decoder attention. The source-side noise they introduce includes masking and randomly permuting tokens. The representations they learn are then more suitable for sequence-to-sequence tasks (such as text summarization or low-resource machine translation) than representations from encoder-only models such as BERT.
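A rough sketch of that kind of seq2seq corruption, in the same hypothetical style as above (the actual noise recipes in the BART and MASS papers differ in the details):

```python
import random

MASK = "<mask>"

def corrupt(tokens, mask_prob=0.3, swap_prob=0.1):
    """Mask some tokens and swap some adjacent pairs; the decoder is
    trained to regenerate the original, uncorrupted sequence."""
    noisy = [MASK if random.random() < mask_prob else t for t in tokens]
    i = 0
    while i < len(noisy) - 1:
        if random.random() < swap_prob:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return noisy

print(corrupt("the quick brown fox jumps over the lazy dog".split()))
```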