How Transformer is Bidirectional - Machine Learning

Tags:

machine-learning

I am coming from Google BERT context (Bidirectional Encoder representations from Transformers). I have gone through architecture and codes. People say this is bidirectional by nature. To make it unidirectional attention some mask is to be applied.

Basically a transformer takes key, values and queries as input; uses encoder decoder architecture; and applies attention to these keys, queries and values. What I understood is we need to pass tokens explicitly rather than transformer understanding this by nature.

Can someone please explain what makes transformer is bidirectional by nature

506

asked Mar 14 '19 09:03

Sandeep Bhutani

2 Answers

Bidirectional is actually a carry-over term from RNN/LSTM. Transformer is much more than that.

Transformer and BERT can directly access all positions in the sequence, equivalent to having full random access memory of the sequence during encoding/decoding.

Classic RNN has only access to the hidden state and last token, e.g. encoding of word3 = f(hidden_state, word2), so it has to compress all previous words into a hidden state vector, theoretically possible but a severe limitation in practice. Bidirectional RNN/LSTM is slightly better. Memory networks is another way to work around this. Attention is yet another way to improve LSTM seq2seq models. The insight for Transformer is that you want full memory access and don't need the RNN at all!

Another piece of history: an important ingredient that let us deal with sequence structure without using RNN is positional encoding, which comes from CNN seq2seq model. It would not have been possible without this. Turns out, you don't need the CNN either, as CNN doesn't have full random access, but each convolution filter can only look at a number of neighboring words at a time.

Hence, Transformer is more like FFN, where encoding of word1 = f1(word1, word2, word3), and encoding of word3 = f2(word1, word2, word3). All positions available all the time.

You might also appreciate the beauty which is that the authors made it possible to compute attention for all positions in parallel, through the use of Q, K, V matrices. It's quite magical!

But understand this, you'll also appreciate the limitations of Transformer, that it requires O(N^2 * d) computation where N is the sequence length, precisely because we're doing N*N attention of all words with all other words. RNN, on the other hand, is linear in the sequence length and requires O(N * d^2) computation. d is the dimension of model hidden state.

Transformer just won't write a novel anytime soon!

106

answered Oct 19 '22 19:10

Hai-Anh Trinh

On the following picture you will see in a really clear way why BERT is Bidirectional.
enter image description here

This is crucial since this forces the model to use information from the entire sentence simultaneously – regardless of the position – to make a good predictions. BERT has been a clear break through allowed by the use of the notorious "attention is all you need" paper and architecture.

This Bidirectional idea (masked) is different from classic LSTM cells which till now used the forward or the backward method or both but not at the same time.

Edit:

this is done by the transformer. The attention is all you need paper is presenting an encoder-decoder system implementing a sequence to sequence framework. BERT is using this Transformer (sequence to sequence Bidirectional network) to do other NLP task. And this has been done by using a masked approach.

The important thing is: BERT uses Attention but Attention has been done for a translation and as such do not care for Bidirectional. But remove a word and you have Bidirectional.

So why BERT now?

well the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. Meaning that this model allows a sentence Embedding far more effective than before. In fact, RNN based architectures are hard to parallelize and can have difficulty learning long-range dependencies within the input and output sequences. SO break through in architecture AND the use of this idea to train a network by masking a word (or more) leads to BERT.

Edit of Edit: forget about the scale product, it's the inside the Attention which is inside A multi head attention itself inside the Transformer: you are looking to deep. The transformer is using the entire sequence every time to find the other sequence (In case of BERT it's the missing 0.15 percentage of the sentence) and that's it. The use of BERT as a language model is realy a transfer learning (see this) As stated in your post, unidirectional can be done with a certain type of mask, bidirec is better. And it is used because the go from a full sentence to a full sentence but not the way classic Seq2seq is made (with LSTM and RNN) and as such can be used for LM.

answered Oct 19 '22 17:10

Frayal

Related questions
                            
                                What's the difference between optimizer.compute_gradient() and tf.gradients() in tensorflow?
                            
                                CNN that generate a new image from input image
                            
                                Logistic Regression: How to find top three feature that have highest weights?
                            
                                Tensorflow failed to create a newwriteablefile when retraining inception
                            
                                How to determine an overfitted model based on loss precision and recall
                            
                                XGBoost - n_estimators = 1 equal to single-tree classifier?
                            
                                fastai learner requirements and batch prediction
                            
                                Keras: Making a neural network to find a number's modulus
                            
                                Keras Multitask learning with two different input sample size
                            
                                How to get the nearest neighbor in weka using java
                            
                                The relationship between latent Dirichlet allocation and documents clustering
                            
                                Can TF/IDF take classes in account
                            
                                Defining a gradient with respect to a subtensor in Theano
                            
                                Numpy linear regression with regularization
                            
                                How to interpret `scipy.stats.kstest` and `ks_2samp` to evaluate `fit` of data to a distribution?
                            
                                Are modern CNN (convolutional neural network) as DetectNet rotate invariant?
                            
                                Getting 'ValueError: shapes not aligned' on SciKit Linear Regression
                            
                                Tensorflow estimator: average_loss vs loss
                            
                                Trainable sklearn StandardScaler for R
                            
                                is it possible to implement dynamic class weights in keras?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With