
Computational Complexity of Self-Attention in the Transformer Model

I recently went through the Transformer paper from Google Research describing how self-attention layers could completely replace traditional RNN-based sequence encoding layers for machine translation. In Table 1 of the paper, the authors compare the computational complexities of different sequence encoding layers, and state (later on) that self-attention layers are faster than RNN layers when the sequence length n is smaller than the dimensionality of the vector representations d.

However, if my understanding of the computations is correct, the self-attention layer seems to have a worse complexity than claimed. Let X be the input to a self-attention layer. Then X will have shape (n, d), since there are n word-vectors (corresponding to rows), each of dimension d. Computing the output of self-attention requires the following steps (consider single-headed self-attention for simplicity):

  1. Linearly transforming the rows of X to compute the query Q, key K, and value V matrices, each of which has shape (n, d). This is accomplished by post-multiplying X with 3 learned matrices of shape (d, d), amounting to a computational complexity of O(n d^2).
  2. Computing the layer output, specified in Equation 1 of the paper as softmax(Q K^T / sqrt(d)) V, where the softmax is computed over each row. Computing Q K^T has complexity O(n^2 d), and post-multiplying the result with V has complexity O(n^2 d) as well.

Therefore, the total complexity of the layer is O(n^2 d + n d^2), which is worse than that of a traditional RNN layer. I obtained the same result for multi-headed attention after accounting for the appropriate intermediate representation dimensionalities (d_k, d_v) and multiplying by the number of heads h.
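For concreteness, here is a minimal NumPy sketch of single-headed self-attention following the two steps above (the dimensions, random input and random "learned" weights are placeholders, purely illustrative); the comments note where each complexity term comes from:

    import numpy as np

    n, d = 100, 512                       # example sequence length and model dimension
    X = np.random.randn(n, d)             # input: n word-vectors of dimension d

    # placeholder "learned" projection matrices of shape (d, d)
    W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

    # Step 1: query/key/value projections, three (n, d) @ (d, d) products -> O(n d^2)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Step 2: scores Q K^T / sqrt(d), an (n, d) @ (d, n) product -> O(n^2 d)
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax, O(n^2)

    # weighted sum of values, an (n, n) @ (n, d) product -> O(n^2 d) again
    output = weights @ V                  # overall: O(n^2 d + n d^2)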

Why have the authors ignored the cost of computing the Query, Key, and Value matrices while reporting total computational complexity?

I understand that the proposed layer is fully parallelizable across the n positions, but I believe that Table 1 does not take this into account anyway.

Newton asked Jan 13 '21


People also ask

What is the complexity of the self attention layer?

The complexity of the initial convolution is O(n × d^2) and the complexity of the self-attention layers becomes O((n/k)^2 × d), where k is the kernel size of the convolution layer. Hence the overall complexity becomes O(n × d^2 + (n/k)^2 × d).

What is the time complexity of Lstm?

On the other hand, LSTM is local in space and time [23], which means that the input length does not affect the storage requirements of the network and for each time step, the time complexity per weight is O(1).

What is difference between attention and self attention?

The attention mechanism allows the output to focus on the input while the output is being produced, whereas the self-attention model allows the inputs to interact with each other (i.e. attention is computed between each input and all the other inputs).

What is Self attention model?

Self-attention, also called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence. It has been shown to be very useful in machine reading, abstractive summarization, and image description generation.


2 Answers

First, you are correct in your complexity calculations. So, what is the source of confusion?

When attention was first introduced in the original attention paper, it did not require calculating the Q, K and V matrices, as the values were taken directly from the hidden states of the RNNs, and thus the complexity of the attention layer is O(n^2·d).
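As a rough illustrative sketch of that earlier setup (not the exact additive attention of the original paper), the RNN hidden states H stand in directly for the queries, keys and values, so no O(n·d^2) projection term appears:

    import numpy as np

    n, d = 100, 512
    H = np.random.randn(n, d)             # RNN hidden states used directly, no learned Q/K/V projections

    scores = H @ H.T / np.sqrt(d)         # (n, n) score matrix -> O(n^2 d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax

    context = weights @ H                 # (n, d) context vectors -> O(n^2 d); total stays O(n^2 d)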

Now, to understand what Table 1 contains, keep in mind how most people scan papers: they read the title and abstract, then look at the figures and tables, and only if the results look interesting do they read the paper more thoroughly. The main idea of the Attention Is All You Need paper was to replace the RNN layers in the seq2seq setting completely with an attention mechanism, because RNNs were really slow to train. If you look at Table 1 in this context, you see that it compares RNN, CNN and attention layers and highlights the motivation for the paper: using attention should be beneficial over RNNs and CNNs in three aspects: a constant number of sequential calculation steps, a constant number of operations to relate any two positions, and lower computational complexity in the usual Google setting, where n ~= 100 and d ~= 1000.

But as with any idea, it hit the hard wall of reality. For that great idea to work in practice, they had to add positional encoding, reformulate the attention and add multiple heads to it. The result is the Transformer architecture, which, while it has computational complexity O(n^2·d + n·d^2), is still much faster than an RNN (in terms of wall-clock time) and produces better results.

So the answer to your question is that the attention layer the authors refer to in Table 1 is strictly the attention mechanism; it is not the complexity of the whole Transformer. They are very well aware of the complexity of their model (I quote):

Separable convolutions [6], however, decrease the complexity considerably, to O(k·n·d + n·d^2). Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
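For reference, substituting k = n into that expression gives O(n·n·d + n·d^2) = O(n^2·d + n·d^2), which asymptotically matches the total you derived in the question for self-attention plus the projection (point-wise) term.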

igrinis answered Sep 19 '22


Strictly speaking, when considering the complexity of only the self-attention block (Fig. 2 left, Equation 1), the projection of x to q, k and v is not included in the self-attention. The complexities shown in Table 1 are only for the very core of the self-attention layer and are therefore O(n^2 d).

Shai answered Sep 20 '22