I have two questions about how to use the TensorFlow implementation of Transformers for text classification.
Thank you!
The Transformer model performs quite well on text classification: in our experience, it produces the desired results for most predictions.
For tasks in which the text classes are relatively few, the best-performing text classification systems use pretrained Transformer models such as BERT, XLNet, and RoBERTa. Keep in mind, however, that Transformer-based models scale quadratically with the input sequence length (because of self-attention) and linearly with the number of classes (in the classification head).
Transformers can be used for classification tasks. I found a good tutorial where they used a BERT Transformer as the encoder and a convolutional neural network on top for sentiment analysis.
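As a rough sketch of that kind of setup (not the tutorial's own code), the example below uses the Hugging Face transformers package with TensorFlow 2.x: a frozen BERT encoder produces token embeddings, and a small Conv1D head is trained on top for a two-class sentiment task. The checkpoint name, sequence length, and number of classes are assumptions made for illustration.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

# Assumed checkpoint and sequence length, chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = TFAutoModel.from_pretrained("bert-base-uncased")
encoder.trainable = False  # keep BERT frozen; only the CNN head is trained

MAX_LEN = 128
input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Token-level hidden states from BERT: (batch, seq_len, hidden_size)
hidden_states = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

# Convolutional head over the token dimension, then a softmax classifier
x = tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu")(hidden_states)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)  # e.g. negative / positive

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```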
There are two approaches you can take:

1. Average the hidden states you get from the encoder and feed that vector to your classifier.
2. Prepend a special token, [CLS] (or whatever you like to call it), and use the hidden state of that special token as input to your classifier.

The second approach is used by BERT. During pre-training, the hidden state corresponding to this special token is used to predict whether two sentences are consecutive. In downstream tasks, it is also used for sentence classification. However, my experience is that sometimes averaging the hidden states gives a better result.
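To make both options concrete, here is a minimal sketch of my own (assuming TensorFlow 2.x, the transformers package, and a bert-base-uncased checkpoint): it computes a padding-aware average of the hidden states and, separately, the hidden state of the [CLS] token; either vector can then be passed to a classifier head.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

# Assumed checkpoint, used only for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = TFAutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["a great movie", "a terrible movie"],
                  padding=True, return_tensors="tf")
hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, hidden_size)

# Approach 1: average the hidden states, ignoring padding tokens
mask = tf.cast(batch["attention_mask"][..., tf.newaxis], hidden.dtype)
mean_pooled = tf.reduce_sum(hidden * mask, axis=1) / tf.reduce_sum(mask, axis=1)

# Approach 2: take the hidden state of the first ([CLS]) token
cls_pooled = hidden[:, 0, :]

# Either pooled vector can feed a classification head, for example:
logits = tf.keras.layers.Dense(2)(cls_pooled)  # (batch, num_classes)
```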
Instead of training a Transformer model from scratch, it is probably more convenient to use (and, if needed, fine-tune) a pre-trained model (BERT, XLNet, DistilBERT, ...) from the transformers package, which provides pre-trained models ready to use in PyTorch and TensorFlow 2.0.
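As a minimal fine-tuning sketch with the TensorFlow classes of that package (the distilbert-base-uncased checkpoint, the two-example dataset, and the hyperparameters are placeholders, not a recommended configuration):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Placeholder checkpoint and toy data, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

texts = ["this is great", "this is awful"]
labels = [1, 0]

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(dataset, epochs=1)
```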