 

BERT - Pooled output is different from first vector of sequence output

I am using BERT in TensorFlow and there is one detail I don't quite understand. According to the documentation (https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1), the pooled output is a representation of the entire sequence. Based on the original paper, it seems this is the output for the token "[CLS]" at the beginning of the sentence.

pooled_output[0]

However, when I look at the output corresponding to the first token in the sentence

sequence_output[0,0,:]

which I believe corresponds to the token "[CLS]" (the first token in the sentence), the two results are different.

asked Apr 20 '20 by TDo


People also ask

What is sequence output and pooled output?

Pooled output is the embedding of the [CLS] token (from Sequence output), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
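The relationship described above can be sketched in plain NumPy. The weights below are random stand-ins for the trained pooler parameters; the point is only to show that a Linear layer plus Tanh produces a vector of the same size but with different values than the [CLS] embedding it was derived from:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768

# Stand-in for the last-layer hidden states of a 5-token sequence
# (in a real model this would be BERT's sequence output).
sequence_output = rng.standard_normal((5, hidden_size))
cls_embedding = sequence_output[0]  # hidden state of the [CLS] token

# The pooler: a Linear layer followed by Tanh. The weights here are
# random placeholders; in BERT they are trained on the next sentence
# prediction objective.
W = rng.standard_normal((hidden_size, hidden_size)) * 0.02
b = np.zeros(hidden_size)
pooled_output = np.tanh(cls_embedding @ W + b)

# Same size, different values -- exactly what the question observed.
print(pooled_output.shape == cls_embedding.shape)  # True
print(np.allclose(pooled_output, cls_embedding))   # False
```

Note also that every component of the pooled output lies in (-1, 1) because of the Tanh, whereas the raw [CLS] hidden state is unbounded.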

What are the outputs of BERT model?

The output of the BERT is the hidden state vector of pre-defined hidden size corresponding to each token in the input sequence. These hidden states from the last layer of the BERT are then used for various NLP tasks.

What is pooling in BERT?

That's the embedding of the initial CLS token. It's "pooled" from all input tokens in the sense that the multiple attention layers will force it to depend on all other tokens.



4 Answers

The intentions of pooled_output and sequence_output are different. Since the embeddings from BERT's output layer are contextual embeddings, the output of the 1st token, i.e. the [CLS] token, captures sufficient context about the whole sequence. Hence, the authors of the BERT paper found it sufficient to use only the output from the 1st token for tasks such as classification. They call this output from the single (1st) token the pooled_output.

The source code of the TF Hub module is not available, but it is reasonable to assume that TF Hub uses the same implementation as the open-sourced code by the authors of BERT (https://github.com/google-research/bert/). In the modeling.py script (https://github.com/google-research/bert/blob/bee6030e31e42a9394ac567da170a89a98d2062f/modeling.py), the pooled_output (returned by the get_pooled_output() function) is the hidden state of the 1st token passed through an additional dense layer with a tanh activation.

answered Oct 17 '22 by Ashwin Geet D'Sa


As mentioned in the Huggingface documentation for the outputs of BertModel, the pooler output is:

Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function.

Because it is "further processed by a Linear layer and a Tanh activation function", the first vector of the sequence output (the [CLS] token) and the pooled output do not have the same values, although both vectors have the same size.

answered Oct 17 '22 by Masoud Gheisari


I encountered a similar problem when I was using BertModel from the transformers library, and I figure your question may be the same. Here's what I found:

The outputs of BertModel contain a sequence_output (normally of shape [batch_size, max_sequence_length, 768]), which is the last hidden layer of BERT. They also contain a pooled_output (normally of shape [batch_size, 768]), which is the output of an additional "pooler" layer. The pooler layer takes sequence_output[:, 0] (the first token, i.e. the [CLS] token) and passes it through a dense layer followed by a Tanh activation.
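The shapes and the pooler step described above can be mirrored in a few lines of NumPy. The weights are random placeholders standing in for the trained pooler parameters; the sketch only illustrates how the [batch_size, max_sequence_length, 768] sequence output is reduced to a [batch_size, 768] pooled output:

```python
import numpy as np

rng = np.random.default_rng(1)
batch_size, max_sequence_length, hidden_size = 2, 10, 768

# Stand-in for BertModel's sequence_output (last hidden layer).
sequence_output = rng.standard_normal(
    (batch_size, max_sequence_length, hidden_size))

# Pooler: take the first token ([CLS]) of every sequence,
# then apply a dense layer and Tanh (weights are placeholders here).
first_token = sequence_output[:, 0]            # (batch_size, hidden_size)
W = rng.standard_normal((hidden_size, hidden_size)) * 0.02
b = np.zeros(hidden_size)
pooled_output = np.tanh(first_token @ W + b)   # (batch_size, hidden_size)

print(sequence_output.shape)  # (2, 10, 768)
print(pooled_output.shape)    # (2, 768)
```

This makes it clear why pooled_output has one fewer dimension than sequence_output: the sequence axis is collapsed by selecting only the [CLS] position before the dense layer is applied.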

That's where pooled_output got its name and why it differs from the [CLS] token's raw hidden state, but both are meant to serve the same purpose.

answered Oct 17 '22 by Ley


pooled_output[0] != sequence_output[0,0,:]

sequence_output: simply the array of last-layer hidden representations of each token, of shape (batch_size, seq_len, hidden_size).

pooler_output: the representation/embedding of the [CLS] token passed through some more layers (BertPooler: a linear/dense layer and a Tanh activation). It is recommended to use this pooler_output, as it contains contextualized information about the whole sequence.

answered Oct 17 '22 by Pranav Kushare