Hi everyone! I was reading about BERT and wanted to do text classification with its word embeddings. I came across this line of code:
pooled_output, sequence_output = self.bert_layer([input_word_ids, input_mask, segment_ids])
and then:
clf_output = sequence_output[:, 0, :]
out = Dense(1, activation='sigmoid')(clf_output)
But I can't understand the use of the pooled output. Doesn't the sequence output contain all the information, including the embedding of the [CLS] token? If so, why do we have a pooled output at all?
Thanks in advance!
The sequence output is the sequence of hidden states (embeddings) at the output of the last layer of the BERT model, and it includes the embedding of the [CLS] token. For the sentence "You are on Stackoverflow", it gives one embedding per token: one for each of the four words (assuming "Stackoverflow" is tokenized into a single token), plus one for the [CLS] token, plus one for the [SEP] token that is appended at the end, so six vectors in total.

The pooled output is the embedding of the [CLS] token (taken from the sequence output), further processed by a Linear layer and a Tanh activation function. The Linear layer's weights are trained on the next sentence prediction (classification) objective during pretraining. For further details, please refer to the original BERT paper.
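If it helps, here is a minimal sketch of the relationship between the two outputs, using the Hugging Face transformers library (my assumption here; your snippet uses a TF Hub layer, but the two outputs are analogous). It recomputes the pooled output from the [CLS] vector in the sequence output:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")  # loaded in eval mode by default

inputs = tokenizer("You are on Stackoverflow", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

sequence_output = outputs.last_hidden_state  # shape (1, seq_len, 768): one vector per token, incl. [CLS] and [SEP]
pooled_output = outputs.pooler_output        # shape (1, 768): the processed [CLS] embedding

cls_embedding = sequence_output[:, 0, :]     # the raw [CLS] vector, what sequence_output[:, 0, :] picks out

# pooled output = Tanh(Linear([CLS] embedding)), i.e. the model's pooler applied to [CLS]
recomputed = torch.tanh(model.pooler.dense(cls_embedding))
print(torch.allclose(recomputed, pooled_output, atol=1e-5))  # True

So for your classifier you can feed either the raw [CLS] vector (as in your snippet) or the pooled output into the final Dense layer; the pooled output is simply the [CLS] vector after that extra pretrained Linear + Tanh step.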