BERT sentence embeddings from transformers

I'm trying to get sentence vectors from hidden states in a BERT model. Looking at the huggingface BertModel instructions here, which say:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt') 
output = model(**encoded_input)

So first, note that as it appears on the website, this does /not/ run. You get:

>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'BertTokenizer' object is not callable

But it looks like a minor change fixes it, in that you don't call the tokenizer directly, but ask it to encode the input:

encoded_input = tokenizer.encode(text, return_tensors="pt")
output = model(encoded_input)
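
Side note: a minimal sketch of the difference between the two calls, assuming a newer transformers version where the tokenizer is callable. tokenizer.encode returns only the input IDs, while calling the tokenizer returns a dict that also carries the attention mask:

# assumes a transformers version where tokenizers are callable (v3+)
encoded = tokenizer(text, return_tensors="pt")
print(encoded.keys())   # input_ids, token_type_ids, attention_mask
output = model(**encoded)

ids_only = tokenizer.encode(text, return_tensors="pt")  # just the input_ids tensor
output = model(ids_only)  # also works, but without an explicit attention mask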

OK, that aside, the tensor I get has a different shape than I expected:

>>> output[0].shape
torch.Size([1, 11, 768])

This is a lot of layers. Which is the correct layer to use for sentence embeddings? [0]? [-1]? Averaging several? I have the goal of being able to do cosine similarity with these, so I need a proper 1xN vector rather than an NxK tensor.

I see that the popular bert-as-a-service project appears to use [0].

Is this correct? Is there documentation for what each of the layers are?

asked Aug 18 '20 by Mittenchops


People also ask

How does BERT generate sentence embeddings?

Sentence-BERT uses a Siamese-network-like architecture that takes two sentences as input. The two sentences are passed through BERT models and a pooling layer to generate their embeddings. The embeddings for the pair of sentences are then used to calculate the cosine similarity.
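
For illustration, a minimal sketch of that last step, computing cosine similarity between two pooled sentence vectors (the embeddings here are random stand-ins for real ones):

import torch
import torch.nn.functional as F

# emb_a and emb_b stand in for two pooled sentence embeddings of shape [1, 768]
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

similarity = F.cosine_similarity(emb_a, emb_b, dim=1)  # tensor of shape [1], values in [-1, 1]
print(similarity.item())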

What is sentence transformer BERT?

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in our paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. You can use this framework to compute sentence / text embeddings for more than 100 languages.
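
A minimal sketch of using that framework (requires pip install sentence-transformers; the model name below is just one example of a pretrained multilingual model):

from sentence_transformers import SentenceTransformer

# example model name; any pretrained sentence-transformers model works here
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = ["Replace me by any text you'd like.",
             "Ersetze mich durch einen beliebigen Text."]
embeddings = model.encode(sentences)   # numpy array of shape (2, embedding_dim)
print(embeddings.shape)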

Is BERT a word embedding or sentence embedding?

The BERT base model uses 12 layers of transformer encoders, and the per-token output of each of these layers can be used as a word embedding.
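
A rough sketch of what that looks like with the Hugging Face API, assuming a recent transformers version (the callable tokenizer and the hidden_states attribute may differ in older releases):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

encoded = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

hidden_states = output.hidden_states   # embedding layer + 12 encoder layers = 13 tensors
print(len(hidden_states))              # 13
print(hidden_states[-1].shape)         # [1, num_tokens, 768]: one 768-dim vector per token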


2 Answers

While Jindřich's existing answer is generally correct, it does not address the question entirely. The OP asked which layer he should use to calculate the cosine similarity between sentence embeddings, and the short answer to this question is: none. A metric like cosine similarity requires that the dimensions of the vector contribute equally and meaningfully, but this is not the case for the BERT weights released by the original authors. Jacob Devlin (one of the authors of the BERT paper) wrote:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).

However, that does not mean you cannot use BERT for such a task. It just means that you cannot use the pre-trained weights out-of-the-box. You can either train a classifier on top of BERT that learns which sentences are similar (using the [CLS] token), or you can use sentence-transformers, which can be used in an unsupervised scenario because they were trained to produce meaningful sentence representations.
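
As a rough sketch of the first option, here is a pair classifier on top of the pooled [CLS] representation using the stock Hugging Face classes, assuming a recent transformers version; it still has to be fine-tuned on labeled sentence pairs before the logits mean anything:

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# the classification head sits on top of the pooled [CLS] representation
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=2)

# encode the two sentences as a single pair; the tokenizer inserts [SEP] between them
inputs = tokenizer("The cat sits on the mat.", "A cat is sitting on a mat.", return_tensors="pt")
logits = model(**inputs).logits   # shape [1, 2], e.g. similar vs. not similar after fine-tuning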

answered Oct 22 '22 by cronoik


I don't think there is a single authoritative reference saying what to use and when. You need to experiment and measure what works best for your task. Recent observations about BERT are nicely summarized in this paper: https://arxiv.org/pdf/2002.12327.pdf.

I think the rule of thumb is:

  • Use the last layer if you are going to fine-tune the model for your specific task. And fine-tune whenever you can: several hundred or even dozens of training examples are enough.

  • Use some of the middle layers (the 7th or 8th) if you cannot fine-tune the model. The intuition is that the layers first develop a more and more abstract and general representation of the input; at some point, the representation starts to become more targeted to the pre-training task.

Bert-as-a-service uses the last layer by default (but it is configurable). Here, it would be [:, -1]. However, it always returns a list of vectors for all input tokens. The vector corresponding to the first special token (the so-called [CLS] token) is considered to be the sentence embedding. This is where the [0] comes from in the snippet you refer to.
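
To make the indexing concrete, here is a small sketch of those choices, assuming a recent transformers version with output_hidden_states=True (keeping in mind the caveat from the other answer that cosine similarity on these raw vectors may not be meaningful):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

encoded = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
with torch.no_grad():
    out = model(**encoded)

hidden = out.hidden_states           # 13 tensors, each of shape [1, num_tokens, 768]

cls_last   = hidden[-1][:, 0, :]     # [CLS] vector from the last layer  -> [1, 768]
cls_middle = hidden[8][:, 0, :]      # [CLS] vector from a middle layer (8th encoder layer)
mean_last  = hidden[-1].mean(dim=1)  # mean over all tokens of the last layer -> [1, 768]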

answered Oct 22 '22 by Jindřich