NLP Transformers: Best way to get a fixed sentence embedding-vector shape?

I'm loading a language model from torch hub (CamemBERT, a French RoBERTa-based model) and using it to embed some French sentences:

import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval()  # disable dropout (or leave in train mode to finetune)


def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layers' features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    embeddings = all_layers[0]
    return embeddings

# Here we see that the shape of the embedding vector depends on the number of tokens in the sentence

u = embed(sentence="Bonjour, ça va ?")
u.shape # torch.Size([1, 7, 768])
v = embed(sentence="Salut, comment vas-tu ?")
v.shape # torch.Size([1, 9, 768])

Now imagine that, in order to do some semantic search, I want to calculate the cosine similarity between the vectors (tensors in our case) u and v:

cos = torch.nn.CosineSimilarity(dim=1)
cos(u, v) # will throw an error since the shape of `u` is different from the shape of `v`

My question is: what is the best method to always get the same embedding shape for a sentence, regardless of its token count?

=> The first solution I'm thinking of is calculating the mean over axis=1 (the embedding of a sentence is the mean embedding of its tokens), since axis=0 and axis=2 always have the same size:

cos = torch.nn.CosineSimilarity(dim=1)
cos(u.mean(axis=1), v.mean(axis=1)) # works now and gives 0.7269

But I'm afraid that I'm hurting the sentence embedding when calculating the mean, since it gives the same weight to every token (maybe weighting the tokens by TF-IDF would help?).
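For illustration, a minimal sketch of such a weighted mean (not from the original post; token_weights is a hypothetical list of per-token weights, e.g. TF-IDF scores you compute yourself):

import torch

def weighted_mean(embeddings, token_weights):
    # embeddings: (1, n_tokens, 768); token_weights: one weight per token
    w = torch.tensor(token_weights, dtype=embeddings.dtype)
    w = w / w.sum()                                     # normalise so the weights sum to 1
    return (embeddings * w.view(1, -1, 1)).sum(dim=1)   # (1, 768)

cos = torch.nn.CosineSimilarity(dim=1)
# hypothetical usage: u_weights and v_weights are TF-IDF scores for the tokens of u and v
# cos(weighted_mean(u, u_weights), weighted_mean(v, v_weights))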

=> The second solution is to pad the shorter sentences (see the sketch just after this list). That means:

  • embedding a list of sentences at a time (instead of embedding sentence by sentence)
  • finding the sentence with the most tokens, embedding it, and recording its shape S
  • embedding the remaining sentences and zero-padding them to the same shape S (the missing token positions are filled with zeros)
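A minimal sketch of this padding idea (not from the original post), reusing the embed function above:

import torch

def embed_batch(sentences):
    # embed each sentence, then zero-pad every tensor to the longest token length
    embeddings = [embed(s) for s in sentences]        # each: (1, n_tokens, 768)
    max_len = max(e.shape[1] for e in embeddings)
    padded = [torch.nn.functional.pad(e, (0, 0, 0, max_len - e.shape[1]))
              for e in embeddings]
    return torch.cat(padded, dim=0)                   # (batch, max_len, 768)

Note that the padded tensors still have a token dimension, so you would typically flatten or pool them before computing a cosine similarity.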

What are your thoughts? What other techniques would you use and why?

Thanks in advance!

Asked Nov 25 '19 by hzitoun

2 Answers

This is quite a general question, as there is no one specific right answer.

As you found out, the shapes differ because you get one output vector per token (depending on the tokenizer, those can be subword units); in other words, every token is encoded into its own vector. What you want is a sentence embedding, and there are a number of ways to get one (with no single right answer).

Particularly for sentence classification, we'd often use the output of the special classification token when the language model has been trained on it (CamemBERT uses <s>). Note that, depending on the model, this can be the first token (mostly BERT and its children, including CamemBERT) or the last token (CTRL, GPT2, OpenAI GPT, XLNet). I would suggest using this option when available, because that token is trained exactly for this purpose.

If a [CLS] (or <s> or similar) token is not available, there are other options that fall under the term pooling. Max and mean pooling are often used: you take the element-wise max or the mean over all token vectors. As you say, the "danger" is that you reduce the whole sentence to "some average" or "some max" that might not be very representative of it. However, the literature shows that this works quite well.
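As an illustration (not part of the original answer), here is a minimal sketch of masked mean/max pooling, assuming a recent version of the transformers library where the tokenizer can be called directly on a list of sentences:

import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
model = CamembertModel.from_pretrained('camembert-base')

def pool(sentences, strategy='mean'):
    # batch-encode with padding so all sentences share one tensor shape
    enc = tokenizer(sentences, padding=True, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**enc)[0]                        # (batch, seq_len, 768)
    mask = enc['attention_mask'].unsqueeze(-1).float()  # 1 for real tokens, 0 for padding
    if strategy == 'mean':
        return (hidden * mask).sum(1) / mask.sum(1)     # average over real tokens only
    # max pooling: push padded positions to -inf so they are never selected
    return hidden.masked_fill(mask == 0, float('-inf')).max(1).values

Either way you get one fixed-size 768-dimensional vector per sentence, so the cosine similarity from the question works directly.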

As another answer suggests, the layer whose output you use can make a difference as well. IIRC the Google paper on BERT suggests that they got the best score when concatenating the last four layers. This is more advanced, and I will not go into it here unless requested.

I have no experience with fairseq, but using the transformers library, I'd write something like this (CamemBERT is available in the library from v2.2.0):

import torch
from transformers import CamembertModel, CamembertTokenizer

text = "Salut, comment vas-tu ?"

tokenizer = CamembertTokenizer.from_pretrained('camembert-base')

# encode() automatically adds the classification token <s> (and </s> at the end)
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)

# unsqueeze token_ids because batch_size=1
token_ids = torch.tensor(token_ids).unsqueeze(0)
print(token_ids)

# load model
model = CamembertModel.from_pretrained('camembert-base')

# the forward method returns a tuple (we only want the last hidden states)
# squeeze() because batch_size=1
output = model(token_ids)[0].squeeze()
# only grab output of CLS token (<s>), which is the first token
cls_out = output[0]
print(cls_out.size())

Printed output is (in order) the tokens after tokenisation, the token IDs, and the final size.

['<s>', '▁Salut', ',', '▁comment', '▁vas', '-', 'tu', '▁?', '</s>']
tensor([[   5, 5340,    7,  404, 4660,   26,  744,  106,    6]])
torch.Size([768])
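To connect this back to the question, the fixed-size <s> outputs can be compared directly. A small follow-up sketch (not part of the original answer), reusing the tokenizer and model loaded above:

def cls_embed(text):
    token_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
    return model(token_ids)[0].squeeze()[0]  # output of <s>, shape (768,)

u = cls_embed("Bonjour, ça va ?")
v = cls_embed("Salut, comment vas-tu ?")
print(torch.nn.functional.cosine_similarity(u, v, dim=0))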
Answered by Bram Vanroy


Bert-as-service is a great example of doing exactly what you are asking about.

They use padding. But read the FAQ about which layer to take the representation from and how to pool it: long story short, it depends on the task.

EDIT: I am not saying "use Bert-as-service"; I am saying "rip off what Bert-as-service does."

In your example, you are getting word embeddings (because of the layer you are extracting from); this is also how Bert-as-service does it. So it shouldn't surprise you that the output shape depends on sentence length.

You then talk about getting sentence embeddings by mean pooling over word embeddings. That is... a way to do it. But, using Bert-as-service as a guide for how to get a fixed-length representation from BERT:

Q: How do you get the fixed representation? Did you do pooling or something?

A: Yes, pooling is required to get a fixed representation of a sentence. In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.

So, to mimic Bert-as-service's default behavior, you'd do:

def embed(sentence):
    tokens = camembert.encode(sentence)
    # Extract all layers' features (layer 0 is the embedding layer)
    all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
    pooling_layer = all_layers[-2]    # second-to-last hidden layer, as in Bert-as-service
    embedded = pooling_layer.mean(1)  # 1 is the dimension you want to average over
    # note: taking the mean in numpy instead would move the tensor off the GPU
    return embedded
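For example (a hypothetical usage, reusing the camembert model loaded in the question), both embeddings now have shape (1, 768), so the cosine similarity from the question works:

u = embed("Bonjour, ça va ?")
v = embed("Salut, comment vas-tu ?")
print(torch.nn.CosineSimilarity(dim=1)(u, v))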
Answered by Sam H.