HuggingFace BERT `inputs_embeds` giving unexpected result

The HuggingFace BERT TensorFlow implementation allows us to feed in a precomputed embedding in place of the embedding lookup that is native to BERT. This is done via the optional inputs_embeds parameter of the model's call method (used in place of input_ids). To test this out, I wanted to make sure that if I fed in BERT's own embedding lookup, I would get the same result as feeding in the input_ids themselves.

The result of BERT's embedding lookup can be obtained by setting the BERT configuration parameter output_hidden_states to True and extracting the first tensor from the last element of the call method's output. (The remaining 12 tensors in that element correspond to the outputs of the 12 hidden layers.)

Thus, I wrote the following code to test my hypothesis:

import tensorflow as tf
from transformers import BertConfig, BertTokenizer, TFBertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids = tf.constant(bert_tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]
attention_mask = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])
token_type_ids = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])

config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
bert_model = TFBertModel.from_pretrained('bert-base-uncased', config=config)

result = bert_model(inputs={'input_ids': input_ids,
                            'attention_mask': attention_mask,
                            'token_type_ids': token_type_ids})
inputs_embeds = result[-1][0]  # first tensor of the hidden states, i.e. the embedding output
result2 = bert_model(inputs={'inputs_embeds': inputs_embeds,
                             'attention_mask': attention_mask,
                             'token_type_ids': token_type_ids})

print(tf.reduce_sum(tf.abs(result[0] - result2[0])))  # 458.2522, should be 0

Again, the output of the call method is a tuple. The first element of this tuple is the output of the last layer of BERT. Thus, I expected result[0] and result2[0] to match. Why is this not the case?
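
To be explicit about the tuple structure I am assuming here (the unpacked names below are mine, not necessarily the library's):

# With output_hidden_states=True, the call appears to return a 3-tuple.
sequence_output, pooled_output, hidden_states = result
# sequence_output: (1, seq_len, 768), the output of the final encoder layer, i.e. result[0]
# pooled_output:   (1, 768), the pooler output
# hidden_states:   tuple of 13 tensors; hidden_states[0] is the embedding output, i.e. result[-1][0]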

I am using Python 3.6.10 with tensorflow version 2.1.0 and transformers version 2.5.1.

EDIT: Looking at some of the HuggingFace code, it seems that the raw embeddings (looked up when input_ids is given, or taken as-is when inputs_embeds is given) are added to the positional and token type embeddings before being fed into the subsequent layers. If that is the case, then what I get from result[-1][0] may be the raw embedding plus the positional and token type embeddings, which would mean those are erroneously added in a second time when I feed result[-1][0] back in as inputs_embeds to compute result2.
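
To illustrate what I think is happening, here is a rough sketch of what the embedding layer appears to compute when inputs_embeds is supplied. This is not the actual HuggingFace source, and the sub-layer names (position_embeddings, token_type_embeddings, LayerNorm) are my assumptions about bert_model.bert.embeddings:

def embed_sketch(inputs_embeds, token_type_ids, embeddings_layer):
    # embeddings_layer stands for bert_model.bert.embeddings (assumed attribute names).
    seq_len = tf.shape(inputs_embeds)[1]
    position_ids = tf.range(seq_len)[tf.newaxis, :]
    position_embeds = embeddings_layer.position_embeddings(position_ids)
    token_type_embeds = embeddings_layer.token_type_embeddings(token_type_ids)
    # Whatever arrives as inputs_embeds gets the positional and token type
    # embeddings added on top of it before LayerNorm (dropout omitted here).
    return embeddings_layer.LayerNorm(inputs_embeds + position_embeds + token_type_embeds)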

Could someone please tell me if this is the case and if so, please explain how to get the positional and token type embeddings, so I can subtract them out? Below is what I came up with for positional embeddings based on the equations given here (but according to the BERT paper, the positional embeddings may actually be learned, so I'm not sure if these are valid):

import numpy as np

positional_embeddings = np.stack([np.zeros(shape=(len(sent),768)) for sent in input_ids])
for s in range(len(positional_embeddings)):
    for i in range(len(positional_embeddings[s])):
        for j in range(len(positional_embeddings[s][i])):
            if j % 2 == 0:
                positional_embeddings[s][i][j] = np.sin(i/np.power(10000., j/768.))
            else:
                positional_embeddings[s][i][j] = np.cos(i/np.power(10000., (j-1.)/768.))
positional_embeddings = tf.constant(positional_embeddings, dtype=tf.float32)  # match BERT's float32 outputs
inputs_embeds -= positional_embeddings  # subtract them back out before reusing as inputs_embeds
Asked May 02 '20 by Vivek Subramanian


1 Answer

My intuition about positional and token type embeddings being added in turned out to be correct. After looking closely at the code, I replaced the line:

inputs_embeds = result[-1][0]

with the lines:

embeddings = bert_model.bert.get_input_embeddings().word_embeddings  # raw token embedding matrix, shape (vocab_size, 768)
inputs_embeds = tf.gather(embeddings, input_ids)  # plain lookup, no positional or token type embeddings added

Now, the difference is 0.0, as expected.
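
For completeness, here is the end of the script from the question with that change applied:

embeddings = bert_model.bert.get_input_embeddings().word_embeddings
inputs_embeds = tf.gather(embeddings, input_ids)
result2 = bert_model(inputs={'inputs_embeds': inputs_embeds,
                             'attention_mask': attention_mask,
                             'token_type_ids': token_type_ids})

print(tf.reduce_sum(tf.abs(result[0] - result2[0])))  # now prints 0.0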

Answered Sep 27 '22 by Vivek Subramanian