The HuggingFace BERT TensorFlow implementation allows us to feed in a precomputed embedding in place of the embedding lookup that is native to BERT. This is done using the model's call method's optional parameter inputs_embeds (in place of input_ids). To test this out, I wanted to make sure that if I did feed in BERT's embedding lookup, I would get the same result as having fed in the input_ids themselves.
The result of BERT's embedding lookup can be obtained by setting the BERT configuration parameter output_hidden_states to True and extracting the first tensor from the last output of the call method. (The remaining 12 outputs correspond to each of the 12 hidden layers.)
Thus, I wrote the following code to test my hypothesis:
import tensorflow as tf
from transformers import BertConfig, BertTokenizer, TFBertModel
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_ids = tf.constant(bert_tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]
attention_mask = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])
token_type_ids = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
bert_model = TFBertModel.from_pretrained('bert-base-uncased', config=config)
result = bert_model(inputs={'input_ids': input_ids,
                            'attention_mask': attention_mask,
                            'token_type_ids': token_type_ids})
inputs_embeds = result[-1][0]
result2 = bert_model(inputs={'inputs_embeds': inputs_embeds,
                             'attention_mask': attention_mask,
                             'token_type_ids': token_type_ids})
print(tf.reduce_sum(tf.abs(result[0] - result2[0]))) # 458.2522, should be 0
Again, the output of the call method is a tuple. The first element of this tuple is the output of the last layer of BERT. Thus, I expected result[0] and result2[0] to match. Why is this not the case?
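As a quick sanity check (just to rule out a shape mismatch before comparing values):
print(result[0].shape)   # (1, seq_len, 768): last-layer output computed from input_ids
print(result2[0].shape)  # (1, seq_len, 768): last-layer output computed from inputs_embeds
print(tf.reduce_max(tf.abs(result[0] - result2[0])))  # largest per-element difference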
I am using Python 3.6.10 with tensorflow version 2.1.0 and transformers version 2.5.1.
EDIT: Looking at some of the HuggingFace code, it seems that the raw embeddings that are looked up when input_ids is given, or assigned when inputs_embeds is given, are added to the positional embeddings and token type embeddings before being fed into subsequent layers. If this is the case, then it may be possible that what I'm getting from result[-1][0] is the raw embedding plus the positional and token type embeddings. This would mean that they are erroneously getting added in again when I feed result[-1][0] as inputs_embeds in order to calculate result2.
Could someone please tell me whether this is the case and, if so, explain how to get the positional and token type embeddings so that I can subtract them out? Below is what I came up with for positional embeddings, based on the equations given here (but according to the BERT paper, the positional embeddings may actually be learned, so I'm not sure whether these are valid):
import numpy as np
# Sinusoidal positional embeddings as in "Attention Is All You Need".
positional_embeddings = np.stack([np.zeros(shape=(len(sent), 768)) for sent in input_ids])
for s in range(len(positional_embeddings)):
    for i in range(len(positional_embeddings[s])):
        for j in range(len(positional_embeddings[s][i])):
            if j % 2 == 0:
                positional_embeddings[s][i][j] = np.sin(i / np.power(10000., j / 768.))
            else:
                positional_embeddings[s][i][j] = np.cos(i / np.power(10000., (j - 1.) / 768.))
positional_embeddings = tf.constant(positional_embeddings, dtype=tf.float32)  # cast to match BERT's float32
inputs_embeds += positional_embeddings
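Since BERT's positional embeddings are actually learned rather than sinusoidal, it may make more sense to read them (and the token type embeddings) directly off the model's embedding layer. The sketch below assumes the transformers 2.x TF implementation, where the layer returned by get_input_embeddings() exposes position_embeddings and token_type_embeddings as Keras Embedding sublayers; note also that the embedding layer applies a LayerNorm after summing, so subtracting these from result[-1][0] still would not recover the raw word embeddings exactly:
emb_layer = bert_model.bert.get_input_embeddings()     # the model's embeddings layer
seq_len = tf.shape(input_ids)[1]
position_ids = tf.range(seq_len)[None, :]               # (1, seq_len)
learned_position_embeds = emb_layer.position_embeddings(position_ids)                         # (1, seq_len, 768)
learned_token_type_embeds = emb_layer.token_type_embeddings(tf.cast(token_type_ids, tf.int32))  # (1, seq_len, 768)
# Subtraction alone is not exact, because the embedding layer also applies LayerNorm
# (and dropout at training time) to the summed embeddings.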
ANSWER: My intuition about the positional and token type embeddings being added in turned out to be correct. After looking closely at the code, I replaced the line:
inputs_embeds = result[-1][0]
with the lines:
embeddings = bert_model.bert.get_input_embeddings().word_embeddings
inputs_embeds = tf.gather(embeddings, input_ids)
Now, the difference is 0.0, as expected.
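Putting it together, the working comparison is simply the following (tf.gather on the word embedding matrix reproduces the lookup the model performs internally when it is given input_ids, at least in transformers 2.5.1; the positional and token type embeddings and the LayerNorm are then applied inside the model in both cases):
embeddings = bert_model.bert.get_input_embeddings().word_embeddings  # (vocab_size, 768)
inputs_embeds = tf.gather(embeddings, input_ids)                      # raw lookup, (1, seq_len, 768)
result2 = bert_model(inputs={'inputs_embeds': inputs_embeds,
                             'attention_mask': attention_mask,
                             'token_type_ids': token_type_ids})
print(tf.reduce_sum(tf.abs(result[0] - result2[0])))  # 0.0, as expected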