I am looking for some pointers on how to train a conventional neural network model with BERT embeddings that are generated dynamically (BERT's contextualized embeddings, which produce different vectors for the same word when it appears in different contexts).
In a normal neural network model, we would initialize the model with GloVe or fastText embeddings like this:
import torch.nn as nn
embed = nn.Embedding(vocab_size, vector_size)
embed.weight.data.copy_(some_variable_containing_vectors)
Instead of copying static vectors like this and using them for training, I want to pass every input through a BERT model, generate embeddings for its words on the fly, and feed those to the model for training.
So should I work on changing the forward function in the model to incorporate those embeddings?
Any help would be appreciated!
BERT is trained on and expects sentence pairs, using 1s and 0s to distinguish between the two sentences. That is, for each token in “tokenized_text,” we must specify which sentence it belongs to: sentence 0 (a series of 0s) or sentence 1 (a series of 1s).
The BERT base model uses 12 layers of transformer encoders, and the output for each token from each of these layers can be used as a word embedding!
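For instance, using the pytorch-pretrained-bert package introduced below, the per-layer, per-token hidden states can be pulled out roughly like this (the example sentence is made up):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = tokenizer.tokenize('[CLS] the bank of the river [SEP]')
token_ids = torch.LongTensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # encoded_layers is a list of 12 tensors, one per encoder layer,
    # each of shape (batch, num_tokens, hidden_size)
    encoded_layers, _ = model(token_ids, output_all_encoded_layers=True)

contextual_word_embeddings = encoded_layers[-1]  # last layer's output for every token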
BERT uses WordPiece embeddings as input for its tokens. Along with the token embeddings, BERT adds positional embeddings and segment embeddings for each token. Positional embeddings carry information about the position of a token in the sequence, and segment embeddings help when the model input contains sentence pairs.
In BERT the model must know whether a particular token belongs to sentence A or sentence B. This is achieved with another fixed embedding, called the segment embedding: one fixed vector for sentence A and another for sentence B. There are just two vector representations in the segment embeddings layer.
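As a rough illustration (the sentence pair here is made up), the segment IDs are just 0s and 1s passed alongside the token IDs via token_type_ids:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

sent_a = tokenizer.tokenize('[CLS] the man went to the store [SEP]')
sent_b = tokenizer.tokenize('he bought a gallon of milk [SEP]')
token_ids = torch.LongTensor([tokenizer.convert_tokens_to_ids(sent_a + sent_b)])
# 0 for every token of sentence A, 1 for every token of sentence B
segment_ids = torch.LongTensor([[0] * len(sent_a) + [1] * len(sent_b)])

with torch.no_grad():
    encoded_layers, pooled_output = model(token_ids, token_type_ids=segment_ids)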
If you are using PyTorch, you can use https://github.com/huggingface/pytorch-pretrained-BERT, which is the most popular BERT implementation for PyTorch (it is also a pip package!). Here I'm just going to outline how to use it properly.
For this particular problem there are two approaches, and in both of them you obviously cannot use the Embedding layer:
The first approach: you can write a loop that generates the BERT embeddings for your strings batch by batch, like this (batching is needed because BERT consumes a lot of GPU memory):
(Note: to be more proper you should also add attention masks - LongTensors of 1s and 0s that mask out the padding beyond each sentence's length; a minimal sketch is given after the code below.)
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

batch_size = 32
X_train, y_train = samples_from_file('train.csv')  # Put your own data loading function here
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Prepend [CLS] and append [SEP] to every sentence - this probably can be done in a cleaner way
X_train = [tokenizer.tokenize('[CLS] ' + sent + ' [SEP]') for sent in X_train]
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model = bert_model.cuda()
bert_model.eval()  # disable dropout so the embeddings are deterministic

X_train_tokens = [tokenizer.convert_tokens_to_ids(sent) for sent in X_train]
results = torch.zeros((len(X_train_tokens), bert_model.config.hidden_size))
with torch.no_grad():
    for stidx in range(0, len(X_train_tokens), batch_size):
        batch = X_train_tokens[stidx:stidx + batch_size]
        # Pad every sequence with zeros up to the longest one in the batch
        max_len = max(len(ids) for ids in batch)
        X = torch.LongTensor([ids + [0] * (max_len - len(ids)) for ids in batch]).cuda()
        _, pooled_output = bert_model(X)
        results[stidx:stidx + batch_size, :] = pooled_output.cpu()
After this you obtain the results tensor, which contains the calculated embeddings, and you can use it as the input to your model.
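To add the attention masks mentioned in the note above, you can pad each batch and build a matching mask of 1s and 0s. A minimal sketch, where pad_batch is a hypothetical helper rather than part of the library:

import torch

def pad_batch(token_id_lists, pad_id=0):
    # Hypothetical helper: pad every sequence to the longest one in the batch
    # and build the matching attention mask (1 = real token, 0 = padding).
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = torch.full((len(token_id_lists), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(token_id_lists), max_len), dtype=torch.long)
    for i, ids in enumerate(token_id_lists):
        input_ids[i, :len(ids)] = torch.LongTensor(ids)
        attention_mask[i, :len(ids)] = 1
    return input_ids, attention_mask

# Inside the batching loop above, the manual padding can then be replaced with:
# X, mask = pad_batch(X_train_tokens[stidx:stidx + batch_size])
# _, pooled_output = bert_model(X.cuda(), attention_mask=mask.cuda())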
The full (and more proper) code for this is provided here.
This method has the advantage of not having to re-calculate these embeddings every epoch.
With this method, e.g. for classification, your model only needs to consist of a Linear(bert_model.config.hidden_size, num_labels) layer, and the inputs to the model should be the results tensor from the code above.
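For illustration, here is a minimal sketch of such a classifier on top of the pre-computed embeddings (the label tensor, epoch count and learning rate are assumptions, not from the original answer):

import torch
import torch.nn as nn

num_labels = 2  # assumption: binary classification
classifier = nn.Linear(bert_model.config.hidden_size, num_labels).cuda()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

X = results.cuda()                    # pre-computed BERT embeddings from the code above
y = torch.LongTensor(y_train).cuda()  # labels from the data loading step above

for epoch in range(10):
    optimizer.zero_grad()
    logits = classifier(X)            # shape (num_samples, num_labels)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()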
The second approach: the repo already provides models for common downstream tasks (e.g. BertForSequenceClassification). It should also be easy to implement your own custom classes that inherit from BertPreTrainedModel and utilize the various BERT classes from the repo. For example, you can use:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)  # Where num_labels is the number of labels you need to classify
After that you can continue with the preprocessing, up until generating the token IDs. Then you can train the entire model (but with a low learning rate, e.g. Adam with 3e-5 for batch_size = 32).
With this you can fine-tune BERT's embeddings themselves, or use techniques like freezing BERT for a few epochs to train only the classifier and then unfreezing it to fine-tune the whole model, and so on. But it is also more computationally expensive.
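A rough sketch of the freeze-then-unfreeze idea (the train helper and the epoch counts here are hypothetical, not from the repo):

import torch

# model is the BertForSequenceClassification instance from above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)

# Phase 1: freeze the BERT encoder and train only the classification head
for param in model.bert.parameters():
    param.requires_grad = False
train(model, optimizer, epochs=2)   # hypothetical training loop

# Phase 2: unfreeze BERT and fine-tune the entire model at the low learning rate
for param in model.bert.parameters():
    param.requires_grad = True
train(model, optimizer, epochs=3)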
An example for this is also provided in the repo.