How to train a neural network model with BERT embeddings instead of static embeddings like GloVe/fastText?

I am looking for some pointers on how to train a conventional neural network model with BERT embeddings that are generated dynamically (BERT produces contextualized embeddings, so the same word gets different embeddings when it appears in different contexts).

In a normal neural network model, we would initialize the model with GloVe or fastText embeddings like this:

import torch.nn as nn

embed = nn.Embedding(vocab_size, vector_size)
embed.weight.data.copy_(some_variable_containing_vectors)

Instead of copying static vectors like this and using them for training, I want to pass every input to a BERT model, generate embeddings for the words on the fly, and feed those to the model for training.

So should I change the forward function in the model to incorporate those embeddings?

Any help would be appreciated!

asked Mar 27 '19 by Arjun Sankarlal
People also ask

How are BERT embeddings trained?

BERT is trained on and expects sentence pairs, using 1s and 0s to distinguish between the two sentences. That is, for each token in “tokenized_text,” we must specify which sentence it belongs to: sentence 0 (a series of 0s) or sentence 1 (a series of 1s).
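For a concrete picture of those 0s and 1s, here is a minimal sketch using the same pytorch-pretrained-BERT tokenizer as in the answer below (the example sentences are just placeholders):

from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Two sentences packed into a single BERT input
tokenized_text = tokenizer.tokenize('[CLS] the cat sat on the mat [SEP] it fell asleep [SEP]')

# Everything up to and including the first [SEP] is sentence 0, the rest is sentence 1
first_sep = tokenized_text.index('[SEP]')
segment_ids = [0] * (first_sep + 1) + [1] * (len(tokenized_text) - first_sep - 1)

print(list(zip(tokenized_text, segment_ids)))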

Can I use BERT for word Embeddings?

The BERT base model uses 12 layers of transformer encoders, and the per-token output of any of these layers can be used as a word embedding.
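As a rough sketch of what that looks like with the pytorch-pretrained-BERT package used in the answer below (whose BertModel returns one hidden-state tensor per encoder layer by default):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = tokenizer.tokenize('[CLS] the bank of the river [SEP]')
input_ids = torch.LongTensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # encoded_layers is a list with one (batch, seq_len, hidden_size) tensor per layer
    encoded_layers, _ = model(input_ids)

print(len(encoded_layers))                  # 12 for bert-base
per_token_embeddings = encoded_layers[-1]   # token vectors from the top layer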

Does BERT have an embedding layer?

BERT uses WordPiece embeddings as input for tokens. Along with the token embeddings, BERT uses positional embeddings and segment embeddings for each token. Positional embeddings contain information about the position of a token in the sequence; segment embeddings help when the model input contains sentence pairs.
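Those three embedding tables can be inspected directly on the model; a quick sketch (attribute names as in the pytorch-pretrained-BERT implementation, which may differ in other versions):

from pytorch_pretrained_bert import BertModel

model = BertModel.from_pretrained('bert-base-uncased')

# WordPiece/token embeddings: one vector per vocabulary entry
print(model.embeddings.word_embeddings.weight.shape)        # (30522, 768) for bert-base-uncased
# Positional embeddings: one vector per position, up to the maximum sequence length
print(model.embeddings.position_embeddings.weight.shape)    # (512, 768)
# Segment (token type) embeddings: just two vectors, one for sentence A and one for sentence B
print(model.embeddings.token_type_embeddings.weight.shape)  # (2, 768)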

How are BERT embeddings created?

In BERT, the model must know whether a particular token belongs to sentence A or sentence B. This is achieved by adding another, fixed vector, called the segment embedding: one for sentence A and one for sentence B. There are just two vector representations in the segment embeddings layer.


1 Answer

If you are using PyTorch, you can use https://github.com/huggingface/pytorch-pretrained-BERT, which is the most popular BERT implementation for PyTorch (it is also a pip package!). Here I'm just going to outline how to use it properly.

For this particular problem there are two approaches (in both, you obviously cannot use a plain Embedding layer):

  1. You can incorporate generating BERT embeddings into your data preprocessing pipeline. You will need to use BERT's own tokenizer and word-to-ids dictionary. The repo's README has examples of preprocessing.

You can write a loop for generating BERT embeddings for your strings like this (processing in batches, because BERT consumes a lot of GPU memory):

(Note: to be more proper, you should also add attention masks, which are LongTensors of 1s and 0s that mask out the padding beyond each sentence's length; a sketch of this is given right after the code below.)

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

batch_size = 32
X_train, y_train = samples_from_file('train.csv') # Put your own data loading function here
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Prepend [CLS] and append [SEP] tokens - this probably can be done in a cleaner way
X_train = [tokenizer.tokenize('[CLS] ' + sent + ' [SEP]') for sent in X_train]
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model = bert_model.cuda()

X_train_tokens = [tokenizer.convert_tokens_to_ids(sent) for sent in X_train]
# Each batch must be padded to a common length before it can become a LongTensor
# (padding omitted here for brevity - see the sketch below)
results = torch.zeros((len(X_train_tokens), bert_model.config.hidden_size))
with torch.no_grad():
    for stidx in range(0, len(X_train_tokens), batch_size):
        X = X_train_tokens[stidx:stidx + batch_size]
        X = torch.LongTensor(X).cuda()
        _, pooled_output = bert_model(X)
        results[stidx:stidx + batch_size, :] = pooled_output.cpu()
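As the note above says, attention masks (and padding, since sentences in a batch have different lengths) are needed before a batch can become a single LongTensor. A minimal sketch of one way to build both; the pad_batch helper here is my own assumption, not part of the original answer:

def pad_batch(batch_token_ids, pad_id=0):  # 0 is the id of [PAD] in the BERT vocab
    """Pad a list of token-id lists to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch_token_ids)
    padded, masks = [], []
    for ids in batch_token_ids:
        pad_len = max_len - len(ids)
        padded.append(ids + [pad_id] * pad_len)
        masks.append([1] * len(ids) + [0] * pad_len)
    return torch.LongTensor(padded), torch.LongTensor(masks)

# Inside the loop above, instead of torch.LongTensor(X):
# X, attn_mask = pad_batch(X_train_tokens[stidx:stidx + batch_size])
# _, pooled_output = bert_model(X.cuda(), attention_mask=attn_mask.cuda())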

Once the loop completes, the results tensor contains the calculated embeddings, and you can use it as the input to your model.

The full (and more proper) code for this is provided here.

This method has the advantage of not having to re-calculate these embeddings every epoch.

With this method, for classification for example, your model need only consist of a Linear(bert_model.config.hidden_size, num_labels) layer, and its input should be the results tensor from the code above.
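A minimal sketch of that downstream classifier, assuming the results tensor and y_train from the code above (num_labels and the training-loop details are placeholders):

import torch
import torch.nn as nn

num_labels = 2  # placeholder: number of classes in your task
classifier = nn.Linear(bert_model.config.hidden_size, num_labels)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

labels = torch.LongTensor(y_train)
for epoch in range(10):
    optimizer.zero_grad()
    logits = classifier(results)   # results holds the precomputed BERT embeddings
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
# Full-batch for brevity; in practice you would iterate over mini-batches.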

  2. The second, and arguably cleaner, method: if you check out the repo, you will find there are wrappers for various tasks (e.g. BertForSequenceClassification). It should also be easy to implement your own custom classes that inherit from BertPreTrainedModel and utilize the various BERT classes from the repo; a rough sketch of that pattern is given at the end of this answer.

For example, you can use:

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels) # where num_labels is the number of classes you need to classify

After that you can continue with the preprocessing, up until generating the token ids. Then you can train the entire model (but with a low learning rate, e.g. Adam at 3e-5 with batch_size = 32).

With this you can fine-tune BERT's embeddings themselves, or use techniques like freezing BERT for a few epochs to train only the classifier and then unfreezing it for further fine-tuning (a sketch of the freezing step is shown below). However, this approach is also more computationally expensive.
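A rough sketch of the freezing/unfreezing step (the encoder sits under model.bert in the pytorch-pretrained-BERT wrappers):

# Freeze the BERT encoder so only the classification head trains at first
for param in model.bert.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=3e-5)

# ... train the classifier head for a few epochs ...

# Then unfreeze and fine-tune everything (rebuild the optimizer so it also
# sees the BERT parameters)
for param in model.bert.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)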

An example of this is also provided in the repo.
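For the custom-class route mentioned in option 2, the usual pattern looks roughly like the sketch below (in recent versions of the package the base class is called BertPreTrainedModel; names may differ slightly between versions):

import torch.nn as nn
from pytorch_pretrained_bert.modeling import BertPreTrainedModel, BertModel

class BertCustomClassifier(BertPreTrainedModel):
    def __init__(self, config, num_labels=2):
        super(BertCustomClassifier, self).__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        # Use the pooled [CLS] output of the top layer for classification
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
                                     output_all_encoded_layers=False)
        return self.classifier(self.dropout(pooled_output))

model = BertCustomClassifier.from_pretrained('bert-base-uncased', num_labels=2)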

answered Sep 21 '22 by luungoc2005