I am looking for some pointers on how to train a conventional neural network model with BERT embeddings that are generated dynamically (BERT's contextualized embeddings, which produce different vectors for the same word when it appears in different contexts).
In a normal neural network model, we would initialize the model with GloVe or fastText embeddings like this:
import torch.nn as nn
embed = nn.Embedding(vocab_size, vector_size)
embed.weight.data.copy_(some_variable_containing_vectors)
Instead of copying static vectors like this and using them for training, I want to pass every input through a BERT model, generate embeddings for its words on the fly, and feed those to the model for training.
So should I work on changing the forward function in the model to incorporate those embeddings?
Any help would be appreciated!
BERT is trained on and expects sentence pairs, using 1s and 0s to distinguish between the two sentences. That is, for each token in “tokenized_text,” we must specify which sentence it belongs to: sentence 0 (a series of 0s) or sentence 1 (a series of 1s).
The BERT base model uses 12 layers of transformer encoders, and the output for each token from each of these layers can be used as a word embedding!
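For instance, using the pytorch-pretrained-bert package introduced below, the per-layer, per-token hidden states can be pulled out roughly like this (the example sentence is made up):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = tokenizer.tokenize('[CLS] the bank of the river [SEP]')
token_ids = torch.LongTensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # encoded_layers is a list of 12 tensors, one per encoder layer,
    # each of shape (batch, num_tokens, hidden_size)
    encoded_layers, _ = model(token_ids, output_all_encoded_layers=True)

contextual_word_embeddings = encoded_layers[-1]  # last layer's output for every token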
BERT uses WordPiece embeddings as input for its tokens. Along with the token embeddings, BERT adds positional embeddings and segment embeddings for each token. Positional embeddings carry information about the position of a token in the sequence, and segment embeddings help when the model input contains sentence pairs.
In BERT the model must know whether a particular token belongs to sentence A or sentence B. This is achieved with another fixed embedding, called the segment embedding: one fixed vector for sentence A and another for sentence B. There are just two vector representations in the segment embeddings layer.
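As a rough illustration (the sentence pair here is made up), the segment IDs are just 0s and 1s passed alongside the token IDs via token_type_ids:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

sent_a = tokenizer.tokenize('[CLS] the man went to the store [SEP]')
sent_b = tokenizer.tokenize('he bought a gallon of milk [SEP]')
token_ids = torch.LongTensor([tokenizer.convert_tokens_to_ids(sent_a + sent_b)])
# 0 for every token of sentence A, 1 for every token of sentence B
segment_ids = torch.LongTensor([[0] * len(sent_a) + [1] * len(sent_b)])

with torch.no_grad():
    encoded_layers, pooled_output = model(token_ids, token_type_ids=segment_ids)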
If you are using PyTorch, you can use https://github.com/huggingface/pytorch-pretrained-BERT, which is the most popular BERT implementation for PyTorch (it is also a pip package!). Here I'm just going to outline how to use it properly.
For this particular problem there are two approaches, and in both of them you obviously cannot use the Embedding layer:
The first approach: you can write a loop that generates the BERT embeddings for your strings batch by batch, like this (batching is needed because BERT consumes a lot of GPU memory):
(Note: to be more proper you should also add attention masks - LongTensors of 1s and 0s that mask out the padding beyond each sentence's length; a minimal sketch is given after the code below.)
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

batch_size = 32
X_train, y_train = samples_from_file('train.csv')  # Put your own data loading function here
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Prepend [CLS] and append [SEP] to every sentence - this probably can be done in a cleaner way
X_train = [tokenizer.tokenize('[CLS] ' + sent + ' [SEP]') for sent in X_train]
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model = bert_model.cuda()
bert_model.eval()  # disable dropout so the embeddings are deterministic

X_train_tokens = [tokenizer.convert_tokens_to_ids(sent) for sent in X_train]
results = torch.zeros((len(X_train_tokens), bert_model.config.hidden_size))
with torch.no_grad():
    for stidx in range(0, len(X_train_tokens), batch_size):
        batch = X_train_tokens[stidx:stidx + batch_size]
        # Pad every sequence with zeros up to the longest one in the batch
        max_len = max(len(ids) for ids in batch)
        X = torch.LongTensor([ids + [0] * (max_len - len(ids)) for ids in batch]).cuda()
        _, pooled_output = bert_model(X)
        results[stidx:stidx + batch_size, :] = pooled_output.cpu()
After this you obtain the results tensor, which contains the calculated embeddings, and you can use it as the input to your model.
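To add the attention masks mentioned in the note above, you can pad each batch and build a matching mask of 1s and 0s. A minimal sketch, where pad_batch is a hypothetical helper rather than part of the library:

import torch

def pad_batch(token_id_lists, pad_id=0):
    # Hypothetical helper: pad every sequence to the longest one in the batch
    # and build the matching attention mask (1 = real token, 0 = padding).
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = torch.full((len(token_id_lists), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(token_id_lists), max_len), dtype=torch.long)
    for i, ids in enumerate(token_id_lists):
        input_ids[i, :len(ids)] = torch.LongTensor(ids)
        attention_mask[i, :len(ids)] = 1
    return input_ids, attention_mask

# Inside the batching loop above, the manual padding can then be replaced with:
# X, mask = pad_batch(X_train_tokens[stidx:stidx + batch_size])
# _, pooled_output = bert_model(X.cuda(), attention_mask=mask.cuda())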
The full (and more proper) code for this is provided here.
This method has the advantage of not having to re-calculate these embeddings every epoch.
With this method, e.g. for classification, your model only needs to consist of a Linear(bert_model.config.hidden_size, num_labels) layer, and the inputs to the model should be the results tensor from the code above.
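For illustration, here is a minimal sketch of such a classifier on top of the pre-computed embeddings (the label tensor, epoch count and learning rate are assumptions, not from the original answer):

import torch
import torch.nn as nn

num_labels = 2  # assumption: binary classification
classifier = nn.Linear(bert_model.config.hidden_size, num_labels).cuda()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

X = results.cuda()                    # pre-computed BERT embeddings from the code above
y = torch.LongTensor(y_train).cuda()  # labels from the data loading step above

for epoch in range(10):
    optimizer.zero_grad()
    logits = classifier(X)            # shape (num_samples, num_labels)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()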
The second approach: the repo already provides models for common downstream tasks (e.g. BertForSequenceClassification). It should also be easy to implement your own custom classes that inherit from BertPreTrainedModel and utilize the various BERT classes from the repo. For example, you can use:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)  # Where num_labels is the number of labels you need to classify
After that you can continue with the preprocessing, up until generating the token IDs. Then you can train the entire model (but with a low learning rate, e.g. Adam with 3e-5 for batch_size = 32).
With this you can fine-tune BERT's embeddings themselves, or use techniques like freezing BERT for a few epochs to train only the classifier and then unfreezing it to fine-tune the whole model, and so on. But it is also more computationally expensive.
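A rough sketch of the freeze-then-unfreeze idea (the train helper and the epoch counts here are hypothetical, not from the repo):

import torch

# model is the BertForSequenceClassification instance from above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)

# Phase 1: freeze the BERT encoder and train only the classification head
for param in model.bert.parameters():
    param.requires_grad = False
train(model, optimizer, epochs=2)   # hypothetical training loop

# Phase 2: unfreeze BERT and fine-tune the entire model at the low learning rate
for param in model.bert.parameters():
    param.requires_grad = True
train(model, optimizer, epochs=3)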
An example for this is also provided in the repo.