Updating a BERT model through Huggingface transformers

I am attempting to update the pre-trained BERT model using an in-house corpus. I have looked at the Huggingface transformers docs and I am a little stuck, as you will see below. My goal is to compute simple similarities between sentences using the cosine distance, but I need to update the pre-trained model for my specific use case.

The code below is taken directly from the Huggingface docs. I am attempting to "retrain" or update the model, and I assumed that SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 represent "new sentences" from my in-house corpus. Is this correct? In summary, I like the pre-trained BERT model, but I would like to update it by continuing training on another in-house dataset. Any leads will be appreciated.

import tensorflow as tf
import tensorflow_datasets
from transformers import *

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

SPECIAL_TOKEN_1="dogs are very cute"
SPECIAL_TOKEN_2="dogs are cute but i like cats better and my brother thinks they are more cute"

tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))
#Train our model
model.train()
model.eval()

592

asked Oct 30 '19 07:10

user8291021

1 Answer

BERT is pre-trained on two tasks: masked language modeling (MLM) and next sentence prediction (NSP). The more important of the two is MLM. It turns out that the next sentence prediction task is not really that helpful for the model's language understanding capabilities; RoBERTa, for example, is pre-trained on MLM only.

If you want to further train the model on your own dataset, you can do so by using BertForMaskedLM from the Transformers library. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:

from transformers import BertTokenizer, BertForMaskedLM 
import torch   

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True) 

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt") 
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

outputs = model(**inputs, labels=labels) 
loss = outputs.loss 
logits = outputs.logits
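
In the snippet above the labels cover every position of the sentence. When training on a whole corpus, the masked positions are usually chosen at random and only those positions are scored. The following is a minimal, library-free sketch of that idea, not the library's own implementation. The function name `mask_for_mlm` and its parameters are hypothetical, the 15% default follows the BERT paper, and the sketch always substitutes the mask token (original BERT also sometimes keeps the token or swaps in a random one). The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_for_mlm(input_ids, mask_token_id, special_token_ids,
                 mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one tokenized sentence.

    Hypothetical helper: each non-special token is replaced by the mask
    token with probability `mlm_probability`. Labels hold the original id
    at masked positions and IGNORE_INDEX everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked_ids = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for position, token_id in enumerate(input_ids):
        if token_id in special_token_ids:
            continue  # never mask [CLS], [SEP], [PAD], ...
        if rng.random() < mlm_probability:
            labels[position] = token_id          # the model must predict this id
            masked_ids[position] = mask_token_id  # and only sees [MASK]
    return masked_ids, labels


# Usage sketch (assumes `tokenizer` is the BertTokenizer loaded above):
#   ids = tokenizer("dogs are very cute")["input_ids"]
#   masked_ids, labels = mask_for_mlm(
#       ids, tokenizer.mask_token_id, set(tokenizer.all_special_ids))
```

If you would rather not write this yourself, the library's DataCollatorForLanguageModeling performs the masking for you when batching.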

You can update the weights of BertForMaskedLM by calling loss.backward() and then stepping an optimizer, which is the standard way of training PyTorch models. If you don't want to write the training loop yourself, the Transformers library also provides a Python script which allows you to run MLM training on your own dataset quickly. See the language-modeling examples in the Transformers repository (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training file and a test file.
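
For the manual route, a single update could look roughly like the sketch below. The helper name `mlm_training_step` is hypothetical. It assumes `model` behaves like BertForMaskedLM (returning an object with a `.loss` attribute when labels are passed), `optimizer` behaves like a PyTorch optimizer, and `batch` is a dict of tensors that includes "labels". The learning rate in the usage comment is an illustrative assumption, not a recommendation from the docs.

```python
def mlm_training_step(model, optimizer, batch):
    """Run one gradient update and return the loss as a Python float.

    Hypothetical helper; `batch` is expected to contain input_ids,
    attention_mask and labels, exactly what model(**batch) accepts.
    """
    model.train()            # enable dropout and other training-time behaviour
    optimizer.zero_grad()    # clear gradients left over from the previous step
    outputs = model(**batch) # forward pass; the loss is computed from the labels
    outputs.loss.backward()  # backpropagate through the network
    optimizer.step()         # apply the weight update
    return float(outputs.loss.item())


# Usage sketch (assumes torch is installed and that `model`, `inputs` and
# `labels` come from the snippet above; lr=5e-5 is an assumed value):
#   optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
#   batch = {**inputs, "labels": labels}
#   for _ in range(3):
#       print(mlm_training_step(model, optimizer, batch))
#   model.eval()  # switch dropout off again before extracting embeddings
```

Note that model.train() by itself does not train anything. It only switches the module into training mode, which is why calling model.train() followed directly by model.eval() leaves the weights unchanged.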

You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PreTrainedTokenizer, and the documentation of its __call__ method shows that the add_special_tokens parameter defaults to True.
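
As for the stated goal of comparing sentences by cosine similarity, one common approach is to average the token vectors of each sentence (mean pooling) and compare the averages. This is an assumption on my part, not something BERT prescribes. The sketch below is library-free so the arithmetic is visible, and the helper names `mean_pool` and `cosine_similarity` are hypothetical. In practice you would do the same with tensors taken from the model's last hidden state.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; cosine distance is 1 minus this."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Usage sketch (assumes torch and transformers are installed, `model` is a
# BertModel in eval mode and `tokenizer` is the matching BertTokenizer):
#   enc = tokenizer("dogs are very cute", return_tensors="pt")
#   with torch.no_grad():
#       hidden = model(**enc, return_dict=True).last_hidden_state[0].tolist()
#   sentence_vec = mean_pool(hidden, enc["attention_mask"][0].tolist())
```

Computing the sentence vectors before and after the additional MLM training on your corpus gives you a direct way to check whether the update actually changes the similarities you care about.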

149

answered Oct 20 '22 07:10

Niels

Tags: tensorflow, nlp, pytorch, spacy, huggingface-transformers
