 

Pretraining a language model on a small custom corpus

I was curious whether it is possible to use transfer learning for text generation, i.e. to re-train/pre-train an existing language model on a specific kind of text.

For example, having a pre-trained BERT model and a small corpus of medical (or any "type" of) text, build a language model that is able to generate medical text. The assumption is that you do not have a huge amount of medical text, which is why you have to use transfer learning.

Putting it as a pipeline, I would describe this as (a rough sketch follows the list):

  1. Using a pre-trained BERT tokenizer.
  2. Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
  3. Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
  4. Generating text that resembles the text within the small custom corpus.
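
Roughly, I picture the steps looking something like the following minimal sketch, assuming the Hugging Face transformers API (the added tokens are hypothetical and the training loop is omitted; this is an idea sketch, not a tested recipe):

    # Sketch of steps 1-3 above (Hugging Face transformers assumed; the new
    # tokens are hypothetical examples and the training loop is omitted).
    from transformers import BertTokenizer, BertForMaskedLM

    # 1. Pre-trained BERT tokenizer.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # 2. Add domain-specific tokens collected from the small corpus.
    new_tokens = ["angioplasty", "tachycardia"]  # hypothetical medical tokens
    tokenizer.add_tokens(new_tokens)

    # 3. Load vanilla BERT and grow its embedding matrix to cover the new
    #    tokens, then continue training it on the custom corpus.
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.resize_token_embeddings(len(tokenizer))

    # 4. After further pre-training, the MLM head can fill in masked slots to
    #    produce text that resembles the custom corpus.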

Does this sound familiar? Is it possible with Hugging Face?

asked Apr 24 '20 by John Sig

People also ask

What is Pretraining in NLP?

In AI, pre-training imitates the way human beings process new knowledge: the model parameters learned on previous tasks are used to initialize the model parameters for new tasks. In this way, the old knowledge helps the new model perform the new tasks from prior experience instead of from scratch.

What is model Pretraining?

What is a pre-trained model? Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use a model trained on another problem as a starting point; for example, if you want to build a self-learning car, you could start from an existing image-recognition model rather than training one from scratch.

What is a pre-trained language model?

What are pre-trained language models? The intuition behind pre-trained language models is to create a black box which understands the language and can then be asked to do any specific task in that language. The idea is to create the machine equivalent of a ‘well-read’ human being.

Why do we need pre-trained models in NLP?

A model is first trained on a large, generic dataset; then the same model is repurposed to perform different NLP tasks on a new dataset. The pre-trained model only needs fine-tuning to solve a specific problem, which saves a lot of the time and computational resources required to build a new language model from scratch.

How do I train a BERT model with masked language modeling?

There are two common ways to continue pre-training:

  1. Starting with a pre-trained BERT checkpoint and continuing the pre-training with both the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) heads (e.g. using the BertForPreTraining model).
  2. Starting with a pre-trained BERT model and continuing with the MLM objective only (e.g. using the BertForMaskedLM model, assuming we don't need NSP for the pre-training part).
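
For the second option, a minimal MLM-only sketch might look like this (assuming the Hugging Face transformers Trainer API; corpus.txt and the hyperparameters are placeholders, not a recommended configuration):

    # MLM-only continued pre-training (option 2). corpus.txt is a placeholder
    # file with one training sentence per line.
    from transformers import (BertTokenizerFast, BertForMaskedLM,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    # Tokenize the custom corpus into truncated examples.
    with open("corpus.txt") as f:
        lines = [line.strip() for line in f if line.strip()]
    encodings = tokenizer(lines, truncation=True, max_length=128)
    train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

    # The collator pads each batch and randomly masks 15% of the tokens,
    # producing the labels for the MLM loss.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-mlm-domain",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        data_collator=collator,
        train_dataset=train_dataset,
    )
    trainer.train()
    trainer.save_model("bert-mlm-domain")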

What is further pre-training in machine learning?

Further pre-training means taking an already pre-trained model and applying transfer learning: reuse the saved weights of the trained model and continue training it on data from a new domain. This is usually beneficial if you don't have a very large corpus.


1 Answer

I have not heard of the pipeline you just mentioned. In order to construct an LM for your use-case, you have basically two options:

  1. Further training the BERT (-base/-large) model on your own corpus. This process is called domain adaptation, as also described in this recent paper. It will adapt the learned parameters of the BERT model to your specific domain (bio/medical text). Nonetheless, for this setting you will need quite a large corpus to help the BERT model update its parameters well.

  2. Using a language model that is pre-trained on a large amount of domain-specific text, either from scratch or fine-tuned from the vanilla BERT model. As you might know, the vanilla BERT model released by Google was trained on Wikipedia and BookCorpus text. After vanilla BERT, researchers trained the BERT architecture on other domains besides the initial data collections. You may be able to use these pre-trained models, which have a deep understanding of domain-specific language. For your case there are models such as BioBERT, BlueBERT, and SciBERT (a loading sketch follows below).
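
For instance, loading one of these domain-specific checkpoints might look like the following sketch (the model identifiers are the Hugging Face Hub names I believe these checkpoints are published under; please verify them on the Hub):

    # Loading a domain-specific BERT variant instead of vanilla BERT.
    # The model identifiers below are assumed Hub names (verify on huggingface.co).
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_name = "dmis-lab/biobert-base-cased-v1.1"   # BioBERT
    # model_name = "allenai/scibert_scivocab_uncased" # SciBERT

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # The model can then be further pre-trained or fine-tuned on the small
    # medical corpus exactly as with vanilla BERT.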

Is it possible with Hugging Face?

I am not sure whether the Hugging Face developers have a robust approach for pre-training a BERT model on custom corpora yet, as their code for this is reportedly still in progress. If you are interested in doing this step, I suggest using Google Research's BERT code, which is written in TensorFlow and is very robust (it was released by BERT's authors). In their README, under the "Pre-training with BERT" section, the exact process is described. This will give you a TensorFlow checkpoint, which can easily be converted to a PyTorch checkpoint if you'd like to work with PyTorch/Transformers.
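
As a rough sketch of that last conversion step (assuming the transformers library's from_tf loading support; the checkpoint and config paths are placeholders from a hypothetical pre-training run):

    # Converting a TensorFlow BERT checkpoint (produced by the Google research
    # code) into a PyTorch/transformers checkpoint. Paths are placeholders.
    from transformers import BertConfig, BertForPreTraining

    config = BertConfig.from_json_file("tf_checkpoint/bert_config.json")
    model = BertForPreTraining.from_pretrained(
        "tf_checkpoint/bert_model.ckpt.index",  # TF index file from pre-training
        from_tf=True,
        config=config,
    )
    model.save_pretrained("pytorch_bert")  # writes pytorch_model.bin + config.json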

answered Nov 27 '22 by inverted_index