 

Pretraining a language model on a small custom corpus

I was curious whether it is possible to use transfer learning for text generation, i.e. to re-train/pre-train an existing language model on a specific kind of text.

For example, having a pre-trained BERT model and a small corpus of medical (or any "type" of) text, build a language model that is able to generate medical text. The assumption is that you do not have a huge amount of medical text, which is why you have to use transfer learning.

Putting it as a pipeline, I would describe this as (a rough sketch follows the list):

  1. Using a pre-trained BERT tokenizer.
  2. Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
  3. Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
  4. Generating text that resembles the text within the small custom corpus.
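
Roughly, I picture the steps looking something like the following minimal sketch, assuming the Hugging Face transformers API (the added tokens are hypothetical and the training loop is omitted; this is an idea sketch, not a tested recipe):

    # Sketch of steps 1-3 above (Hugging Face transformers assumed; the new
    # tokens are hypothetical examples and the training loop is omitted).
    from transformers import BertTokenizer, BertForMaskedLM

    # 1. Pre-trained BERT tokenizer.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # 2. Add domain-specific tokens collected from the small corpus.
    new_tokens = ["angioplasty", "tachycardia"]  # hypothetical medical tokens
    tokenizer.add_tokens(new_tokens)

    # 3. Load vanilla BERT and grow its embedding matrix to cover the new
    #    tokens, then continue training it on the custom corpus.
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.resize_token_embeddings(len(tokenizer))

    # 4. After further pre-training, the MLM head can fill in masked slots to
    #    produce text that resembles the custom corpus.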

Does this sound familiar? Is it possible with Hugging Face?

asked Apr 24 '20 by John Sig

People also ask

What is Pretraining in NLP?

In AI, pre-training imitates the way human beings process new knowledge: the model parameters learned on previous tasks are used to initialize the model parameters for new tasks. In this way, the old knowledge helps the new model perform the new tasks from prior experience instead of from scratch.

What is model Pretraining?

What is a pre-trained model? Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use a model trained on another problem as a starting point; for example, if you want to build a self-learning car, you could start from an existing image-recognition model rather than training one from scratch.

What is a pre-trained language model?

What are pre-trained language models? The intuition behind pre-trained language models is to create a black box which understands the language and can then be asked to do any specific task in that language. The idea is to create the machine equivalent of a ‘well-read’ human being.

Why do we need pre-trained models in NLP?

A model is first trained on a large, generic dataset; then the same model is repurposed to perform different NLP tasks on a new dataset. The pre-trained model only needs fine-tuning to solve a specific problem, which saves a lot of the time and computational resources required to build a new language model from scratch.

How do I train a BERT model with masked language modeling?

There are two common ways to continue pre-training:

  1. Starting with a pre-trained BERT checkpoint and continuing the pre-training with both the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) heads (e.g. using the BertForPreTraining model).
  2. Starting with a pre-trained BERT model and continuing with the MLM objective only (e.g. using the BertForMaskedLM model, assuming we don't need NSP for the pre-training part).
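
For the second option, a minimal MLM-only sketch might look like this (assuming the Hugging Face transformers Trainer API; corpus.txt and the hyperparameters are placeholders, not a recommended configuration):

    # MLM-only continued pre-training (option 2). corpus.txt is a placeholder
    # file with one training sentence per line.
    from transformers import (BertTokenizerFast, BertForMaskedLM,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    # Tokenize the custom corpus into truncated examples.
    with open("corpus.txt") as f:
        lines = [line.strip() for line in f if line.strip()]
    encodings = tokenizer(lines, truncation=True, max_length=128)
    train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

    # The collator pads each batch and randomly masks 15% of the tokens,
    # producing the labels for the MLM loss.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-mlm-domain",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        data_collator=collator,
        train_dataset=train_dataset,
    )
    trainer.train()
    trainer.save_model("bert-mlm-domain")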

What is further pre-training in machine learning?

Further pre-training means taking an already pre-trained model and applying transfer learning: reuse the saved weights of the trained model and continue training it on data from a new domain. This is usually beneficial if you don't have a very large corpus.


1 Answer

I have not heard of the pipeline you just mentioned. In order to construct an LM for your use-case, you have basically two options:

  1. Further training the BERT (-base/-large) model on your own corpus. This process is called domain adaptation, as also described in this recent paper. It will adapt the learned parameters of the BERT model to your specific domain (bio/medical text). Nonetheless, for this setting you will need quite a large corpus to help the BERT model update its parameters well.

  2. Using a language model that is pre-trained on a large amount of domain-specific text, either from scratch or fine-tuned from the vanilla BERT model. As you might know, the vanilla BERT model released by Google was trained on Wikipedia and BookCorpus text. After vanilla BERT, researchers trained the BERT architecture on other domains besides the initial data collections. You may be able to use these pre-trained models, which have a deep understanding of domain-specific language. For your case there are models such as BioBERT, BlueBERT, and SciBERT (a loading sketch follows below).
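
For instance, loading one of these domain-specific checkpoints might look like the following sketch (the model identifiers are the Hugging Face Hub names I believe these checkpoints are published under; please verify them on the Hub):

    # Loading a domain-specific BERT variant instead of vanilla BERT.
    # The model identifiers below are assumed Hub names (verify on huggingface.co).
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_name = "dmis-lab/biobert-base-cased-v1.1"   # BioBERT
    # model_name = "allenai/scibert_scivocab_uncased" # SciBERT

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # The model can then be further pre-trained or fine-tuned on the small
    # medical corpus exactly as with vanilla BERT.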

Is it possible with Hugging Face?

I am not sure whether the Hugging Face developers have a robust approach for pre-training a BERT model on custom corpora yet, as their code for this is reportedly still in progress. If you are interested in doing this step, I suggest using Google Research's BERT code, which is written in TensorFlow and is very robust (it was released by BERT's authors). In their README, under the "Pre-training with BERT" section, the exact process is described. This will give you a TensorFlow checkpoint, which can easily be converted to a PyTorch checkpoint if you'd like to work with PyTorch/Transformers.
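
As a rough sketch of that last conversion step (assuming the transformers library's from_tf loading support; the checkpoint and config paths are placeholders from a hypothetical pre-training run):

    # Converting a TensorFlow BERT checkpoint (produced by the Google research
    # code) into a PyTorch/transformers checkpoint. Paths are placeholders.
    from transformers import BertConfig, BertForPreTraining

    config = BertConfig.from_json_file("tf_checkpoint/bert_config.json")
    model = BertForPreTraining.from_pretrained(
        "tf_checkpoint/bert_model.ckpt.index",  # TF index file from pre-training
        from_tf=True,
        config=config,
    )
    model.save_pretrained("pytorch_bert")  # writes pytorch_model.bin + config.json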

answered Nov 27 '22 by inverted_index