 

Fine-tune Bert for specific domain (unsupervised)

I want to fine-tune BERT on texts that are related to a specific domain (in my case related to engineering). The training should be unsupervised since I don't have any labels or anything. Is this possible?

asked Nov 06 '20 by spadel

People also ask

Is BERT training unsupervised?

Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.

Can BERT be used without fine-tuning?

Fine-tuning is not always necessary. Instead, the feature-based approach, where we simply extract pre-trained BERT embeddings as features, can be a viable and cheap alternative. However, it is important not to use just the final layer, but at least the last four layers, or all of them.
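For illustration, here is a minimal sketch of that feature-based approach, summing the last four hidden layers to get token embeddings. The model name, input sentence, and pooling choice are placeholders to adapt to your own task:

```python
# Sketch: extract contextual embeddings from BERT's last four hidden layers
# (feature-based approach, no fine-tuning). Model name and input text are
# placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

text = "The beam deflection exceeded the design tolerance."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (embedding layer + 12 encoder layers),
# each of shape [batch, seq_len, 768] for bert-base
hidden_states = outputs.hidden_states

# Sum the last four layers per token, as suggested above
token_embeddings = torch.stack(hidden_states[-4:]).sum(dim=0)

# Mean-pool over tokens for a single sentence vector
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```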

What happens to BERT Embeddings during fine-tuning?

Research on this finds that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing.


1 Answer

What you in fact want to do is continue pre-training BERT on text from your specific domain. In this case, you keep training the model as a masked language model, but on your domain-specific data.

You can use the run_mlm.py script from Hugging Face's Transformers library.
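Alternatively, you can do the same thing in a few lines of Python with the Trainer API. The following is only a minimal sketch of masked-language-model continued pre-training; the corpus path, model name, and hyperparameters are placeholders:

```python
# Sketch: continue pre-training BERT as a masked language model on
# domain-specific text. Corpus path, model name, and hyperparameters
# are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One plain-text sentence or paragraph per line (placeholder path)
dataset = load_dataset("text", data_files={"train": "engineering_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens on the fly, as in BERT pre-training
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-engineering-mlm",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-engineering-mlm")
```

After this step, the resulting checkpoint can be loaded like any other BERT model and fine-tuned (or used as a feature extractor) for downstream tasks in your domain.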

answered Oct 20 '22 by Jindřich