I want to fine-tune BERT on texts that are related to a specific domain (in my case related to engineering). The training should be unsupervised since I don't have any labels or anything. Is this possible?
Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
Fine-tuning is not always necessary. Instead, the feature-based approach, in which we simply extract pre-trained BERT embeddings as features, can be a viable and cheap alternative. However, it is important not to use only the final layer, but at least the last four layers, or all of them.
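As a minimal sketch of that feature-based approach, the following extracts token embeddings from the last four hidden layers of a pre-trained BERT model with the Transformers library. The model name, the example sentence, and the sum-pooling choice are illustrative assumptions, not a prescribed recipe.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

text = "The gearbox transmits torque from the motor to the drive shaft."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding layer output plus one tensor per encoder layer.
hidden_states = outputs.hidden_states            # 13 tensors of shape (1, seq_len, 768)
last_four = torch.stack(hidden_states[-4:])      # (4, 1, seq_len, 768)

# One common pooling choice: sum the last four layers for each token.
token_features = last_four.sum(dim=0)            # (1, seq_len, 768)
print(token_features.shape)
```

Concatenating the last four layers instead of summing them is another common option; which works better tends to be task-dependent.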
We instead find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing.
What you in fact want to do is continue pre-training BERT on text from your specific domain. In this case, you keep training the model with the masked language modelling (MLM) objective, but on your domain-specific data.
You can use the run_mlm.py script from Hugging Face's Transformers library.
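If you prefer to do this directly in Python rather than through the script, a minimal sketch using the Trainer API could look like the following. The corpus path (one text per line) and the hyperparameters are placeholders you would adapt to your data.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the raw domain corpus (placeholder path) and tokenize it.
dataset = load_dataset("text", data_files={"train": "engineering_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator dynamically masks 15% of the tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-engineering-mlm",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=tokenized["train"],
)
trainer.train()
trainer.save_model("bert-engineering-mlm")
```

After this continued pre-training step, you can load the saved checkpoint for downstream fine-tuning or feature extraction just like any other BERT model.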