 

Can you train a BERT model from scratch with task specific architecture?

BERT pre-training of the base model is done with a masked language modeling approach: we mask a certain percentage of tokens in a sentence and train the model to predict those masked tokens. Then, as I understand it, to do downstream tasks we add a newly initialized task-specific layer on top and fine-tune the model.
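Roughly, I mean something like this two-stage flow (a minimal sketch assuming the Hugging Face transformers library; the checkpoint name and the toy inputs are only illustrative):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Stage 1: masked language modeling pre-training. Some tokens are replaced with
# [MASK] and the model is trained to predict them. (In real pre-training, label
# positions that are not masked are set to -100 so the loss ignores them.)
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
mlm_loss = mlm_model(**inputs, labels=labels).loss

# Stage 2: downstream fine-tuning. A newly initialized classification head is
# placed on top of the pre-trained encoder and both are updated on labeled data.
clf_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
clf_inputs = tokenizer("A great movie.", return_tensors="pt")
clf_loss = clf_model(**clf_inputs, labels=torch.tensor([1])).loss
```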

However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT-base architecture from scratch, train both the additional task-specific downstream layer and the base model weights from scratch on this sentence classification dataset alone, and still achieve a good result?
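In other words, something like the following sketch (again assuming the Hugging Face transformers library), where no pre-trained checkpoint is ever loaded and the only training signal is the classification labels:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer

config = BertConfig(num_labels=2)               # bert-base-sized architecture, randomly initialized
model = BertForSequenceClassification(config)   # note: no from_pretrained() call anywhere

# Reusing the pre-built tokenizer only fixes the vocabulary; it carries no model weights.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["a great movie", "a terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**batch, labels=labels).loss       # encoder and head both learn from this signal alone
loss.backward()
optimizer.step()
```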

Thanks.

asked May 15 '20 by viopu


1 Answer

BERT can be viewed as a language encoder that is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and BookCorpus, which together sum to about 3,300M words. BERT-base has 109M model parameters. So, if you think you have a large enough dataset to train BERT, then the answer to your question is yes.
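For reference, you can check that parameter count quickly, assuming the Hugging Face transformers library (the default BertConfig corresponds to the bert-base architecture):

```python
from transformers import BertConfig, BertModel

model = BertModel(BertConfig())  # randomly initialized bert-base-sized encoder
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # roughly 109M
```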

However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.

I am wondering why you prefer to train BERT from scratch instead of fine-tuning it. Is it because you are worried about a domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.

Please note that if you want to train BERT from scratch, you may want to consider a smaller architecture (see the configuration sketch after the list below). You may find the following papers useful:

  • Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
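For example, a compact configuration could look like the following sketch (assuming the Hugging Face transformers library; the exact sizes are illustrative, in the spirit of the compact models studied in the papers above):

```python
from transformers import BertConfig, BertForSequenceClassification

small_config = BertConfig(
    hidden_size=256,          # bert-base uses 768
    num_hidden_layers=4,      # bert-base uses 12
    num_attention_heads=4,    # bert-base uses 12
    intermediate_size=1024,   # bert-base uses 3072
    num_labels=2,
)
small_model = BertForSequenceClassification(small_config)
n_params = sum(p.numel() for p in small_model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # far fewer than the 109M of bert-base
```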
answered Sep 28 '22 by Wasi Ahmad